Existing scene text spotters are designed to locate and transcribe texts from images. However, it is challenging for a spotter to achieve precise detection and recognition of scene texts simultaneously. Inspired by the glimpse-focus spotting pipeline of human beings and the impressive performance of Pre-trained Language Models (PLMs) on visual tasks, we ask: 1) "Can machines spot texts without precise detection, just like human beings?", and if yes, 2) "Is the text block another alternative granularity for scene text spotting besides word or character?" To this end, our proposed scene text spotter leverages advanced PLMs to enhance performance without fine-grained detection. Specifically, we first use a simple detector for block-level text detection to obtain rough positional information. Then, we finetune a PLM on a large-scale OCR dataset to achieve accurate recognition. Benefiting from the comprehensive language knowledge gained during pre-training, the PLM-based recognition module effectively handles complex scenarios, including multi-line, reversed, occluded, and incompletely detected texts. With the fine-tuned language model and the text-block detection paradigm, extensive experiments demonstrate the superior performance of our scene text spotter across multiple public benchmarks. Additionally, we attempt to spot texts directly from an entire scene image to demonstrate the potential of PLMs, and even Large Language Models (LLMs).
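As a rough illustration of the block-level "glimpse, then read" pipeline described above, the sketch below assumes a hypothetical `detect_text_blocks` helper that returns coarse block boxes, and uses an off-the-shelf TrOCR model purely as a stand-in for the paper's fine-tuned PLM recognizer:

```python
# Sketch only: detect_text_blocks is a hypothetical block-level detector returning rough
# (left, top, right, bottom) boxes; TrOCR stands in for the paper's fine-tuned PLM.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
recognizer = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

def spot_text_blocks(image: Image.Image, detect_text_blocks):
    results = []
    for box in detect_text_blocks(image):          # rough positions only, no fine polygons
        crop = image.crop(box)                      # "glimpse" at the block
        pixel_values = processor(images=crop, return_tensors="pt").pixel_values
        ids = recognizer.generate(pixel_values, max_new_tokens=64)
        text = processor.batch_decode(ids, skip_special_tokens=True)[0]
        results.append((box, text))                 # "focus": read the whole block at once
    return results
```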
https://arxiv.org/abs/2403.10047
This paper introduces semantic features as a general conceptual framework for fully explainable neural network layers. A well-motivated proof-of-concept model for a relevant subproblem of MNIST consists of 4 such layers with a total of 4.8K learnable parameters. The model is easily interpretable, achieves human-level adversarial test accuracy with no form of adversarial training, requires little hyperparameter tuning, and can be quickly trained on a single CPU. The general nature of the technique bears promise for a paradigm shift towards radically democratised and truly generalizable white-box neural networks. The code is available at this https URL
https://arxiv.org/abs/2403.09863
Multimodal large language models (MLLMs) have shown impressive reasoning abilities, which, however, are also more vulnerable to jailbreak attacks than their LLM predecessors. Although the pre-aligned LLMs inside MLLMs remain capable of detecting unsafe responses, we observe that their safety mechanisms can be easily bypassed due to the introduction of image features. To construct robust MLLMs, we propose ECSO (Eyes Closed, Safety On), a novel training-free protection approach that exploits the inherent safety awareness of MLLMs and generates safer responses by adaptively transforming unsafe images into texts to activate the intrinsic safety mechanism of the pre-aligned LLMs in MLLMs. Experiments on five state-of-the-art (SoTA) MLLMs demonstrate that ECSO enhances model safety significantly (e.g., a 37.6% improvement on MM-SafetyBench (SD+OCR) and 71.3% on VLSafe for LLaVA-1.5-7B), while consistently maintaining utility on common MLLM benchmarks. Furthermore, we show that ECSO can be used as a data engine to generate supervised fine-tuning (SFT) data for MLLM alignment without extra human intervention.
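A rough, training-free control flow for the "eyes closed" idea might look like the following; all three callables are hypothetical wrappers around an MLLM (e.g., LLaVA-1.5), not an official ECSO API:

```python
# Sketch of the training-free "Eyes Closed, Safety On" flow; mllm_generate, mllm_judge_unsafe
# and mllm_caption are hypothetical placeholders for the underlying MLLM calls.
def ecso_respond(image, query, mllm_generate, mllm_judge_unsafe, mllm_caption):
    # Step 1: answer normally with both image and text.
    answer = mllm_generate(image=image, text=query)

    # Step 2: let the model itself judge whether its answer is unsafe.
    if not mllm_judge_unsafe(answer):
        return answer

    # Step 3 ("eyes closed"): turn the image into a query-aware caption, drop the image,
    # and regenerate text-only so the pre-aligned LLM's safety mechanism is active.
    caption = mllm_caption(image=image, text=query)
    safe_prompt = f"Context: {caption}\nQuestion: {query}"
    return mllm_generate(image=None, text=safe_prompt)
```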
https://arxiv.org/abs/2403.09572
Scene-Text Visual Question Answering (ST-VQA) aims to understand scene text in images and answer questions related to the text content. Most existing methods heavily rely on the accuracy of Optical Character Recognition (OCR) systems, and aggressive fine-tuning based on limited spatial location information and erroneous OCR text information often leads to inevitable overfitting. In this paper, we propose a multimodal adversarial training architecture with spatial awareness capabilities. Specifically, we introduce an Adversarial OCR Enhancement (AOE) module, which leverages adversarial training in the embedding space of the OCR modality to enhance the fault-tolerant representation of OCR texts, thereby reducing the noise caused by OCR errors. Simultaneously, we add a Spatial-Aware Self-Attention (SASA) mechanism to help the model better capture the spatial relationships among OCR tokens. Extensive experiments demonstrate that our method achieves significant performance improvements on both the ST-VQA and TextVQA datasets and provides a novel paradigm for multimodal adversarial training.
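The abstract describes the AOE module only at a high level; one common way to realize adversarial training in an embedding space is an FGSM-style perturbation of the OCR token embeddings, sketched below as an illustration rather than the paper's exact formulation:

```python
import torch

def perturb_ocr_embeddings(ocr_emb, labels, forward_fn, loss_fn, epsilon=1e-2):
    """FGSM-style adversarial example in the OCR embedding space (illustrative only)."""
    ocr_emb = ocr_emb.clone().detach().requires_grad_(True)
    loss = loss_fn(forward_fn(ocr_emb), labels)
    loss.backward()
    adv_emb = ocr_emb + epsilon * ocr_emb.grad.sign()   # step along the gradient sign
    return adv_emb.detach()

# Training would then minimize the loss on both clean and adversarial embeddings,
# encouraging fault-tolerant OCR representations.
```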
https://arxiv.org/abs/2403.09288
Chinese Spell Checking (CSC) is a widely used technology that plays a vital role in speech-to-text (STT) and optical character recognition (OCR). Most existing CSC approaches rely on the BERT architecture and achieve excellent performance. However, limited by the scale of the foundation model, BERT-based methods do not work well in few-shot scenarios, showing certain limitations in practical applications. In this paper, we explore an in-context learning method named RS-LLM (Rich Semantic based LLMs) that introduces large language models (LLMs) as the foundation model. Besides, we study the impact of introducing various kinds of Chinese rich semantic information into our framework. We found that by introducing a small number of specific Chinese rich semantic structures, LLMs achieve better performance than the BERT-based model on the few-shot CSC task. Furthermore, we conduct experiments on multiple datasets, and the experimental results verify the superiority of our proposed framework.
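The abstract does not spell out which rich semantic structures are injected; a hedged sketch of a few-shot CSC prompt that attaches a semantic gloss to each example could look like the following, where pinyin or radical annotations are illustrative guesses rather than the paper's actual choice:

```python
# Illustrative only: the exact "rich semantic structures" used by RS-LLM are not given in the
# abstract; annotate() is a placeholder that might return pinyin, radicals, etc.
def build_csc_prompt(examples, sentence, annotate):
    """examples: list of (wrong, corrected) pairs; annotate(s) returns a semantic gloss."""
    parts = ["Correct the spelling errors in the Chinese sentence."]
    for wrong, corrected in examples:
        parts.append(f"Input: {wrong}\nSemantics: {annotate(wrong)}\nOutput: {corrected}")
    parts.append(f"Input: {sentence}\nSemantics: {annotate(sentence)}\nOutput:")
    return "\n\n".join(parts)
```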
https://arxiv.org/abs/2403.08492
Industrial projects rely heavily on lengthy, complex specification documents, making the tedious manual extraction of structured information a major bottleneck. This paper introduces an approach to automate this process, leveraging the capabilities of two cutting-edge AI models: Donut, a model that extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a robust large language model. The proposed methodology first acquires the tables of contents (ToCs) from construction specification documents and then structures the ToC text into JSON data. Remarkable accuracy is achieved, with Donut reaching 85% and GPT-3.5 Turbo reaching 89% in effectively organizing the ToCs. This represents a significant step forward in document indexing, demonstrating the potential of AI to automate information extraction tasks across diverse document types, boosting efficiency and freeing critical resources in various industries.
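A minimal sketch of the GPT-3.5 Turbo half of this pipeline, assuming the raw ToC text has already been obtained (by Donut or otherwise); the prompt wording and output schema below are assumptions for illustration, not the paper's exact ones:

```python
# Sketch: structure raw table-of-contents text into JSON with GPT-3.5 Turbo.
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def toc_to_json(toc_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You convert construction-spec tables of contents into JSON."},
            {"role": "user",
             "content": "Return a JSON list of {\"section\": str, \"title\": str, \"page\": int} "
                        "entries for this table of contents:\n" + toc_text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```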
https://arxiv.org/abs/2403.07553
Scene text recognition is an important and challenging task in computer vision. However, most prior works focus on recognizing pre-defined words, while various out-of-vocabulary (OOV) words appear in real-world applications. In this paper, we propose a novel open-vocabulary text recognition framework, Pseudo-OCR, to recognize OOV words. The key challenge in this task is the lack of OOV training data. To solve this problem, we first propose a pseudo label generation module that leverages character detection and image inpainting to produce substantial pseudo OOV training data from real-world images. Unlike previous synthetic data, our pseudo OOV data contains real characters and backgrounds to simulate real-world applications. Secondly, to reduce noise in the pseudo data, we present a semantic checking mechanism to filter for semantically meaningful data. Thirdly, we introduce a quality-aware margin loss to boost training with pseudo data. Our loss includes a margin-based part to enhance the classification ability and a quality-aware part to penalize low-quality samples in both real and pseudo data. Extensive experiments demonstrate that our approach outperforms the state of the art on eight datasets and achieves first rank in the ICDAR2022 challenge.
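The abstract gives only the ingredients of the loss (a margin-based classification term plus quality-aware handling of noisy samples); one plausible reading, not the paper's exact formulation, is an additive-margin cross-entropy weighted by a per-sample quality score:

```python
import torch
import torch.nn.functional as F

def quality_aware_margin_loss(logits, targets, quality, margin=0.35):
    """Illustrative guess at the loss shape: margin on the target logit, quality weighting.

    logits:  (N, C) classification scores per character
    targets: (N,)   ground-truth classes
    quality: (N,)   scores in [0, 1]; low values mark likely-noisy (e.g., pseudo) samples
    """
    logits = logits.clone()
    idx = torch.arange(logits.size(0), device=logits.device)
    logits[idx, targets] -= margin                     # margin-based part
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (quality * per_sample).mean()               # quality-aware down-weighting
```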
https://arxiv.org/abs/2403.07518
The rise of large language models (LLMs) and instruction tuning has led to the current trend of instruction-tuned large language and vision models (LLVMs). This trend involves either meticulously curating numerous instruction tuning datasets tailored to specific objectives or enlarging LLVMs to manage vast amounts of vision-language (VL) data. However, current LLVMs have disregarded the detailed and comprehensive real-world scene understanding available from specialized computer vision (CV) models for visual perception tasks such as segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR). Instead, existing LLVMs rely mainly on the large capacity and emergent capabilities of their LLM backbones. Therefore, we present a new LLVM, Mixture of All Intelligence (MoAI), which leverages auxiliary visual information obtained from the outputs of external segmentation, detection, SGG, and OCR models. MoAI operates through two newly introduced modules: MoAI-Compressor and MoAI-Mixer. After verbalizing the outputs of the external CV models, the MoAI-Compressor aligns and condenses them to efficiently use relevant auxiliary visual information for VL tasks. MoAI-Mixer then blends three types of intelligence: (1) visual features, (2) auxiliary features from the external CV models, and (3) language features, by utilizing the concept of Mixture of Experts. Through this integration, MoAI significantly outperforms both open-source and closed-source LLVMs on numerous zero-shot VL tasks, particularly those related to real-world scene understanding such as object existence, positions, relations, and OCR, without enlarging the model size or curating extra visual instruction tuning datasets.
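A toy sketch of the mixing idea, blending the three feature types with a learned softmax gate in the spirit of Mixture of Experts; the dimensions and gating details are assumptions rather than MoAI-Mixer's actual design:

```python
import torch
import torch.nn as nn

class ThreeWayMixer(nn.Module):
    """Toy MoE-style blend of visual, auxiliary-CV, and language features (illustrative)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)   # one weight per "intelligence"

    def forward(self, visual, auxiliary, language):
        # visual / auxiliary / language: (batch, dim), already projected to a common size
        stacked = torch.stack([visual, auxiliary, language], dim=1)      # (B, 3, dim)
        weights = torch.softmax(self.gate(stacked.flatten(1)), dim=-1)   # (B, 3)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)              # (B, dim)
```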
https://arxiv.org/abs/2403.07508
Medical image classification requires labeled, task-specific datasets, which are used to train deep learning networks de novo or to fine-tune foundation models. However, this process is computationally and technically demanding. In language processing, in-context learning provides an alternative, where models learn from within prompts, bypassing the need for parameter updates. Yet, in-context learning remains underexplored in medical image analysis. Here, we systematically evaluate Generative Pretrained Transformer 4 with Vision capabilities (GPT-4V) on cancer image processing with in-context learning on three cancer histopathology tasks of high importance: classification of tissue subtypes in colorectal cancer, colon polyp subtyping, and breast tumor detection in lymph node sections. Our results show that in-context learning is sufficient to match or even outperform specialized neural networks trained for particular tasks, while requiring only a minimal number of samples. In summary, this study demonstrates that large vision-language models trained on non-domain-specific data can be applied out of the box to solve medical image-processing tasks in histopathology. This democratizes access to generalist AI models for medical experts without a technical background, especially in areas where annotated data are scarce.
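A sketch of how an in-context prompt for a GPT-4V-style endpoint might be assembled from a handful of labeled example patches plus a query patch; the message layout follows the public vision API format, while the task wording, labels, and model name are illustrative assumptions:

```python
# Sketch: few-shot histopathology classification prompt for a GPT-4V-style endpoint.
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def classify_with_icl(examples, query_path, labels):
    content = [{"type": "text",
                "text": f"Classify each tissue patch as one of: {', '.join(labels)}."}]
    for path, label in examples:                          # in-context (image, label) pairs
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{encode_image(path)}"}})
        content.append({"type": "text", "text": f"Label: {label}"})
    content.append({"type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encode_image(query_path)}"}})
    content.append({"type": "text", "text": "Label:"})
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```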
https://arxiv.org/abs/2403.07407
This paper unravels the potential of sketches for diffusion models, addressing the deceptive promise of direct sketch control in generative AI. Importantly, we democratise the process, enabling amateur sketches to generate precise images, living up to the commitment of "what you sketch is what you get". A pilot study underscores the necessity, revealing that deformities in existing models stem from spatial-conditioning. To rectify this, we propose an abstraction-aware framework, utilising a sketch adapter, adaptive time-step sampling, and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model, working synergistically to reinforce fine-grained sketch-photo association. Our approach operates seamlessly during inference without the need for textual prompts; a simple, rough sketch akin to what you and I can create suffices! We welcome everyone to examine results presented in the paper and its supplementary. Contributions include democratising sketch control, introducing an abstraction-aware framework, and leveraging discriminative guidance, validated through extensive experiments.
https://arxiv.org/abs/2403.07234
In recent years, there has been a notable advancement in the integration of healthcare and technology, particularly evident in the field of medical image analysis. This paper introduces a pioneering approach in dermatology, presenting a robust method for the detection of hair and scalp diseases using state-of-the-art deep learning techniques. Our methodology relies on Convolutional Neural Networks (CNNs), well-known for their efficacy in image recognition, to meticulously analyze images for various dermatological conditions affecting the hair and scalp. Our proposed system represents a significant advancement in dermatological diagnostics, offering a non-invasive and highly efficient means of early detection and diagnosis. By leveraging the capabilities of CNNs, our model holds the potential to revolutionize dermatology, providing accessible and timely healthcare solutions. Furthermore, the seamless integration of our trained model into a web-based platform developed with the Django framework ensures broad accessibility and usability, democratizing advanced medical diagnostics. The integration of machine learning algorithms into web applications marks a pivotal moment in healthcare delivery, promising empowerment for both healthcare providers and patients. Through the synergy between technology and healthcare, our paper outlines the meticulous methodology, technical intricacies, and promising future prospects of our system. With a steadfast commitment to advancing healthcare frontiers, our goal is to significantly contribute to leveraging technology for improved healthcare outcomes globally. This endeavor underscores the profound impact of technological innovation in shaping the future of healthcare delivery and patient care, highlighting the transformative potential of our approach.
https://arxiv.org/abs/2403.07940
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-world vision and language understanding applications. Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios including web screenshots, PDFs, OCR, charts, and knowledge-based content, aiming for a comprehensive representation of practical contexts. Further, we create a use case taxonomy from real user scenarios and construct an instruction tuning dataset accordingly. The fine-tuning with this dataset substantially improves the model's user experience in practical applications. Considering efficiency and the demands of most real-world scenarios, DeepSeek-VL incorporates a hybrid vision encoder that efficiently processes high-resolution images (1024 x 1024), while maintaining a relatively low computational overhead. This design choice ensures the model's ability to capture critical semantic and detailed information across various visual tasks. We posit that a proficient Vision-Language Model should, foremost, possess strong language abilities. To ensure the preservation of LLM capabilities during pretraining, we investigate an effective VL pretraining strategy by integrating LLM training from the beginning and carefully managing the competitive dynamics observed between vision and language modalities. The DeepSeek-VL family (both 1.3B and 7B models) showcases superior user experiences as a vision-language chatbot in real-world applications, achieving state-of-the-art or competitive performance across a wide range of visual-language benchmarks at the same model size while maintaining robust performance on language-centric benchmarks. We have made both 1.3B and 7B models publicly accessible to foster innovations based on this foundation model.
https://arxiv.org/abs/2403.05525
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis. Our approach introduces enhancements across several dimensions: by adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; we hypothesize that images may contain redundant tokens, and by using similarity to filter them out and retain only the significant tokens, we can not only streamline the token length but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability and minimize hallucinations. Additionally, TextMonkey can be finetuned to comprehend commands for clicking screenshots. Overall, our method notably boosts performance across various benchmark datasets, achieving gains of 5.2%, 6.9%, and 2.8% in Scene Text-Centric VQA, Document-Oriented VQA, and KIE, respectively, and notably a score of 561 on OCRBench, surpassing prior open-sourced large multimodal models for document understanding. Code will be released at this https URL.
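A rough sketch of the redundancy-filtering idea: greedily keep image tokens whose cosine similarity to already-kept tokens stays below a threshold. The threshold and greedy order are assumptions for illustration, not TextMonkey's exact algorithm:

```python
import torch
import torch.nn.functional as F

def filter_redundant_tokens(tokens: torch.Tensor, threshold: float = 0.8) -> torch.Tensor:
    """tokens: (N, d) image tokens; returns indices of tokens kept (illustrative heuristic)."""
    normed = F.normalize(tokens, dim=-1)
    kept = [0]                                       # always keep the first token
    for i in range(1, tokens.size(0)):
        sims = normed[i] @ normed[kept].t()          # similarity to all kept tokens
        if sims.max() < threshold:                   # sufficiently novel -> keep as significant
            kept.append(i)
    return torch.tensor(kept)
```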
https://arxiv.org/abs/2403.04473
The language diversity in India's education sector poses a significant challenge, hindering inclusivity. Despite the democratization of knowledge through online educational content, the dominance of English as the internet's lingua franca limits accessibility, emphasizing the crucial need for translation into Indian languages. Despite existing Speech-to-Speech Machine Translation (SSMT) technologies, the lack of intonation in these systems yields monotonous translations, leading to a loss of audience interest and disengagement from the content. To address this, our paper introduces a dataset with stress annotations in Indian English, as well as a Text-to-Speech (TTS) architecture capable of incorporating stress into synthesized speech. This dataset is used to train a stress detection model, which is then used in the SSMT system to detect stress in the source speech and transfer it to the target-language speech. The TTS architecture is based on FastPitch and can modify its predicted variances based on the given stressed words. We present an Indian English-to-Hindi SSMT system that can transfer stress and aims to enhance the overall quality and engagement of educational content.
https://arxiv.org/abs/2403.04178
This work explores a closure task in comics, a medium where visual and textual elements are intricately intertwined. Specifically, Text-cloze refers to the task of selecting the correct text to use in a comic panel, given its neighboring panels. Traditional methods based on recurrent neural networks have struggled with this task due to limited OCR accuracy and inherent model limitations. We introduce a novel Multimodal Large Language Model (Multimodal-LLM) architecture, specifically designed for Text-cloze, achieving a 10% improvement over existing state-of-the-art models in both its easy and hard variants. Central to our approach is a Domain-Adapted ResNet-50 based visual encoder, fine-tuned to the comics domain in a self-supervised manner using SimCLR. This encoder delivers comparable results to more complex models with just one-fifth of the parameters. Additionally, we release new OCR annotations for this dataset, enhancing model input quality and resulting in another 1% improvement. Finally, we extend the task to a generative format, establishing new baselines and expanding the research possibilities in the field of comics analysis.
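For the self-supervised domain-adaptation step, the standard SimCLR objective (NT-Xent) over two augmented views of each comic panel looks like the following; this is the generic loss, with no comics-specific details assumed:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Standard SimCLR NT-Xent loss; z1, z2 are (N, d) projections of two views per panel."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)          # (2N, d)
    sim = z @ z.t() / temperature                               # (2N, 2N) cosine similarities
    sim.fill_diagonal_(float("-inf"))                           # mask self-similarity
    # positives: view i pairs with view i+N, and view i+N pairs with view i
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)
```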
https://arxiv.org/abs/2403.03719
Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. On the multilingual medical benchmark, the released Apollo models, at various relatively small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. In particular, Apollo-7B is the state-of-the-art multilingual medical LLM among models of up to 70B parameters. Additionally, these lite models can be used to improve the multilingual medical capabilities of larger models without fine-tuning, in a proxy-tuning fashion. We will open-source the training corpora, code, model weights, and evaluation benchmark.
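The proxy-tuning recipe referenced above combines logits at decoding time: the large base model's next-token logits are shifted by the difference between a small tuned expert and its untuned counterpart. A minimal sketch, with model names and any scaling factor left as assumptions:

```python
import torch

def proxy_tuned_logits(large_base_logits: torch.Tensor,
                       small_tuned_logits: torch.Tensor,
                       small_base_logits: torch.Tensor) -> torch.Tensor:
    """Shift the large model's distribution by the small expert's 'medical delta'."""
    # All tensors share the same vocabulary dimension.
    return large_base_logits + (small_tuned_logits - small_base_logits)

# At each decoding step, sample from softmax(proxy_tuned_logits(...)) instead of the large
# model's own distribution -- no fine-tuning of the large model is required.
```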
https://arxiv.org/abs/2403.03640
Real-time flood forecasting plays a crucial role in enabling timely and effective emergency responses. However, a significant challenge lies in bridging the gap between complex numerical flood models and practical decision-making. Decision-makers often rely on experts to interpret these models for optimizing flood mitigation strategies, and the public needs complex techniques to inquire about and understand socio-cultural and institutional factors, which often hinders their understanding of flood risks. To overcome these challenges, our study introduces an innovative solution: a customized AI assistant powered by the GPT-4 Large Language Model. This AI assistant is designed to facilitate effective communication between decision-makers, the general public, and flood forecasters, without requiring specialized knowledge. The framework utilizes GPT-4's advanced natural language understanding and function-calling capabilities to provide immediate flood alerts and respond to various flood-related inquiries. Our developed prototype integrates real-time flood warnings with flood maps and social vulnerability data. It also effectively translates complex flood zone information into actionable risk management advice. To assess its performance, we evaluated the prototype using six criteria within three main categories: relevance, error resilience, and understanding of context. Our research marks a significant step towards a more accessible and user-friendly approach to flood risk management. This study highlights the potential of advanced AI tools like GPT-4 in democratizing information and enhancing public engagement in critical social and environmental issues.
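A sketch of how the function-calling piece might be wired up: a hypothetical `get_flood_warning` tool exposed to GPT-4. The tool name, parameter schema, and backing service are illustrative assumptions, not the paper's implementation:

```python
# Illustrative only: tool name/schema are hypothetical, not the prototype's actual interface.
from openai import OpenAI

client = OpenAI()

flood_tool = {
    "type": "function",
    "function": {
        "name": "get_flood_warning",
        "description": "Return the current flood warning level and advice for a location.",
        "parameters": {
            "type": "object",
            "properties": {
                "latitude": {"type": "number"},
                "longitude": {"type": "number"},
            },
            "required": ["latitude", "longitude"],
        },
    },
}

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Is my neighborhood at risk of flooding today?"}],
    tools=[flood_tool],          # GPT-4 decides whether to call the tool
)
```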
https://arxiv.org/abs/2403.03188
Reframing a negative into a positive thought is at the crux of several cognitive approaches to mental health and psychotherapy that could be made more accessible by large language model-based solutions. Such reframing is typically non-trivial and requires multiple rationalization steps to uncover the underlying issue of a negative thought and transform it to be more positive. However, this rationalization process is currently neglected by both datasets and models, which reframe thoughts in one step. In this work, we address this gap by augmenting open-source datasets for positive text rewriting with synthetically generated Socratic rationales using a novel framework called SocraticReframe. SocraticReframe uses a sequence of question-answer pairs to rationalize the thought rewriting process. We show that such Socratic rationales significantly improve positive text rewriting for different open-source LLMs, according to both automatic and human evaluations guided by criteria from psychotherapy research.
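A rough sketch of how a Socratic question-answer rationale could be synthesized and prepended to the rewriting target; the question wording is an illustrative assumption rather than the framework's exact prompts:

```python
# Sketch: generate a question-answer rationale chain before the final reframe.
# generate(prompt) is a placeholder for any instruction-following LLM call.
SOCRATIC_QUESTIONS = [            # illustrative questions, not the paper's exact set
    "What situation triggered this thought?",
    "What underlying belief makes it feel true?",
    "What evidence points the other way?",
]

def socratic_reframe(negative_thought: str, generate) -> str:
    rationale = []
    for question in SOCRATIC_QUESTIONS:
        answer = generate(f"Thought: {negative_thought}\nQuestion: {question}\nAnswer:")
        rationale.append(f"Q: {question}\nA: {answer}")
    prompt = (f"Thought: {negative_thought}\n" + "\n".join(rationale) +
              "\nRewrite the thought more positively:")
    return generate(prompt)
```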
https://arxiv.org/abs/2403.03029
Academic documents are packed with texts, equations, tables, and figures, requiring comprehensive understanding for accurate Optical Character Recognition (OCR). While end-to-end OCR methods offer improved accuracy over layout-based approaches, they often grapple with significant repetition issues, especially with complex layouts in Out-Of-Domain (OOD) documents. To tackle this issue, we propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression. We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables, and mathematical symbols. LOCR adeptly handles various formatting elements and generates content in Markdown language. It outperforms all existing methods on our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR, and F-measure. LOCR also reduces the repetition frequency from 4.4% of pages to 0.5% in the arXiv dataset, from 13.2% to 1.3% in OOD quantum physics documents, and from 8.1% to 1.8% in OOD marketing documents. Additionally, LOCR features an interactive OCR mode, facilitating the generation of complex documents through a few location prompts from humans.
https://arxiv.org/abs/2403.02127
Document-level Relation Extraction (DocRE) aims to identify relation labels between entities within a single document. It requires handling several sentences and reasoning over them. State-of-the-art DocRE methods use a graph structure to connect entities across the document to capture dependency syntax information. However, this is insufficient to fully exploit the rich syntax information in the document. In this work, we propose to fuse constituency and dependency syntax into DocRE. The method uses constituency syntax to aggregate whole-sentence information and select instructive sentences for the target entity pairs. It exploits dependency syntax in a graph structure with constituency syntax enhancement and chooses the path between entity pairs based on the dependency graph. Experimental results on datasets from various domains demonstrate the effectiveness of the proposed method. The code is publicly available at this url.
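The dependency-graph half of the idea can be illustrated with a standard shortest-dependency-path recipe (spaCy plus networkx); this shows only the generic dependency side, not the paper's constituency-enhanced construction:

```python
# Generic shortest dependency path between two entity head tokens (illustrative of the
# dependency-graph side only; the paper's constituency enhancement is not shown).
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")

def dependency_path(text: str, head1: str, head2: str):
    doc = nlp(text)
    edges = [(tok.i, child.i) for tok in doc for child in tok.children]
    graph = nx.Graph(edges)
    idx = {tok.text: tok.i for tok in doc}
    path = nx.shortest_path(graph, source=idx[head1], target=idx[head2])
    return [doc[i].text for i in path]

print(dependency_path("Marie Curie discovered polonium in Paris.", "Curie", "polonium"))
```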
https://arxiv.org/abs/2403.01886