Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. This augmentation allows for more detailed analysis of user behaviors across different geographical regions and temporal dimensions. Finally, because it captures a broad range of use cases, we demonstrate the dataset's potential utility in fine-tuning instruction-following models. WildChat is released at this https URL under AI2 ImpACT Licenses.
https://arxiv.org/abs/2405.01470
This paper introduces UQA, a novel dataset for question answering and text comprehension in Urdu, a low-resource language with over 70 million native speakers. UQA is generated by translating the Stanford Question Answering Dataset (SQuAD2.0), a large-scale English QA dataset, using a technique called EATS (Enclose to Anchor, Translate, Seek), which preserves the answer spans in the translated context paragraphs. The paper describes the process of selecting and evaluating the best translation model between two candidates: Google Translator and Seamless M4T. The paper also benchmarks several state-of-the-art multilingual QA models on UQA, including mBERT, XLM-RoBERTa, and mT5, and reports promising results: XLM-RoBERTa-XL achieves an F1 score of 85.99 and an EM score of 74.56. UQA is a valuable resource for developing and testing multilingual NLP systems for Urdu and for enhancing the cross-lingual transferability of existing models. Further, the paper demonstrates the effectiveness of EATS for creating high-quality datasets for other languages and domains. The UQA dataset and the code are publicly available at this http URL.
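A minimal sketch of the EATS idea as stated above: enclose the answer span in anchor markers, translate the whole passage, then seek the markers to recover the span in the translated context. The anchor characters and the `translate` callable (a stand-in for Google Translator or Seamless M4T) are illustrative assumptions, not the paper's implementation.

```python
import re

ANCHOR_OPEN, ANCHOR_CLOSE = "«", "»"  # assumed markers that survive translation

def eats_translate(context: str, start: int, end: int, translate):
    """Return (translated_context, new_start, new_end), or None if the span is lost."""
    anchored = (context[:start] + ANCHOR_OPEN + context[start:end]
                + ANCHOR_CLOSE + context[end:])
    translated = translate(anchored)                      # e.g., Seamless M4T
    m = re.search(f"{ANCHOR_OPEN}(.*?){ANCHOR_CLOSE}", translated, re.DOTALL)
    if m is None:                                         # anchors lost: drop example
        return None
    clean = translated.replace(ANCHOR_OPEN, "").replace(ANCHOR_CLOSE, "")
    return clean, m.start(), m.start() + len(m.group(1))
```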
https://arxiv.org/abs/2405.01458
Bridging the significant gap between large language models' English and non-English performance presents a great challenge. While some previous studies attempt to mitigate this gap with translated training data, the recently proposed question alignment approach leverages the model's English expertise to improve multilingual performance with minimal usage of expensive, error-prone translation. In this paper, we explore how broadly this method can be applied by examining its effects on reasoning with executable code and reasoning with common sense. We also explore how to apply this approach efficiently to extremely large language models using proxy-tuning. Experimental results on the multilingual reasoning benchmarks mGSM, mSVAMP, and xCSQA demonstrate that the question alignment approach can be used to boost multilingual performance across diverse reasoning scenarios, model families, and sizes. For instance, when applied to the LLaMA2 models, our method brings an average accuracy improvement of 12.2% on mGSM, even for the 70B model. To understand the mechanism of its success, we analyze the representation space, chain-of-thought outputs, and translation data scales, which reveals how question translation training strengthens language alignment within LLMs and shapes their working patterns.
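The proxy-tuning step mentioned above can be pictured as logit arithmetic: steer a large base model with the offset between a small question-aligned expert and its untuned counterpart. A hedged sketch, assuming Hugging Face-style models that share one tokenizer; `alpha` is our knob, not a value from the paper:

```python
import torch

@torch.no_grad()
def proxy_tuned_logits(large_base, small_expert, small_base, input_ids, alpha=1.0):
    """Next-token logits of the large model, shifted by the small models' delta."""
    base = large_base(input_ids).logits[:, -1, :]
    expert = small_expert(input_ids).logits[:, -1, :]    # question-aligned small model
    anti = small_base(input_ids).logits[:, -1, :]        # untuned small counterpart
    return base + alpha * (expert - anti)                # softmax -> sampling dist.
```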
https://arxiv.org/abs/2405.01345
Numerous studies have shown that existing Face Recognition Systems (FRS), including commercial ones, often exhibit biases toward certain ethnicities due to under-represented data. In this work, we explore ethnicity alteration and skin tone modification using synthetic face image generation methods to increase the diversity of datasets. We conduct a detailed analysis by first constructing a balanced face image dataset representing three ethnicities: Asian, Black, and Indian. We then make use of existing Generative Adversarial Network-based (GAN) image-to-image translation and manifold learning models to alter the ethnicity from one to another. A systematic analysis is further conducted to assess the suitability of such datasets for FRS by studying the realism of skin-tone representation using the Individual Typology Angle (ITA). Further, we analyze the quality characteristics using existing face image quality assessment (FIQA) approaches. We then provide a holistic FRS performance analysis using four different systems. Our findings pave the way for future research in (i) developing both specific-ethnicity and general (any-to-any) ethnicity alteration models, (ii) expanding such approaches to create databases with diverse skin tones, and (iii) creating datasets representing various ethnicities, which can further help mitigate bias while addressing privacy concerns.
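For reference, the Individual Typology Angle used above has a standard closed form in CIELAB space; the band thresholds in the comment are the conventional approximate ones, not values from this paper:

```python
import math

def ita_degrees(L: float, b: float) -> float:
    """ITA = arctan((L* - 50) / b*) in degrees, computed from CIELAB L* and b*."""
    return math.degrees(math.atan2(L - 50.0, b))

# Approximate conventional bands: >55 very light, 41-55 light, 28-41 intermediate,
# 10-28 tan, -30-10 brown, <-30 dark.
```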
https://arxiv.org/abs/2405.01273
In various real-world scenarios, interactions among agents often resemble the dynamics of general-sum games, where each agent strives to optimize its own utility. Despite the ubiquitous relevance of such settings, decentralized machine learning algorithms have struggled to find equilibria that maximize individual utility while preserving social welfare. In this paper we introduce Learning with Opponent Q-Learning Awareness (LOQA), a novel, decentralized reinforcement learning algorithm tailored to optimizing an agent's individual utility while fostering cooperation among adversaries in partially competitive environments. LOQA assumes the opponent samples actions proportionally to their action-value function Q. Experimental results demonstrate the effectiveness of LOQA at achieving state-of-the-art performance in benchmark scenarios such as the Iterated Prisoner's Dilemma and the Coin Game. LOQA achieves these outcomes with a significantly reduced computational footprint, making it a promising approach for practical multi-agent applications.
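The opponent model in LOQA reduces to a simple normalization of the opponent's Q estimates. A sketch, assuming non-negative Q values (a softmax over Q is the usual smooth alternative otherwise):

```python
import torch

def opponent_policy(q_values: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """q_values: [batch, n_actions] opponent action-values -> action distribution
    with probabilities proportional to Q, as LOQA assumes."""
    q = q_values.clamp_min(0.0) + eps   # proportional sampling needs non-negative Q
    return q / q.sum(dim=-1, keepdim=True)
```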
https://arxiv.org/abs/2405.01035
Image Quality Assessment (IQA) is essential in various Computer Vision tasks such as image deblurring and super-resolution. However, most IQA methods require reference images, which are not always available. While there are some reference-free IQA metrics, they have limitations in simulating human perception and discerning subtle image quality variations. We hypothesize that the JPEG quality factor is representative of image quality, and that a well-trained neural network can learn to accurately evaluate image quality without requiring a clean reference, as it can recognize image degradation artifacts based on prior knowledge. Thus, we developed a reference-free quality evaluation network, dubbed the "Quality Factor (QF) Predictor". Our QF Predictor is a lightweight, fully convolutional network comprising seven layers. The model is trained in a self-supervised manner: it receives a JPEG-compressed image patch with a random QF as input and is trained to accurately predict the corresponding QF. We demonstrate the versatility of the model by applying it to various tasks. First, our QF Predictor can generalize to measure the severity of various image artifacts, such as Gaussian blur and Gaussian noise. Second, we show that the QF Predictor can be trained to predict the undersampling rate of images reconstructed from Magnetic Resonance Imaging (MRI) data.
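The self-supervised recipe is compact enough to sketch end to end: compress a clean patch at a random quality factor and regress that factor. The seven-layer network is abstracted as `model`; the L1 loss and the QF range are our assumptions:

```python
import io, random
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision.transforms.functional import to_tensor

def jpeg_compress(img: Image.Image, qf: int) -> Image.Image:
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=qf)   # synthesize the degradation
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def training_step(model, optimizer, clean_patch: Image.Image) -> float:
    qf = random.randint(1, 100)                # random self-supervision target
    x = to_tensor(jpeg_compress(clean_patch, qf)).unsqueeze(0)
    pred = model(x).mean()                     # pool the FCN output to a scalar QF
    loss = F.l1_loss(pred, torch.tensor(float(qf)))
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```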
https://arxiv.org/abs/2405.02208
Vision language models (VLMs) have recently emerged and gained the spotlight for their ability to comprehend the dual modality of image and textual data. VLMs such as LLaVA, ChatGPT-4, and Gemini have recently shown impressive performance on tasks such as natural image captioning, visual question answering (VQA), and spatial reasoning. Additionally, a universal segmentation model by Meta AI, the Segment Anything Model (SAM), shows unprecedented performance at isolating objects from unforeseen images. Since medical experts, biologists, and materials scientists routinely examine microscopy or medical images in conjunction with textual information in the form of captions, literature, or reports, and draw conclusions of great importance and merit, it is indubitably essential to test the performance of VLMs and foundation models such as SAM on these images. In this study, we charge ChatGPT, LLaVA, Gemini, and SAM with classification, segmentation, counting, and VQA tasks on a variety of microscopy images. We observe that ChatGPT and Gemini are impressively able to comprehend the visual features in microscopy images, while SAM is quite capable of isolating artefacts in a general sense. However, the performance is not close to that of a domain expert: the models are readily encumbered by the impurities, defects, artefact overlaps, and diversity present in the images.
https://arxiv.org/abs/2405.00876
Conventional image quality metrics (IQMs), such as PSNR and SSIM, are designed for perceptually uniform gamma-encoded pixel values and cannot be directly applied to perceptually non-uniform linear high-dynamic-range (HDR) colors. Similarly, most of the available datasets consist of standard-dynamic-range (SDR) images collected in standard and possibly uncontrolled viewing conditions. Popular pre-trained neural networks are likewise intended for SDR inputs, restricting their direct application to HDR content. On the other hand, training HDR models from scratch is challenging due to limited available HDR data. In this work, we explore more effective approaches for training deep learning-based models for image quality assessment (IQA) on HDR data. We leverage networks pre-trained on SDR data (source domain) and re-target these models to HDR (target domain) with additional fine-tuning and domain adaptation. We validate our methods on the available HDR IQA datasets, demonstrating that models trained with our combined recipe outperform previous baselines, converge much quicker, and reliably generalize to HDR inputs.
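A hedged sketch of the re-targeting recipe: bring linear HDR values into a roughly perceptually uniform range, then fine-tune an SDR-pretrained backbone at a small learning rate. The log encoding below is a simple placeholder for a PU-style transform, and the backbone choice is ours, not the paper's:

```python
import torch
import torchvision

def encode_hdr(linear: torch.Tensor, peak_nits: float = 4000.0) -> torch.Tensor:
    """Map linear HDR values in [0, 1] to a perceptually flatter [0, 1] range.
    Assumption: log encoding standing in for a perceptual-uniform transform."""
    return torch.log1p(linear * peak_nits) / torch.log1p(torch.tensor(peak_nits))

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")   # SDR source domain
backbone.fc = torch.nn.Linear(backbone.fc.in_features, 1)         # quality-score head
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-5)     # gentle fine-tuning
```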
https://arxiv.org/abs/2405.00670
Visual Question Answering (VQA) has emerged as a highly engaging field in recent years, attracting increasing research efforts aiming to enhance VQA accuracy through the deployment of advanced models such as Transformers. Despite this growing interest, there has been limited exploration into the comparative analysis and impact of textual modalities within VQA, particularly in terms of model complexity and its effect on performance. In this work, we conduct a comprehensive comparison between complex textual models that leverage long dependency mechanisms and simpler models focusing on local textual features within a well-established VQA framework. Our findings reveal that employing complex textual encoders is not invariably the optimal approach for the VQA-v2 dataset. Motivated by this insight, we introduce an improved model, ConvGRU, which incorporates convolutional layers to enhance the representation of question text. Tested on the VQA-v2 dataset, ConvGRU achieves better performance without substantially increasing parameter complexity.
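The design choice motivated above, local textual features instead of long-dependency encoders, can be pictured with a small question encoder in which 1D convolutions capture n-gram features before a GRU summarizes the sequence. Sizes below are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class ConvGRUTextEncoder(nn.Module):
    def __init__(self, vocab_size=10000, emb=300, channels=512, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=3, padding=1)  # local n-grams
        self.gru = nn.GRU(channels, hidden, batch_first=True)

    def forward(self, tokens):                   # tokens: [batch, seq_len]
        x = self.embed(tokens).transpose(1, 2)   # [batch, emb, seq_len] for Conv1d
        x = torch.relu(self.conv(x)).transpose(1, 2)
        _, h = self.gru(x)                       # final hidden state
        return h.squeeze(0)                      # question representation
```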
https://arxiv.org/abs/2405.00479
Learning never ends, and there is no age limit to growing yourself. However, the educational landscape may face challenges in effectively catering to students' inclusion and diverse learning needs. These students should have access to state-of-the-art methods for lecture delivery, online resources, and technology needs. However, with all the diverse learning sources, it becomes harder for students to comprehend a large amount of knowledge in a short period of time. Traditional assistive technologies and learning aids often lack the dynamic adaptability required for individualized education plans. Large Language Models (LLMs) have been used in language translation, text summarization, and content generation applications. With the rapid growth of AI over the past years, AI-powered chatbots and virtual assistants have been developed. This research aims to bridge this gap by introducing an innovative study buddy we will call 'SAMCares'. The system leverages a Large Language Model (LLM) (in our case, LLaMa-2 70B as the base model) and Retrieval-Augmented Generation (RAG) to offer real-time, context-aware, and adaptive educational support. The context of the model will be limited to the knowledge base of Sam Houston State University (SHSU) course notes. The LLM component enables a chat-like environment for students to interact with to meet their unique learning requirements. For this, we will build a custom web-based GUI. At the same time, RAG enhances real-time information retrieval and text generation, in turn providing more accurate and context-specific assistance. An option to upload additional study materials in the web GUI is included in case additional knowledge support is required. The system's efficacy will be evaluated through controlled trials and iterative feedback mechanisms.
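A minimal sketch of the retrieval loop described above: embed the student's question, pull the most similar course-note chunks by cosine similarity, and prepend them to the LLM prompt. `embed` and `llm_generate` are hypothetical stand-ins for the embedding model and the LLaMa-2 endpoint:

```python
import numpy as np

def answer(question: str, note_chunks: list[str], embed, llm_generate, k: int = 3) -> str:
    vecs = np.stack([embed(c) for c in note_chunks])
    q = embed(question)
    sims = vecs @ q / (np.linalg.norm(vecs, axis=1) * np.linalg.norm(q) + 1e-8)
    context = "\n\n".join(note_chunks[i] for i in np.argsort(-sims)[:k])  # top-k notes
    prompt = f"Answer using only the course notes below.\n\n{context}\n\nQ: {question}\nA:"
    return llm_generate(prompt)
```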
https://arxiv.org/abs/2405.00330
Teaching programming in early childhood (ages 4-9) to enhance computational thinking has gained popularity in the recent movement of computer science for all. However, current practices ignore some fundamental issues arising from young children's developmental readiness, such as the sustained capability required for keyboarding, the decomposition of complex tasks into small tasks, the need for intuitive mapping from abstract programming to tangible outcomes, and the limited amount of screen time exposure. To address these issues, in this paper we present a novel methodology with an AI-powered integration platform to effectively teach computational thinking to young children. The system features a hybrid pedagogy that supports both top-down and bottom-up approaches to teaching computational thinking. Young children can describe their desired task in natural language, while the system responds with an easy-to-understand program consisting of the right level of decomposed sub-tasks. A tangible robot can immediately execute the decomposed program and demonstrate the program's outcomes to young children. The system is equipped with an intelligent chatbot that can interact with young children through natural language, and children can speak to the chatbot to complete all the needed programming tasks, while the chatbot orchestrates the execution of the program on the robot. This completely eliminates the need for young children to use keyboards to program. By developing such a system, we aim to make the concept of computational thinking more accessible to young children, fostering a natural understanding of programming concepts without the need for explicit programming skills. Through the interactive experience provided by the robotic agent, our system seeks to engage children in an effective manner, contributing to the field of educational technology for early childhood computer science education.
https://arxiv.org/abs/2405.00750
Unlike traditional educational chatbots that rely on pre-programmed responses, large-language-model-driven chatbots, such as ChatGPT, demonstrate remarkable versatility and have the potential to serve as a dynamic resource for addressing student needs, from understanding advanced concepts to solving complex problems. This work explores the impact of such technology on student learning in an interdisciplinary, project-oriented data visualization course. Throughout the semester, students engaged with ChatGPT across four distinct projects, designing data visualizations and implementing them using a variety of tools, including Tableau, D3, and Vega-Lite. We collected conversation logs and reflection surveys from the students after each assignment. In addition, we conducted interviews with selected students to gain deeper insights into their overall experiences with ChatGPT. Our analysis examined the advantages and barriers of using ChatGPT, students' querying behavior, the types of assistance sought, and its impact on assignment outcomes and engagement. Based on the findings, we discuss design considerations for an educational solution that goes beyond the basic interface of ChatGPT, specifically tailored for data visualization education.
https://arxiv.org/abs/2405.00748
In this study, we formulate an OCR-free sequence generation model for visual document understanding (VDU). Our model not only parses text from document images but also extracts the spatial coordinates of the text based on a multi-head architecture. Named the Coordinate-aware End-to-end Document Parser (CREPE), our method uniquely integrates these capabilities by introducing a special token for OCR text and token-triggered coordinate decoding. We also propose a weakly supervised framework for cost-efficient training, requiring only parsing annotations without high-cost coordinate annotations. Our experimental evaluations demonstrate CREPE's state-of-the-art performance on document parsing tasks. Beyond that, CREPE's adaptability is further highlighted by its successful usage in other document understanding tasks such as layout analysis, document visual question answering, and so on. CREPE's abilities, including OCR and semantic parsing, not only mitigate error propagation issues in existing OCR-dependent methods but also significantly enhance the functionality of sequence generation models, ushering in a new era for document understanding studies.
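Token-triggered coordinate decoding can be pictured as a small head that fires only at positions where the decoder emits the special OCR token. The head layout and names below are our assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class CoordinateHead(nn.Module):
    """Predict a normalized box (x1, y1, x2, y2) wherever the OCR token appears."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.GELU(),
                                 nn.Linear(hidden, 4))

    def forward(self, hidden_states, token_ids, ocr_token_id):
        mask = token_ids == ocr_token_id          # positions of the special token
        return torch.sigmoid(self.mlp(hidden_states[mask]))
```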
https://arxiv.org/abs/2405.00260
Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; and 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original image and the image reconstructed by a text-to-image model from the caption; 3) a human study on Amazon Mechanical Turk; and 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
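The three-step pipeline reads naturally as glue code; `captioners`, `llm`, `detect_objects`, and `vqa` below are hypothetical stand-ins for the open-source components the pipeline composes:

```python
def visual_fact_checker(image, captioners, llm, detect_objects, vqa, instruction):
    # 1) proposal: several captioning models draft initial captions
    proposals = [c(image) for c in captioners]
    # 2) verification: an LLM cross-checks the drafts with detection/VQA tools
    objects = detect_objects(image)
    checks = [vqa(image, f"Is this caption accurate: {p}") for p in proposals]
    # 3) captioning: an LLM summarizes proposals + verification per the instruction
    prompt = (f"Proposals: {proposals}\nDetected objects: {objects}\n"
              f"Fact checks: {checks}\nWrite one caption. Style: {instruction}")
    return llm(prompt)
```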
https://arxiv.org/abs/2404.19752
We introduce an intuitive method to test the robustness (stability and explainability) of any black-box LLM in real-time, based upon the local deviation from harmonicity, denoted as $\gamma$. To the best of our knowledge this is the first completely model-agnostic and unsupervised method of measuring the robustness of any given response from an LLM, based upon the model itself conforming to a purely mathematical standard. We conduct human annotation experiments to show the positive correlation of $\gamma$ with false or misleading answers, and demonstrate that following the gradient of $\gamma$ in stochastic gradient ascent efficiently exposes adversarial prompts. Measuring $\gamma$ across thousands of queries in popular LLMs (GPT-4, ChatGPT, Claude-2.1, Mixtral-8x7B, Smaug-72B, Llama2-7B, and MPT-7B) allows us to estimate the likelihood of wrong or hallucinatory answers automatically and quantitatively rank the reliability of these models in various objective domains (Web QA, TruthfulQA, and Programming QA). Across all models and domains tested, human ratings confirm that $\gamma \to 0$ indicates trustworthiness, and the low-$\gamma$ leaders among these models are GPT-4, ChatGPT, and Smaug-72B.
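Harmonic functions obey the mean-value property, so one natural estimator of local deviation from harmonicity compares a model's output at an input with its average output over small perturbations. The sketch below is our reading of that idea, not the paper's exact estimator; `f` maps an input embedding to an output vector (e.g., logits):

```python
import numpy as np

def gamma(f, x: np.ndarray, eps: float = 1e-2, n: int = 16, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n, x.size))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)    # random unit directions
    neighborhood = np.mean([f(x + eps * d) for d in dirs], axis=0)
    return float(np.linalg.norm(f(x) - neighborhood))      # 0 for a harmonic f
```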
https://arxiv.org/abs/2404.19708
In this paper, we demonstrate how Large Language Models (LLMs) can effectively learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question. Given the performance of IR systems, the optimal strategy for question answering does not always entail external information retrieval; rather, it often involves leveraging the parametric memory of the LLM itself. Prior research has identified this phenomenon in the PopQA dataset, wherein the most popular questions are effectively addressed using the LLM's parametric memory, while less popular ones require IR system usage. Following this, we propose a tailored training approach for LLMs, leveraging existing open-domain question answering datasets. Here, LLMs are trained to generate a special token, <RET>, when they do not know the answer to a question. Our evaluation of the Adaptive Retrieval LLM (Adapt-LLM) on the PopQA dataset showcases improvements over the same LLM under three configurations: (i) retrieving information for all questions, (ii) always using the parametric memory of the LLM, and (iii) using a popularity threshold to decide when to use a retriever. Through our analysis, we demonstrate that Adapt-LLM is able to generate the <RET> token when it determines that it does not know how to answer a question, indicating the need for IR, while it achieves notably high accuracy levels when it chooses to rely only on its parametric memory.
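At inference time the learned behavior reduces to a simple branch; `generate` and `retrieve` are hypothetical stand-ins for the LLM and the off-the-shelf IR system:

```python
def adapt_llm_answer(question: str, generate, retrieve) -> str:
    first = generate(f"Question: {question}\nAnswer:")
    if first.strip().startswith("<RET>"):        # model signals it does not know
        passage = retrieve(question)             # fall back to the IR system
        return generate(f"Context: {passage}\nQuestion: {question}\nAnswer:")
    return first                                 # parametric-memory answer
```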
https://arxiv.org/abs/2404.19705
3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting that learns from only 3D scene and QA pairs, where prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: the first is that language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; the second is that we can query large language models to distill such constraints from language properties. We show that LARC improves the performance of prior works in naturally supervised 3D visual grounding, and demonstrates a wide range of 3D visual reasoning capabilities, from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision.
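As a toy illustration of a language-derived constraint acting as regularization: if an LLM reports that "left of" and "right of" are converse relations, the concept learner can be penalized whenever its two relation scores disagree. The loss form is our illustration, not the paper's objective:

```python
import torch

def converse_constraint_loss(left_scores: torch.Tensor, right_scores: torch.Tensor):
    """left_scores[i, j]: score that object i is left of object j; the converse
    constraint says this should match right_scores[j, i]."""
    return torch.mean((left_scores - right_scores.transpose(0, 1)) ** 2)
```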
https://arxiv.org/abs/2404.19696
Image quality assessment often relies on raw opinion scores provided by subjects in subjective experiments, which can be noisy and unreliable. To address this issue, postprocessing procedures such as ITU-R BT.500, ITU-T P.910, and ITU-T P.913 have been standardized to clean up the original opinion scores. These methods use annotator-based statistical priors, but they do not take into account extensive information about the image itself, which limits their performance in less annotated scenarios. Generally speaking, image quality datasets usually contain similar scenes or distortions, so subjects inevitably compare images in order to assign reasonable scores. Therefore, in this paper, we propose a subjective image quality score preprocessing method, Perceptual Similarity Subjective Preprocessing (PSP), which exploits the perceptual similarity between images to alleviate subjective bias in less annotated scenarios. Specifically, we model subjective scoring as a conditional probability model based on perceptual similarity with previously scored images, called subconscious reference scoring. The reference images are stored in a neighbor dictionary, which is obtained by a normalized vector dot-product based nearest-neighbor search over the images' perceptual deep features. The preprocessed score is then updated by the exponential moving average (EMA) of the subconscious reference scoring, called similarity-regularized EMA. Our experiments on multiple datasets (LIVE, TID2013, CID2013) show that this method can effectively remove the bias in subjective scores. Additionally, experiments show that the preprocessed datasets considerably improve the performance of downstream IQA tasks.
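The two mechanisms can be sketched directly: a neighbor dictionary built by normalized dot-product (cosine) search over deep features, and an EMA that pulls a raw score toward the subconscious reference score of its nearest already-scored neighbors. The exact update rule is our reading of the abstract:

```python
import numpy as np

def nearest_neighbors(feats: np.ndarray, query: np.ndarray, k: int = 5) -> np.ndarray:
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return np.argsort(-(f @ q))[:k]               # most perceptually similar images

def similarity_regularized_ema(raw_score: float, reference_scores: np.ndarray,
                               beta: float = 0.8) -> float:
    reference = float(np.mean(reference_scores))  # subconscious reference score
    return beta * raw_score + (1.0 - beta) * reference
```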
https://arxiv.org/abs/2404.19666
In this paper, we propose a highly efficient method to estimate an image's mean opinion score (MOS) from a single opinion score (SOS). Assuming that each SOS is an observed sample of a normal distribution and the MOS is its unknown expectation, MOS inference is formulated as a maximum likelihood estimation problem, where the perceptual correlation of pairwise images is considered in modeling the likelihood of the SOS. More specifically, by means of the quality-aware representations learned from a self-supervised backbone, we introduce a learnable relative quality measure to predict the MOS difference between two images. Then, the current image's maximum likelihood estimate of the MOS is represented by the sum of another reference image's estimated MOS and their relative quality. Ideally, no matter which image is selected as the reference, the MOS of the current image should remain unchanged, which is termed perceptual constancy constrained calibration (PC3). Finally, we alternately optimize the relative quality measure's parameters and the current image's estimated MOS via backpropagation and Newton's method, respectively. Experiments show that the proposed method is efficient in calibrating biased SOSs and significantly improves IQA model learning when only SOSs are available.
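The estimation step has a compact form: each reference image yields one estimate of the current image's MOS, namely its own estimated MOS plus the learned relative quality, and PC3 demands these estimates agree. Under equal noise variances the maximum-likelihood combination reduces to their mean, which is the simplification sketched here:

```python
import numpy as np

def estimate_mos(ref_mos: np.ndarray, predicted_diff: np.ndarray) -> float:
    """ref_mos[j]: estimated MOS of reference j; predicted_diff[j]: learned relative
    quality between the current image and reference j."""
    return float(np.mean(ref_mos + predicted_diff))
```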
https://arxiv.org/abs/2404.19595
Despite great success in modeling visual perception, deep neural network based image quality assessment (IQA) still remains unreliable in real-world applications due to its vulnerability to adversarial perturbations and its inexplicit black-box structure. In this paper, we propose to build a trustworthy IQA model via Causal Perception inspired Representation Learning (CPRL), along with a score reflection attack method for IQA models. More specifically, we assume that each image is composed of a Causal Perception Representation (CPR) and a non-causal perception representation (N-CPR). The CPR serves as the causation of the subjective quality label and is invariant to imperceptible adversarial perturbations. Conversely, the N-CPR presents spurious associations with the subjective quality label, which may change significantly under adversarial perturbations. To extract the CPR from each input image, we develop a soft-ranking-based channel-wise activation function to mediate the causally sufficient (beneficial for high prediction accuracy) and necessary (beneficial for high robustness) deep features, and, based on intervention, employ a minimax game for optimization. Experiments on four benchmark databases show that the proposed CPRL method outperforms many state-of-the-art adversarial defense methods and provides explicit model interpretation.
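A heavily hedged sketch of a channel-wise gate in the spirit of the soft-ranking activation: each channel is scaled by a sigmoid of its standardized importance score, a differentiable stand-in that lets causal (CPR) channels pass while suppressing spurious (N-CPR) ones:

```python
import torch
import torch.nn as nn

class SoftChannelGate(nn.Module):
    """Differentiable stand-in for a soft-ranking channel-wise activation."""
    def __init__(self, channels: int, tau: float = 0.1):
        super().__init__()
        self.score = nn.Parameter(torch.zeros(channels))  # learned channel importance
        self.tau = tau

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: [B, C, H, W]
        z = (self.score - self.score.mean()) / (self.score.std() + 1e-6)
        return x * torch.sigmoid(z / self.tau).view(1, -1, 1, 1)
```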
https://arxiv.org/abs/2404.19567