Coral reefs are among the most diverse ecosystems on our planet, and are depended on by hundreds of millions of people. Unfortunately, most coral reefs are existentially threatened by global climate change and local anthropogenic pressures. To better understand the dynamics underlying deterioration of reefs, monitoring at high spatial and temporal resolution is key. However, conventional monitoring methods for quantifying coral cover and species abundance are limited in scale due to the extensive manual labor required. Although computer vision tools have been employed to aid in this process, in particular SfM photogrammetry for 3D mapping and deep neural networks for image segmentation, analysis of the data products creates a bottleneck, effectively limiting their scalability. This paper presents a new paradigm for mapping underwater environments from ego-motion video, unifying 3D mapping systems that use machine learning to adapt to challenging conditions under water, combined with a modern approach for semantic segmentation of images. The method is exemplified on coral reefs in the northern Gulf of Aqaba, Red Sea, demonstrating high-precision 3D semantic mapping at unprecedented scale with significantly reduced required labor costs: a 100 m video transect acquired within 5 minutes of diving with a cheap consumer-grade camera can be fully automatically analyzed within 5 minutes. Our approach significantly scales up coral reef monitoring by taking a leap towards fully automatic analysis of video transects. The method democratizes coral reef transects by reducing the labor, equipment, logistics, and computing cost. This can help to inform conservation policies more efficiently. The underlying computational method of learning-based Structure-from-Motion has broad implications for fast low-cost mapping of underwater environments other than coral reefs.
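Coral reefs are among the most diverse ecosystems on Earth, and hundreds of millions of people depend on them. Unfortunately, most coral reefs are seriously threatened by global climate change and by pressure from local human activity. To better understand the dynamics of reef deterioration, monitoring at high spatial and temporal resolution is key. However, conventional methods for quantifying coral cover and species abundance are limited in scale because they require a great deal of manual labor. Computer vision tools have been used to assist this process, in particular SfM photogrammetry and deep neural networks for image segmentation, but analysis of the resulting data creates a bottleneck that effectively limits their scalability. This paper proposes a new approach to mapping underwater environments from ego-motion video, unifying 3D mapping systems that use machine learning to adapt to challenging underwater conditions with a modern approach to semantic segmentation of images. The method is demonstrated on coral reefs in the northern Gulf of Aqaba, Red Sea, showing high-precision 3D semantic mapping at unprecedented scale with significantly reduced labor costs: a 100 m video transect recorded within 5 minutes of diving with an inexpensive consumer-grade camera can be analyzed fully automatically within 5 minutes. Our approach significantly scales up coral reef monitoring by taking a large step towards fully automatic analysis of video transects. By reducing labor, equipment, logistics, and computing costs, the method democratizes coral reef transects, which can help inform conservation policy more efficiently. The underlying learning-based Structure-from-Motion method also has broad implications for fast, low-cost mapping of underwater environments other than coral reefs.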
https://arxiv.org/abs/2309.12804
Large Language Models (LLMs), acting as powerful reasoners and generators, exhibit extraordinary performance across various natural language tasks, such as question answering (QA). Among these tasks, Multi-Hop Question Answering (MHQA) stands as a widely discussed category, necessitating seamless integration between LLMs and the retrieval of external knowledge. Existing methods employ an LLM to generate reasoning paths and plans, and use an Information Retriever (IR) to iteratively retrieve related knowledge, but these approaches have inherent flaws. On the one hand, the IR is hindered by the low quality of the queries generated by the LLM. On the other hand, the LLM is easily misguided by the irrelevant knowledge returned by the IR. These inaccuracies, accumulated through the iterative interaction between IR and LLM, ultimately cause effectiveness to collapse. To overcome these barriers, we propose a novel pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), comprising an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest Reasoning operates by masking the previous reasoning path and generated queries from the LLM, encouraging the LLM to generate its chain of thought from scratch in each iteration. This approach enables the LLM to break the shackles of previous misleading thoughts and queries (if any). 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a group of candidate plans proposed by the LLM. Our methods are evaluated on three highly recognized public multi-hop question answering datasets and outperform the state of the art on most metrics (achieving a 10%-12% improvement in answer accuracy).
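Large language models (LLMs), as powerful reasoning and generation tools, show extraordinary performance across natural language tasks such as question answering (QA). Among these tasks, multi-hop question answering (MHQA) is a widely discussed category that requires seamless integration of LLMs with external knowledge retrieval. Existing methods use an LLM to generate reasoning paths and plans and use an information retriever (IR) to iteratively retrieve related knowledge, but these methods have inherent flaws. On the one hand, the IR is limited by the low-quality queries the LLM generates. On the other hand, the LLM is easily misled by irrelevant knowledge returned by the IR. These errors accumulate through the iterative interaction between IR and LLM and ultimately cause a collapse in effectiveness. To overcome these obstacles, this paper proposes a new pipeline for MHQA called Furthest-Reasoning-with-Plan-Assessment (FuRePA), consisting of an improved framework (Furthest Reasoning) and an attached module (Plan Assessor). 1) Furthest Reasoning masks the previous reasoning path and generated queries from the LLM, encouraging it to generate its chain of thought from scratch in each iteration. This lets the LLM break free of the constraints built by earlier misleading thoughts and queries, if any. 2) The Plan Assessor is a trained evaluator that selects an appropriate plan from a set of candidate plans proposed by the LLM. Our method is evaluated on three well-recognized public multi-hop QA datasets and outperforms the state of the art on most metrics (a 10%-12% improvement in answer accuracy).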
https://arxiv.org/abs/2309.12767
Barren plateaus are a central bottleneck in the scalability of variational quantum algorithms (VQAs), and are known to arise in various ways, from circuit depth and hardware noise to global observables. However, a caveat of most existing results is the requirement of t-design circuit assumptions that are typically not satisfied in practice. In this work, we loosen these assumptions altogether and derive tight upper and lower bounds on gradient concentration, for a large class of parameterized quantum circuits and arbitrary observables. By requiring only a couple of design choices that are constructive and easily verified, our results can readily be leveraged to rule out barren plateaus for explicit circuits and mixed observables, namely, observables containing a non-vanishing local term. This insight has direct implications for hybrid Quantum Generative Adversarial Networks (qGANs), a generative model that can be reformulated as a VQA with an observable composed of local and global terms. We prove that designing the discriminator appropriately leads to 1-local weights that stay constant in the number of qubits, regardless of discriminator depth. Combined with our first contribution, this implies that qGANs with shallow generators can be trained at scale without suffering from barren plateaus -- making them a promising candidate for applications in generative quantum machine learning. We demonstrate this result by training a qGAN to learn a 2D mixture of Gaussian distributions with up to 16 qubits, and provide numerical evidence that global contributions to the gradient, while initially exponentially small, may kick in substantially over the course of training.
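Barren plateaus are a central bottleneck for the scalability of variational quantum algorithms (VQAs), and they are known to arise in different ways, from circuit depth and hardware noise to global observables. However, a drawback of most existing results is that they require t-design circuit assumptions, which are usually not satisfied in practice. In this work, we relax these assumptions and derive tight upper and lower bounds on gradient concentration for a large class of parameterized quantum circuits and arbitrary observables. With only a few design choices that are constructive and easy to verify, our results can be used to rule out barren plateaus for explicit circuits and mixed observables, that is, observables containing a non-vanishing local term. This insight applies directly to hybrid Quantum Generative Adversarial Networks (qGANs), a generative model that can be reformulated as a VQA whose observable is composed of local and global terms. We prove that designing the discriminator appropriately leads to 1-local weights that stay constant in the number of qubits, regardless of discriminator depth. Combined with our first contribution, this means that qGANs with shallow generators can be trained at scale without suffering from barren plateaus, making them a promising candidate for applications in generative quantum machine learning. We train a qGAN to learn a 2D mixture of Gaussian distributions with up to 16 qubits, and provide numerical evidence that global contributions to the gradient, although exponentially small at first, may grow substantially over the course of training.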
https://arxiv.org/abs/2309.12681
Bin packing is a well-known NP-hard problem in the domain of artificial intelligence, posing significant challenges in finding efficient solutions. At the same time, recent advancements in quantum technologies have shown promising potential for achieving substantial computational speedup, particularly in certain problem classes, such as combinatorial optimization. In this study, we introduce QAL-BP, a novel Quadratic Unconstrained Binary Optimization (QUBO) formulation designed specifically for bin packing and suitable for quantum computation. QAL-BP utilizes the augmented Lagrangian method to incorporate the bin packing constraints into the objective function while also facilitating an analytical estimation of heuristic, but empirically robust, penalty multipliers. This approach leads to a more versatile and generalizable model that eliminates the need for empirically calculating instance-dependent Lagrangian coefficients, a requirement commonly encountered in alternative QUBO formulations for similar problems. To assess the effectiveness of our proposed approach, we conduct experiments on a set of bin-packing instances using a real Quantum Annealing device. Additionally, we compare the results with those obtained from two different classical solvers, namely simulated annealing and Gurobi. The experimental findings not only confirm the correctness of the proposed formulation but also demonstrate the potential of quantum computation in effectively solving the bin-packing problem, particularly as more reliable quantum technology becomes available.
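Bin packing is a well-known NP-hard problem in artificial intelligence, and finding efficient solutions is very challenging. Meanwhile, recent advances in quantum technology suggest that substantial computational speedups are achievable, particularly for certain problem classes such as combinatorial optimization. In this study, we introduce QAL-BP, a new Quadratic Unconstrained Binary Optimization (QUBO) formulation designed specifically for bin packing and suitable for quantum computation. QAL-BP uses the augmented Lagrangian method to fold the bin packing constraints into the objective function, and also allows an analytical estimate of heuristic but empirically robust penalty multipliers. This leads to a more flexible and generalizable model that removes the need to compute instance-dependent Lagrangian coefficients empirically, a requirement that often appears in alternative QUBO formulations for similar problems. To assess the effectiveness of the proposed formulation, we run experiments on a set of bin-packing instances using a real quantum annealing device. We also compare the results with two different classical solvers, simulated annealing and Gurobi. The experimental results not only confirm the correctness of the proposed formulation but also show the potential of quantum computation for solving the bin-packing problem effectively, especially as more reliable quantum technology becomes available.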
https://arxiv.org/abs/2309.12678
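The QAL-BP entry above describes folding the bin packing constraints into the objective function. The sketch below illustrates only that general idea with a plain penalty term, evaluated by brute force on a tiny instance. It is not the QAL-BP QUBO and it does not use an augmented Lagrangian; the function names, the instance, and the penalty value are all assumptions made for illustration.

```python
from itertools import product
from typing import List, Tuple


def penalized_cost(assign: Tuple[int, ...], weights: List[int],
                   capacity: int, n_bins: int, penalty: float) -> float:
    """Number of bins used plus `penalty` times the total capacity overflow.

    assign[i] is the bin index of item i, so every item sits in exactly
    one bin by construction; only the capacity constraint is penalized.
    """
    loads = [0] * n_bins
    for item, bin_index in enumerate(assign):
        loads[bin_index] += weights[item]
    bins_used = sum(1 for load in loads if load > 0)
    overflow = sum(max(0, load - capacity) for load in loads)
    return bins_used + penalty * overflow


def brute_force(weights: List[int], capacity: int, n_bins: int,
                penalty: float) -> Tuple[float, Tuple[int, ...]]:
    """Try every assignment of items to bins and keep the cheapest one."""
    candidates = product(range(n_bins), repeat=len(weights))
    return min(
        (penalized_cost(a, weights, capacity, n_bins, penalty), a)
        for a in candidates
    )


# Hypothetical instance: four items, bins of capacity 6, at most three bins.
best_cost, best_assign = brute_force([4, 3, 3, 2], 6, 3, penalty=10.0)
print(best_cost, best_assign)
```

With a penalty large enough that any overflow costs more than opening another bin, the cheapest assignment is feasible. Choosing such multipliers without per-instance tuning is the part the abstract says QAL-BP handles analytically.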
Answering numerical questions over hybrid content drawn from given tables and text (TextTableQA) is a challenging task. Recently, Large Language Models (LLMs) have gained significant attention in the NLP community. With the emergence of large language models, In-Context Learning and Chain-of-Thought prompting have become two particularly popular research topics in this field. In this paper, we introduce a new prompting strategy called Hybrid prompt strategy and Retrieval of Thought for TextTableQA. Through In-Context Learning, we prompt the model to develop the ability of retrieval thinking when dealing with hybrid data. In the few-shot setting, our method achieves superior performance compared to the fully-supervised SOTA on the MultiHiertt dataset.
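Answering numerical questions over hybrid content drawn from given tables and text (TextTableQA) is a challenging task. Recently, large language models (LLMs) have attracted wide attention in the natural language processing community. With the emergence of large language models, in-context learning and chain-of-thought prompting have become two of the most popular research topics in the field. In this paper, we introduce a new prompting strategy for TextTableQA called Hybrid prompt strategy and Retrieval of Thought. Through in-context learning, we prompt the model to develop the ability to think in terms of retrieval when handling hybrid data. In the few-shot setting, our method performs better than the fully supervised state of the art on the MultiHiertt dataset.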
https://arxiv.org/abs/2309.12669
Generative Artificial Intelligence is set to revolutionize healthcare delivery by transforming traditional patient care into a more personalized, efficient, and proactive process. Chatbots, serving as interactive conversational models, will probably drive this patient-centered transformation in healthcare. Through the provision of various services, including diagnosis, personalized lifestyle recommendations, and mental health support, the objective is to substantially improve patient health outcomes while reducing the workload of healthcare providers. The life-critical nature of healthcare applications necessitates establishing a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for various generic large language models (LLMs) demonstrate a lack of comprehension regarding medical and health concepts and their significance in promoting patients' well-being. Moreover, these metrics neglect pivotal user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that are specifically applicable to the assessment of interactive conversational models in healthcare. Subsequently, we present a comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from an end-user perspective. These metrics encompass an evaluation of language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we discuss the challenges associated with defining and implementing these metrics, with particular emphasis on confounding factors such as the target audience, evaluation methods, and prompt techniques involved in the evaluation process.
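Generative artificial intelligence is set to revolutionize healthcare delivery by turning traditional patient care into a more personalized, efficient, and proactive process. Chatbots, as interactive conversational models, will probably drive this patient-centered transformation of healthcare. By providing a range of services, including diagnosis, personalized lifestyle recommendations, and mental health support, the goal is to substantially improve patient health outcomes while easing the workload of healthcare providers. The life-critical nature of healthcare applications makes it necessary to establish a unified and comprehensive set of evaluation metrics for conversational models. Existing evaluation metrics proposed for generic large language models (LLMs) show insufficient understanding of medical and health concepts and of their importance in promoting patients' well-being. In addition, these metrics overlook key user-centered aspects, including trust-building, ethics, personalization, empathy, user comprehension, and emotional support. The purpose of this paper is to explore state-of-the-art LLM-based evaluation metrics that apply to the assessment of interactive conversational models in healthcare. We then present a comprehensive set of evaluation metrics designed to thoroughly assess the performance of healthcare chatbots from the end user's perspective. These metrics cover language processing abilities, impact on real-world clinical tasks, and effectiveness in user-interactive conversations. Finally, we discuss the challenges of defining and implementing these metrics, with particular attention to confounding factors such as the target audience, evaluation methods, and the prompt techniques involved in the evaluation process.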
https://arxiv.org/abs/2309.12444
Large Language Models (LLMs) have demonstrated impressive zero-shot performance on a wide range of NLP tasks, showing the ability to reason and apply commonsense. A relevant application is to use them for creating high-quality synthetic datasets for downstream tasks. In this work, we probe whether GPT-4 can be used to augment existing extractive reading comprehension datasets. Automating data annotation processes has the potential to save the large amounts of time, money, and effort that go into manually labelling datasets. In this paper, we evaluate the performance of GPT-4 as a replacement for human annotators for low-resource reading comprehension tasks, by comparing performance after fine-tuning and the cost associated with annotation. This work serves as the first analysis of LLMs as synthetic data augmenters for QA systems, highlighting the unique opportunities and challenges. Additionally, we release augmented versions of low-resource datasets, which will allow the research community to create further benchmarks for evaluation of generated datasets.
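Large language models (LLMs) have shown impressive zero-shot performance across many natural language processing tasks, indicating that they can reason and apply common sense. One relevant application is using them to create high-quality synthetic datasets for downstream tasks. In this work, we explore whether GPT-4 can be used to augment existing extractive reading comprehension datasets. Automating the data annotation process has the potential to save a great deal of the time, money, and effort spent on manually labelling datasets. In this paper, we evaluate GPT-4 as a replacement for human annotators on low-resource reading comprehension tasks by comparing performance after fine-tuning and the cost associated with annotation. This work is the first analysis of LLMs as synthetic data augmenters for QA systems, and it highlights the unique opportunities and challenges. In addition, we release augmented versions of low-resource datasets, which will allow the research community to create further benchmarks for evaluating generated datasets.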
https://arxiv.org/abs/2309.12426
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training with a context length of 8192 requires 16x the self-attention computation of a context length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization. LongLoRA demonstrates strong empirical results on various tasks on LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.
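We propose LongLoRA, an efficient fine-tuning method that extends the context size of pre-trained large language models (LLMs) at limited computational cost. Training LLMs with long context sizes is usually computationally expensive and requires large amounts of training time and GPU resources. For example, training with a context length of 8192 requires 16 times the self-attention computation of a context length of 2048. In this paper, we speed up the context extension of LLMs in two ways. On the one hand, although dense global attention is needed at inference time, the model can be fine-tuned effectively and efficiently with sparse local attention. The proposed shift short attention effectively enables context extension and yields substantial computation savings, with performance similar to fine-tuning with vanilla attention. In particular, it needs only two lines of code in training and is optional at inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context extension. Notably, we find that LoRA works well for context extension provided the embedding and normalization layers are trainable. LongLoRA shows strong empirical results on a variety of tasks with LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from a 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends a model's context while keeping its original architecture, and is compatible with most existing techniques, such as FlashAttention-2. In addition, to make LongLoRA practical, we collected a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.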
https://arxiv.org/abs/2309.12307
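The 16x figure in the LongLoRA entry above follows from self-attention cost growing with the square of the context length. A minimal check of that arithmetic, assuming purely quadratic scaling and ignoring every other cost term (the function name is made up for this note):

```python
def attention_cost_ratio(long_context: int, short_context: int) -> float:
    """Relative self-attention cost of two context lengths.

    Assumes cost is proportional to the square of the context length,
    which holds for dense attention over all token pairs.
    """
    return (long_context / short_context) ** 2


ratio = attention_cost_ratio(8192, 2048)
print(ratio)  # 16.0
```

The context is 4 times longer, so the number of token pairs, and with it the attention cost, is 4 squared = 16 times larger.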
Large language models (LLMs) have pushed the limits of natural language understanding and exhibited excellent problem-solving ability. Despite this success, most existing open-source LLMs (e.g., LLaMA-2) are still far from satisfactory at solving mathematical problems because of the complex reasoning procedures involved. To bridge this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we start by bootstrapping mathematical questions by rewriting the question from multiple perspectives without extra knowledge, which results in a new dataset called MetaMathQA. Then we fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular benchmarks (i.e., GSM8K and MATH) for mathematical reasoning demonstrate that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model achieves 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. In particular, MetaMath-70B achieves an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, the MetaMath models with different model sizes, and the training code for public use.
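Large language models (LLMs) have pushed the limits of natural language understanding and shown excellent problem-solving ability. Despite this great success, most existing open-source LLMs (e.g., LLaMA-2) remain unsatisfactory at solving mathematical problems, because these involve complex reasoning procedures. To close this gap, we propose MetaMath, a fine-tuned language model that specializes in mathematical reasoning. Specifically, we begin by bootstrapping mathematical questions, rewriting each question from multiple perspectives without extra knowledge, which produces a new dataset called MetaMathQA. We then fine-tune the LLaMA-2 models on MetaMathQA. Experimental results on two popular mathematical reasoning benchmarks (i.e., GSM8K and MATH) show that MetaMath outperforms a suite of open-source LLMs by a significant margin. Our MetaMath-7B model reaches 66.4% on GSM8K and 19.4% on MATH, exceeding the state-of-the-art models of the same size by 11.5% and 8.7%. In particular, MetaMath-70B reaches an accuracy of 82.3% on GSM8K, slightly better than GPT-3.5-Turbo. We release the MetaMathQA dataset, MetaMath models of different sizes, and the training code for public use.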
https://arxiv.org/abs/2309.12284
Evaluation of QA systems is very challenging and expensive, with the most reliable approach being human annotations of correctness of answers for questions. Recent works (AVA, BEM) have shown that transformer LM encoder based similarity metrics transfer well for QA evaluation, but they are limited by the usage of a single correct reference answer. We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation), using multiple reference answers (combining multiple correct and incorrect references) for sentence-form QA. We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems, across multiple academic and industrial datasets, and show that it outperforms previous baselines and obtains the highest correlation with human annotations.
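Evaluating QA systems is very difficult and expensive, and the most reliable approach is human annotation of whether answers to questions are correct. Recent work (AVA, BEM) has shown that similarity metrics based on transformer LM encoders transfer well to QA evaluation, but they are limited by using a single correct reference answer. We propose a new evaluation metric, SQuArE (Sentence-level QUestion AnsweRing Evaluation), which uses multiple reference answers (combining several correct and incorrect references) for sentence-form QA. We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems, across multiple academic and industrial datasets, and show that it outperforms previous baselines and achieves the highest correlation with human annotations.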
https://arxiv.org/abs/2309.12250
Children typically learn to identify and express emotions through sharing their stories and feelings with others, particularly their family. However, it is challenging for parents or siblings to have emotional communication with children since children are still developing their communication skills. We present ChaCha, a chatbot that encourages and guides children to share personal events and associated emotions. ChaCha combines a state machine and large language models (LLMs) to keep the dialogue on track while carrying on free-form conversations. Through an exploratory study with 20 children (aged 8-12), we examine how ChaCha prompts children to share personal events and guides them to describe associated emotions. Participants perceived ChaCha as a close friend and shared their stories on various topics, such as family trips and personal achievements. Based on the quantitative and qualitative findings, we discuss opportunities for leveraging LLMs to design child-friendly chatbots to support children in sharing their emotions.
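Children usually learn to identify and express emotions by sharing their stories and feelings with others, especially their family. However, emotional communication with children is difficult for parents or siblings, because children are still developing their communication skills. We present ChaCha, a chatbot that encourages and guides children to share personal events and the emotions associated with them. ChaCha combines a state machine with large language models (LLMs) to keep the dialogue on track while carrying on free-form conversation. Through an exploratory study with 20 children aged 8-12, we examine how ChaCha prompts children to share personal events and guides them to describe the associated emotions. Participants perceived ChaCha as a close friend and shared their stories on a range of topics, such as family trips and personal achievements. Based on the quantitative and qualitative findings, we discuss opportunities for using LLMs to design child-friendly chatbots that support children in sharing their emotions.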
https://arxiv.org/abs/2309.12244
The increase in the availability of online videos has transformed the way we access information and knowledge. A growing number of individuals now prefer instructional videos as they offer a series of step-by-step procedures to accomplish particular tasks. Instructional videos from the medical domain may provide the best possible visual answers to first aid, medical emergency, and medical education questions. To this end, this paper focuses on answering health-related questions asked by the public by providing visual answers from medical videos. The scarcity of large-scale datasets in the medical domain is a key challenge that hinders the development of applications that can help the public with their health-related questions. To address this issue, we first propose a pipelined approach to create two large-scale datasets: HealthVidQA-CRF and HealthVidQA-Prompt. We then propose monomodal and multimodal approaches that can effectively provide visual answers from medical videos to natural language questions. We conduct a comprehensive analysis of the results, focusing on the impact of the created datasets on model training and the significance of visual features in enhancing the performance of the monomodal and multimodal approaches. Our findings suggest that these datasets have the potential to enhance the performance of medical visual answer localization tasks, and they point to pre-trained language-vision models as a promising direction for further gains.
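The growing availability of online videos has changed the way we access information and knowledge. More and more people now prefer instructional videos, because they lay out the step-by-step procedures for completing a particular task. Instructional videos from the medical domain may provide the best visual answers to first aid, medical emergency, and medical education questions. To this end, this paper focuses on answering health-related questions from the public by providing visual answers from medical videos. The scarcity of large-scale datasets in the medical domain is a key challenge that hinders the development of applications able to help the public with health-related questions. To address this, we first propose a pipelined approach to create two large-scale datasets: HealthVidQA-CRF and HealthVidQA-Prompt. We then propose monomodal and multimodal approaches that can effectively provide visual answers from medical videos to natural language questions. We carry out a comprehensive analysis of the results, focusing on the impact of the created datasets on model training and on the importance of visual features in improving the monomodal and multimodal approaches. Our findings suggest that these datasets can improve performance on medical visual answer localization tasks, and they point to a promising future direction: improving performance further with pre-trained language-vision models.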
https://arxiv.org/abs/2309.12224
We introduce quantum algorithms able to sample equilibrium configurations of water solvent molecules within proteins using analog quantum computing. To do so, we combine a quantum placement strategy with the 3D Reference Interaction Site Model (3D-RISM), an approach capable of predicting continuous solvent distributions. The intrinsic quantum nature of this coupling guarantees that molecules are not placed too close to each other, a constraint usually imposed by hand in classical approaches. We first present a fully quantum adiabatic evolution model that uses a local Rydberg Hamiltonian to cast the general problem into an anti-ferromagnetic Ising model. Its solution, an NP-hard problem in classical computing, is embodied in a Rydberg atom array Quantum Processing Unit (QPU). Following a classical emulator implementation, a port to the QPU allows us to experimentally validate the algorithm's performance on an actual quantum computer. With a view to use on next-generation devices, we emulate a second, hybrid quantum-classical version of the algorithm. This variational quantum approach (VQA) uses a classical Bayesian minimization routine to find the optimal laser parameters. Overall, these Quantum-3D-RISM (Q-3D-RISM) algorithms open a new route towards the application of analog quantum computing in molecular modelling and drug design.
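We introduce quantum algorithms that use analog quantum computing to sample equilibrium configurations of water solvent molecules within proteins. To do this, we combine a quantum placement strategy with the 3D Reference Interaction Site Model (3D-RISM), an approach that can predict continuous solvent distributions. The intrinsically quantum nature of this coupling guarantees that molecules are not placed too close to one another, a constraint that classical approaches usually add by hand. We first present a fully quantum adiabatic evolution model that uses a local Rydberg Hamiltonian to turn the general problem into an anti-ferromagnetic Ising model. Its solution, an NP-hard problem in classical computing, is embodied in a Rydberg atom array Quantum Processing Unit (QPU). After a classical emulator implementation, a port to the QPU makes it possible to validate the algorithm's performance experimentally on a real quantum computer. With a view to use on future devices, we emulate a second, hybrid quantum-classical version of the algorithm. This variational quantum approach (VQA) uses a classical Bayesian minimization routine to find the optimal laser parameters. Overall, these Quantum-3D-RISM (Q-3D-RISM) algorithms open a new route to applying analog quantum computing in molecular modelling and drug design.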
https://arxiv.org/abs/2309.12129
Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available.
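Studying how people interact with large language models (LLMs) in real-world scenarios is becoming more and more important, because these models are widely used in many applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. The dataset was collected from 210,000 unique IP addresses on our Vicuna demo and Chatbot Arena website. We give an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, and highlight its diversity, originality, and scale. We show its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe this dataset will be a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available.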
https://arxiv.org/abs/2309.11998
Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, because the adversarial robustness problem of vision models remains unsolved, introducing vision inputs can expose MLLMs to more severe safety and security risks. In this work, we study the adversarial robustness of Google's Bard, a chatbot competing with ChatGPT that recently released its multimodal capability, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard into outputting wrong image descriptions with a 22% success rate based solely on transferability. We show that the adversarial examples can also attack other MLLMs, e.g., a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE bot. Moreover, we identify two defense mechanisms of Bard, including face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding of the robustness of MLLMs and facilitate future research on defenses. Our code is publicly available.
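Multimodal large language models (MLLMs), which integrate text with other modalities (especially vision), have achieved unprecedented performance on a variety of multimodal tasks. However, because the adversarial robustness problem of vision models remains unsolved, introducing vision inputs may bring more severe safety and security risks. In this work, we study the adversarial robustness of Google's Bard, a chatbot competing with ChatGPT that recently released its multimodal capability, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard into outputting wrong image descriptions, with a 22% success rate based on transferability alone. We show that the adversarial examples can also attack other MLLMs, for example with a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE bot. In addition, we identify two of Bard's defense mechanisms, face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, showing that Bard's current defenses are also vulnerable. We hope this work deepens our understanding of the robustness of MLLMs and supports future research on defenses. Our code is publicly available.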
https://arxiv.org/abs/2309.11751
Answer grounding is the task of locating relevant visual evidence for the Visual Question Answering task. While a wide variety of attention methods have been introduced for this task, they suffer from one or more of the following three problems: designs that do not allow the use of pre-trained networks and so do not benefit from large-scale pre-training; custom designs that are not based on well-grounded previous designs, which limits the learning power of the network; and complicated designs that are challenging to re-implement or improve. In this paper, we propose a novel architectural block, which we term Sentence Attention Block, to solve these problems. The proposed block re-calibrates channel-wise image feature maps by explicitly modeling inter-dependencies between the image feature maps and the sentence embedding. We visually demonstrate how this block filters out irrelevant feature-map channels based on the sentence embedding. We start our design with a well-known attention method, and by making minor modifications, we improve the results to achieve state-of-the-art accuracy. The flexibility of our method makes it easy to use different pre-trained backbone networks, and its simplicity makes it easy to understand and re-implement. We demonstrate the effectiveness of our method on the TextVQA-X, VQS, VQA-X, and VizWiz-VQA-Grounding datasets. We perform multiple ablation studies to show the effectiveness of our design choices.
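Answer grounding is the task of locating the relevant visual evidence in visual question answering. Although many attention methods have been introduced for this task, they face the following three problems: designs that do not allow the use of pre-trained networks and do not benefit from large-scale pre-training; custom designs that are not built on well-grounded previous designs, which limits the learning power of the network; and complicated designs that are hard to re-implement or improve. In this paper, we propose a new architectural block, which we call the Sentence Attention Block, to solve these problems. The block re-calibrates channel-wise image feature maps by explicitly modeling the dependencies between the image feature maps and the sentence embedding. We show visually how the block filters out irrelevant feature-map channels based on the sentence embedding. We start our design from a well-known attention method and, with minor modifications, improve the results to reach state-of-the-art accuracy. The flexibility of our method makes it easy to use different pre-trained backbone networks, and its simplicity makes it easy to understand and re-implement. We demonstrate the effectiveness of our method on the TextVQA-X, VQS, VQA-X, and VizWiz-VQA-Grounding datasets. We carry out multiple ablation studies to show the effectiveness of our design choices.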
https://arxiv.org/abs/2309.11593
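The Sentence Attention Block entry above describes re-calibrating image feature maps channel by channel from the sentence embedding. The abstract does not give the block's layers, so the sketch below is only one plausible reading: a single linear map from the sentence embedding to one sigmoid gate per channel. The names `sentence_gate` and `recalibrate`, the weight matrix, and the toy shapes are all assumptions.

```python
import math
from typing import List

Matrix = List[List[float]]


def sentence_gate(sentence_emb: List[float], weights: Matrix) -> List[float]:
    """One gate in (0, 1) per image channel.

    weights has one row per channel; each row is dotted with the
    sentence embedding and passed through a sigmoid.
    """
    gates = []
    for row in weights:
        score = sum(w * s for w, s in zip(row, sentence_emb))
        gates.append(1.0 / (1.0 + math.exp(-score)))
    return gates


def recalibrate(feature_maps: List[Matrix], gates: List[float]) -> List[Matrix]:
    """Scale every value in channel c by gates[c]."""
    return [
        [[value * gate for value in row] for row in fmap]
        for fmap, gate in zip(feature_maps, gates)
    ]


# Toy input: two channels of size 1x2 and a 2-dimensional sentence embedding.
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
gates = sentence_gate([1.0, 0.0], [[0.0, 0.0], [-50.0, 0.0]])
print(recalibrate(features, gates))
```

In this toy case the second channel's gate is close to zero, so that channel is almost entirely suppressed. This is the kind of sentence-dependent filtering of irrelevant channels the abstract describes.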
Generation of plausible yet incorrect factual information, termed hallucination, is an unsolved issue in large language models. We study the ability of language models to deliberate on the responses they give in order to correct their mistakes. We develop the Chain-of-Verification (CoVe) method whereby the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently so the answers are not biased by other responses; and (iv) generates its final verified response. In experiments, we show CoVe decreases hallucinations across a variety of tasks, from list-based questions from Wikidata, closed book MultiSpanQA and longform text generation.
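The generation of plausible but incorrect factual information, known as hallucination, is an unsolved problem in large language models. We study whether language models can deliberate on the responses they give in order to correct their own mistakes. We develop the Chain-of-Verification (CoVe) method, in which the model first (i) drafts an initial response; then (ii) plans verification questions to fact-check its draft; (iii) answers those questions independently, so that the answers are not biased by other responses; and finally (iv) generates its final verified response. In experiments, we show that CoVe reduces hallucinations across a variety of tasks, including list-based questions from Wikidata, closed-book MultiSpanQA, and long-form text generation.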
https://arxiv.org/abs/2309.11495
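The four CoVe steps listed in the entry above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the authors' implementation: the prompt wording, the `chain_of_verification` function, and the `toy_llm` stand-in (which returns canned strings instead of calling a model) are all hypothetical.

```python
from typing import Callable, List


def chain_of_verification(question: str, llm: Callable[[str], str]) -> str:
    """Draft, plan checks, answer each check independently, then revise."""
    # (i) Draft an initial response.
    draft = llm(f"Answer the question: {question}")
    # (ii) Plan verification questions, one per line.
    plan = llm(f"List verification questions, one per line, for this draft: {draft}")
    checks: List[str] = [line.strip() for line in plan.splitlines() if line.strip()]
    # (iii) Answer each check on its own. The prompt carries only the check,
    # not the draft, so the answers are not biased by the draft.
    answers = [llm(f"Answer concisely: {check}") for check in checks]
    # (iv) Produce the final response from the draft plus the evidence.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(checks, answers))
    return llm(
        f"Question: {question}\nDraft: {draft}\n"
        f"Verification:\n{evidence}\nWrite the final verified answer."
    )


def toy_llm(prompt: str) -> str:
    """Canned stand-in for a language model, for demonstration only."""
    if prompt.startswith("List verification questions"):
        return "Was X born in Y?\nIs Z a city?"
    if prompt.startswith("Answer concisely"):
        return "yes"
    if prompt.startswith("Question:"):
        return "final answer"
    return "draft answer"


print(chain_of_verification("Who was born in Y?", toy_llm))
```

Keeping the draft out of the step (iii) prompts is the design choice the abstract emphasizes. A real system would replace `toy_llm` with an actual model call.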
We address the integration of storytelling and Large Language Models (LLMs) to develop engaging and believable Social Chatbots (SCs) in community settings. Motivated by the potential of fictional characters to enhance social interactions, we introduce Storytelling Social Chatbots (SSCs) and the concept of story engineering to transform fictional game characters into "live" social entities within player communities. Our story engineering process includes three steps: (1) Character and story creation, defining the SC's personality and worldview, (2) Presenting Live Stories to the Community, allowing the SC to recount challenges and seek suggestions, and (3) Communication with community members, enabling interaction between the SC and users. We employed the LLM GPT-3 to drive our SSC prototypes, "David" and "Catherine," and evaluated their performance in an online gaming community, "DE (Alias)," on Discord. Our mixed-method analysis, based on questionnaires (N=15) and interviews (N=8) with community members, reveals that storytelling significantly enhances the engagement and believability of SCs in community settings.
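We address the integration of storytelling and large language models (LLMs) to develop engaging and believable Social Chatbots (SCs) in community settings. Motivated by the potential of fictional characters to enhance social interaction, we introduce Storytelling Social Chatbots (SSCs) and the concept of story engineering, which turns fictional game characters into "live" social entities within player communities. Our story engineering process has three steps: (1) character and story creation, which defines the SC's personality and worldview; (2) presenting live stories to the community, which lets the SC recount challenges and seek suggestions; and (3) communication with community members, which enables interaction between the SC and users. We used the LLM GPT-3 to drive our SSC prototypes, "David" and "Catherine," and evaluated their performance in an online gaming community, "DE (Alias)," on Discord. Our mixed-method analysis, based on questionnaires (N=15) and interviews (N=8) with community members, shows that storytelling significantly enhances the engagement and believability of SCs in community settings.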
https://arxiv.org/abs/2309.11478
We present a comprehensive benchmark dataset for Knowledge Graph Question Answering in Materials Science (KGQA4MAT), with a focus on metal-organic frameworks (MOFs). A knowledge graph for metal-organic frameworks (MOF-KG) has been constructed by integrating structured databases and knowledge extracted from the literature. To enhance MOF-KG accessibility for domain experts, we aim to develop a natural language interface for querying the knowledge graph. We have developed a benchmark comprised of 161 complex questions involving comparison, aggregation, and complicated graph structures. Each question is rephrased in three additional variations, resulting in 644 questions and 161 KG queries. To evaluate the benchmark, we have developed a systematic approach for utilizing ChatGPT to translate natural language questions into formal KG queries. We also apply the approach to the well-known QALD-9 dataset, demonstrating ChatGPT's potential in addressing KGQA issues for different platforms and query languages. The benchmark and the proposed approach aim to stimulate further research and development of user-friendly and efficient interfaces for querying domain-specific materials science knowledge graphs, thereby accelerating the discovery of novel materials.
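We present a comprehensive benchmark dataset for Knowledge Graph Question Answering in Materials Science (KGQA4MAT), with a focus on metal-organic frameworks (MOFs). A knowledge graph for metal-organic frameworks (MOF-KG) has been built by integrating structured databases with knowledge extracted from the literature. To make MOF-KG more accessible to domain experts, we aim to develop a natural language interface for querying the knowledge graph. We developed a benchmark of 161 complex questions involving comparison, aggregation, and complicated graph structures. Each question is rephrased in three additional variations, giving 644 questions and 161 KG queries. To evaluate the benchmark, we developed a systematic approach that uses ChatGPT to translate natural language questions into formal KG queries. We also apply the approach to the well-known QALD-9 dataset, showing ChatGPT's potential for addressing KGQA problems across different platforms and query languages. The benchmark and the proposed approach aim to stimulate further research and development of user-friendly and efficient interfaces for querying domain-specific materials science knowledge graphs, and thereby accelerate the discovery of new materials.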
https://arxiv.org/abs/2309.11361
Despite their competitive performance on knowledge-intensive tasks, large language models (LLMs) still have limitations in memorizing all world knowledge, especially long-tail knowledge. In this paper, we study the KG-augmented language model approach for solving the knowledge graph question answering (KGQA) task, which requires rich world knowledge. Existing work has shown that retrieving KG knowledge to enhance LLM prompting can significantly improve LLM performance in KGQA. However, these approaches lack a well-formed verbalization of KG knowledge, i.e., they ignore the gap between KG representations and textual representations. To this end, we propose an answer-sensitive KG-to-Text approach that can transform KG knowledge into well-textualized statements that are most informative for KGQA. Based on this approach, we propose a KG-to-Text enhanced LLM framework for solving the KGQA task. Experiments on several KGQA benchmarks show that the proposed KG-to-Text augmented LLM approach outperforms previous KG-augmented LLM approaches regarding answer accuracy and usefulness of knowledge statements.
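Although large language models (LLMs) perform competitively on knowledge-intensive tasks, they still cannot memorize all world knowledge, especially long-tail knowledge. In this paper, we study the KG-augmented language model approach to the knowledge graph question answering (KGQA) task, which requires rich world knowledge. Existing work has shown that retrieving KG knowledge to enhance LLM prompting can significantly improve LLM performance on KGQA. However, these approaches lack a well-formed verbalization of KG knowledge, that is, they ignore the gap between KG representations and textual representations. To address this, we propose an answer-sensitive KG-to-Text approach that turns KG knowledge into well-textualized statements that are most informative for KGQA. Based on this approach, we propose a KG-to-Text enhanced LLM framework for solving the KGQA task. Experiments on several KGQA benchmarks show that the proposed KG-to-Text augmented LLM approach outperforms previous KG-augmented LLM approaches in answer accuracy and in the usefulness of the knowledge statements.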
https://arxiv.org/abs/2309.11206