Numerous parties are calling for the democratisation of AI, but the phrase is used to refer to a variety of goals, the pursuit of which sometimes conflicts. This paper identifies four kinds of AI democratisation that are commonly discussed: (1) the democratisation of AI use, (2) the democratisation of AI development, (3) the democratisation of AI profits, and (4) the democratisation of AI governance. Numerous goals and methods of achieving each form of democratisation are discussed. The main takeaway from this paper is that AI democratisation is a multifarious and sometimes conflicting concept that should not be conflated with improving AI accessibility. If we want to move beyond ambiguous commitments to democratising AI, to productive discussions of concrete policies and trade-offs, then we need to recognise the principal role of the democratisation of AI governance in navigating trade-offs and risks across decisions around use, development, and profits.
"许多政党呼吁民主化人工智能,但这个词通常用于指称多种目标,这些目标的实现有时会引起冲突。本文识别了常被人讨论的四种人工智能民主化形式:(1) 民主化人工智能使用,(2) 民主化人工智能发展,(3) 民主化人工智能利润,(4) 民主化人工智能治理。探讨了每个形式的民主化目标和方法。本文的主要结论是,人工智能民主化是一个复杂且有时存在冲突的概念,不应与提高人工智能可用性混淆。如果我们想要超越含糊不清的人工智能民主化承诺,走向具体的政策和权衡的讨论,那么我们需要认识到民主化人工智能治理在处理使用、发展和利润等决策中的主要功能。
https://arxiv.org/abs/2303.12642
Domain gaps are among the most relevant roadblocks in the clinical translation of machine learning (ML)-based solutions for medical image analysis. While current research focuses on new training paradigms and network architectures, little attention is given to the specific effect of prevalence shifts on an algorithm deployed in practice. Such discrepancies between class frequencies in the data used for a method's development/validation and those in its deployment environment(s) are of great importance, for example in the context of artificial intelligence (AI) democratization, as disease prevalences may vary widely across time and location. Our contribution is twofold. First, we empirically demonstrate the potentially severe consequences of missing prevalence handling by analyzing (i) the extent of miscalibration, (ii) the deviation of the decision threshold from the optimum, and (iii) the ability of validation metrics to reflect neural network performance on the deployment population as a function of the discrepancy between development and deployment prevalence. Second, we propose a workflow for prevalence-aware image classification that uses estimated deployment prevalences to adjust a trained classifier to a new environment, without requiring additional annotated deployment data. Comprehensive experiments based on a diverse set of 30 medical classification tasks showcase the benefit of the proposed workflow in generating better classifier decisions and more reliable performance estimates compared to current practice.
https://arxiv.org/abs/2303.12540
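To make the prevalence-handling idea above concrete, here is a minimal sketch of re-weighting a binary classifier's posteriors from the development prevalence to an estimated deployment prevalence via Bayes' rule on the class priors; the function name and the example numbers are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def adjust_for_prevalence(p_dev: np.ndarray, prev_dev: float, prev_dep: float) -> np.ndarray:
    """Re-weight posterior P(disease | x) from the development prevalence to an
    (estimated) deployment prevalence via Bayes' rule on the class priors."""
    odds = p_dev / (1.0 - p_dev)                      # posterior odds under the development prior
    prior_ratio = (prev_dep / (1.0 - prev_dep)) / (prev_dev / (1.0 - prev_dev))
    adjusted_odds = odds * prior_ratio                # swap the prior, keep the likelihood ratio
    return adjusted_odds / (1.0 + adjusted_odds)

# Example: classifier developed at 30% prevalence, deployed where prevalence is ~5%
p_dev = np.array([0.2, 0.5, 0.9])
print(adjust_for_prevalence(p_dev, prev_dev=0.30, prev_dep=0.05))
```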
The mass aggregation of knowledge embedded in large language models (LLMs) holds the promise of new solutions to problems of observability and measurement in the social sciences. We examine the utility of one such model for a particularly difficult measurement task: measuring the latent ideology of lawmakers, which allows us to better understand functions that are core to democracy, such as how politics shape policy and how political actors represent their constituents. We scale the senators of the 116th United States Congress along the liberal-conservative spectrum by prompting ChatGPT to select the more liberal (or conservative) senator in pairwise comparisons. We show that the LLM produced stable answers across repeated iterations, did not hallucinate, and was not simply regurgitating information from a single source. This new scale strongly correlates with pre-existing liberal-conservative scales such as NOMINATE, but also differs in several important ways, such as correctly placing senators who vote against their party for far-left or far-right ideological reasons on the extreme ends. The scale also highly correlates with ideological measures based on campaign giving and political activists' perceptions of these senators. In addition to the potential for better-automated data collection and information retrieval, our results suggest LLMs are likely to open new avenues for measuring latent constructs like ideology that rely on aggregating large quantities of data from public sources.
https://arxiv.org/abs/2303.12057
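As an illustration of how repeated pairwise "which senator is more conservative" judgments can be aggregated into a one-dimensional scale, here is a minimal Bradley-Terry sketch; the aggregation method and the toy counts are my own illustration, not necessarily the scaling procedure used in the paper.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of comparisons in which item i was judged 'more conservative'
    than item j. Returns log-strength scores (higher = more conservative)."""
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            num = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j]) for j in range(n) if j != i)
            p[i] = num / denom if denom > 0 else p[i]
        p /= p.sum()
    return np.log(p)  # log-strengths give an interval-like scale

# Toy example with three senators and repeated LLM pairwise comparisons
wins = np.array([[0, 4, 5],
                 [1, 0, 4],
                 [0, 1, 0]])
print(bradley_terry(wins))
```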
Sports analytics benefits from recent advances in machine learning providing a competitive advantage for teams or individuals. One important task in this context is the performance measurement of individual players to provide reports and log files for subsequent analysis. During sport events like basketball, this involves the re-identification of players during a match either from multiple camera viewpoints or from a single camera viewpoint at different times. In this work, we investigate whether it is possible to transfer the outstanding zero-shot performance of pre-trained CLIP models to the domain of player re-identification. For this purpose, we reformulate the contrastive language-to-image pre-training approach from CLIP to a contrastive image-to-image training approach using the InfoNCE loss as training objective. Unlike previous work, our approach is entirely class-agnostic and benefits from large-scale pre-training. With a fine-tuned CLIP ViT-L/14 model we achieve 98.44% mAP on the MMSports 2022 Player Re-Identification challenge. Furthermore, we show that CLIP Vision Transformers already have strong OCR capabilities to identify useful player features like shirt numbers in a zero-shot manner without any fine-tuning on the dataset. By applying the Score-CAM algorithm we visualise the most important image regions that our fine-tuned model identifies when calculating the similarity score between two images of a player.
https://arxiv.org/abs/2303.11855
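A minimal sketch of a symmetric InfoNCE objective over paired image embeddings, in the spirit of the contrastive image-to-image training described above; the batch construction, temperature, and embedding dimension are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings of the same players.
    Row i of emb_a and row i of emb_b form the positive pair; all other rows are negatives."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature          # (B, B) cosine-similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage: embeddings from a CLIP vision encoder for two views of the same players
emb_a, emb_b = torch.randn(8, 768), torch.randn(8, 768)
print(info_nce(emb_a, emb_b).item())
```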
Inspired by the success of volumetric 3D pose estimation, some recent human mesh estimators propose to estimate 3D skeletons as intermediate representations, from which the dense 3D meshes are regressed by exploiting the mesh topology. However, body shape information is lost in extracting skeletons, leading to mediocre performance. Advanced motion capture systems solve this problem by placing dense physical markers on the body surface, which allows realistic meshes to be extracted from their non-rigid motions. However, they cannot be applied to wild images without markers. In this work, we present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface from large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation. Our approach outperforms the state-of-the-art methods on three datasets. In particular, it surpasses the existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes. Code is available at this https URL.
https://arxiv.org/abs/2303.11726
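A minimal sketch of the interpolation step described above, in which full mesh vertices are recovered as weighted combinations of the 64 detected virtual markers; the weight matrix here is random for illustration, whereas in the paper it would be learned from mocap data, and the vertex count is an assumption.

```python
import numpy as np

# Hypothetical shapes: 64 detected virtual markers, a 6890-vertex body mesh (SMPL-sized).
num_markers, num_vertices = 64, 6890

markers = np.random.rand(num_markers, 3)          # 3D marker positions estimated from an image
W = np.random.rand(num_vertices, num_markers)     # interpolation weights (learned from mocap data)
W /= W.sum(axis=1, keepdims=True)                 # each vertex is a convex combination of markers

vertices = W @ markers                            # (6890, 3) reconstructed mesh vertices
print(vertices.shape)
```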
Given an abstract, deformed, ordinary sketch from untrained amateurs like you and me, this paper turns it into a photorealistic image - just like those shown in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in that we do not dictate an edgemap-like sketch to start with, but aim to work with abstract free-hand human sketches. In doing so, we essentially democratise the sketch-to-photo pipeline, "picturing" a sketch regardless of how good you sketch. Our contribution at the outset is a decoupled encoder-decoder training paradigm, where the decoder is a StyleGAN trained on photos only. This importantly ensures that generated results are always photorealistic. The rest is then all centred around how best to deal with the abstraction gap between sketch and photo. For that, we propose an autoregressive sketch mapper trained on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We further introduce specific designs to tackle the abstract nature of human sketches, including a fine-grained discriminative loss on the back of a trained sketch-photo retrieval model, and a partial-aware sketch augmentation strategy. Finally, we showcase a few downstream tasks our generation model enables, amongst them is showing how fine-grained sketch-based image retrieval, a well-studied problem in the sketch community, can be reduced to an image (generated) to image retrieval task, surpassing the state of the art. We put forward generated results in the supplementary for everyone to scrutinise.
https://arxiv.org/abs/2303.11162
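A minimal sketch of the decoupled training idea above: a frozen photo-only decoder (standing in for the pre-trained StyleGAN) and a trainable sketch-to-latent mapper optimized with a simple reconstruction loss. The modules, sizes, and loss are placeholders, not the paper's autoregressive mapper, fine-grained discriminative loss, or augmentation strategy.

```python
import torch
import torch.nn as nn

latent_dim = 512

# Placeholder decoder standing in for a StyleGAN generator pre-trained on photos only.
decoder = nn.Sequential(nn.Linear(latent_dim, 3 * 64 * 64), nn.Tanh())
decoder.requires_grad_(False)                        # frozen: keeps outputs in the photo manifold

# Trainable mapper from a (flattened) sketch to the decoder's latent space.
mapper = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 1024), nn.ReLU(), nn.Linear(1024, latent_dim))
opt = torch.optim.Adam(mapper.parameters(), lr=1e-4)

sketch = torch.rand(16, 1, 64, 64)                   # abstract free-hand sketches
photo = torch.rand(16, 3 * 64 * 64)                  # paired photos

for _ in range(3):                                   # a few illustrative training steps
    recon = decoder(mapper(sketch))
    loss = nn.functional.mse_loss(recon, photo)      # stand-in for the paper's combination of losses
    opt.zero_grad(); loss.backward(); opt.step()
    print(loss.item())
```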
Joint entity and relation extraction (JERE) is one of the most important tasks in information extraction. However, most existing works focus on sentence-level coarse-grained JERE, which has limitations in real-world scenarios. In this paper, we construct a large-scale document-level fine-grained JERE dataset, DocRED-FE, which extends DocRED with fine-grained entity types. Specifically, we redesign a hierarchical entity type schema including 11 coarse-grained types and 119 fine-grained types, and then re-annotate DocRED manually according to this schema. Through comprehensive experiments we find that: (1) DocRED-FE is challenging to existing JERE models; (2) our fine-grained entity types promote relation classification. We make DocRED-FE, together with instructions and the code for our baselines, publicly available at this https URL.
https://arxiv.org/abs/2303.11141
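A tiny illustration of a two-level entity type schema of the kind described above, mapping fine-grained types to coarse-grained parents; the type names here are hypothetical examples, not the actual DocRED-FE schema.

```python
# Hypothetical illustration of a two-level entity type schema (not the actual DocRED-FE types).
fine_to_coarse = {
    "Politician": "Person",
    "Athlete": "Person",
    "City": "Location",
    "River": "Location",
    "Government_Agency": "Organization",
}

def coarse_type(fine_type: str) -> str:
    """Map a fine-grained entity type to its coarse-grained parent."""
    return fine_to_coarse[fine_type]

print(coarse_type("Politician"))  # -> "Person"
```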
It is likely that AI systems driven by pre-trained language models (PLMs) will increasingly be used to assist humans in high-stakes interactions with other agents, such as negotiation or conflict resolution. Consistent with the goals of Cooperative AI \citep{dafoe_open_2020}, we wish to understand and shape the multi-agent behaviors of PLMs in a pro-social manner. An important first step is the evaluation of model behaviour across diverse cooperation problems. Since desired behaviour in an interaction depends upon its precise game-theoretic structure, we focus on generating scenarios with particular structures, using both crowdworkers and a language model. Our work proceeds as follows. First, we discuss key methodological issues in the generation of scenarios corresponding to particular game-theoretic structures. Second, we employ both crowdworkers and a language model to generate such scenarios. We find that the quality of generations tends to be mediocre in both cases. We additionally get both crowdworkers and a language model to judge whether given scenarios align with their intended game-theoretic structure, finding mixed results depending on the game. Third, we provide a dataset of scenarios based on the data we generated. We provide both quantitative and qualitative evaluations of UnifiedQA and GPT-3 on this dataset. We find that instruct-tuned models tend to act in a way that could be perceived as cooperative when scaled up, while other models seem to have flat scaling trends.
https://arxiv.org/abs/2303.13360
Recent work has shown the promise of creating generalist, transformer-based policies for language, vision, and sequential decision-making problems. To create such models, we generally require centralized training objectives, data, and compute. It is of interest whether we can more flexibly create generalist policies by merging together multiple task-specific, individually trained policies. In this work, we take a preliminary step in this direction by merging, or averaging in weight space, subsets of Decision Transformers trained on different MuJoCo locomotion problems, forming multi-task models without centralized training. We also propose that when merging policies, we can obtain better results if all policies start from common, pre-trained initializations, while also co-training on shared auxiliary tasks during problem-specific finetuning. In general, we believe research in this direction can help democratize and distribute the process by which generally capable agents are formed.
https://arxiv.org/abs/2303.07551
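A minimal sketch of merging policies by uniform parameter averaging in weight space; it assumes all models share an identical architecture, as when task-specific policies are fine-tuned from a common pre-trained initialization. The toy linear modules stand in for Decision Transformers.

```python
import copy
import torch
import torch.nn as nn

def average_state_dicts(models: list[nn.Module]) -> dict:
    """Uniformly average the parameters of same-architecture models in weight space."""
    avg = copy.deepcopy(models[0].state_dict())
    for key in avg:
        avg[key] = torch.stack([m.state_dict()[key].float() for m in models]).mean(dim=0)
    return avg

# Toy example: three task-specific policies fine-tuned from a common initialization.
policies = [nn.Linear(17, 6) for _ in range(3)]      # stand-ins for Decision Transformers
merged = nn.Linear(17, 6)
merged.load_state_dict(average_state_dicts(policies))
```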
This paper introduces the system submitted by the dun_oscar team for the ICPR MSR Challenge. Three subsystems for task1-task3 are described respectively. In task1, we develop a visual system which includes an OCR model, a text tracker, and an NLP classifier for distinguishing subtitles and non-subtitles. In task2, we employ an ASR system which includes an 18-layer acoustic model (AM) and a 4-gram language model (LM). Semi-supervised learning on unlabeled data is also vital. In task3, we employ the ASR system to improve the visual system; some false subtitles can be corrected by a fusion module.
https://arxiv.org/abs/2303.06878
While strides have been made in deep learning based Bengali Optical Character Recognition (OCR) in the past decade, the absence of large Document Layout Analysis (DLA) datasets has hindered the application of OCR in document transcription, e.g., transcribing historical documents and newspapers. Moreover, rule-based DLA systems that are currently being employed in practice are not robust to domain variations and out-of-distribution layouts. To this end, we present the first multidomain large Bengali Document Layout Analysis Dataset: BaDLAD. This dataset contains 33,695 human-annotated document samples from six domains - i) books and magazines, ii) public-domain government documents, iii) liberation war documents, iv) newspapers, v) historical newspapers, and vi) property deeds, with 710K polygon annotations for four unit types: text-box, paragraph, image, and table. Through preliminary experiments benchmarking the performance of existing state-of-the-art deep learning architectures for English DLA, we demonstrate the efficacy of our dataset in training deep learning based Bengali document digitization models.
https://arxiv.org/abs/2303.05325
Pretraining a neural network on a large dataset is becoming a cornerstone in machine learning that is within the reach of only a few communities with large resources. We aim at the ambitious goal of democratizing pretraining. Towards that goal, we train and release a single neural network that can predict high-quality ImageNet parameters of other neural networks. By using predicted parameters for initialization, we are able to boost training of diverse ImageNet models available in PyTorch. When transferred to other datasets, models initialized with predicted parameters also converge faster and reach competitive final performance.
https://arxiv.org/abs/2303.04143
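A sketch of the intended usage pattern: initialize a standard PyTorch model with externally predicted parameters before fine-tuning. Here, predict_parameters is a hypothetical stand-in that just produces random tensors of the right shapes, not the authors' released network or API.

```python
import torch
import torchvision

def predict_parameters(model: torch.nn.Module) -> dict:
    """Hypothetical stand-in for a parameter-prediction network: it returns a state dict
    with tensors of the correct shapes (random here, predicted in the real system)."""
    state = {}
    for name, p in model.state_dict().items():
        state[name] = torch.randn_like(p) * 0.02 if p.is_floating_point() else p.clone()
    return state

model = torchvision.models.resnet50(weights=None)
model.load_state_dict(predict_parameters(model))     # initialize with predicted parameters
# ...then fine-tune as usual; the paper reports faster convergence from such initializations.
```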
Internet memes are characterised by the interspersing of text amongst visual elements. State-of-the-art multimodal meme classifiers do not account for the relative positions of these elements across the two modalities, despite the latent meaning associated with where text and visual elements are placed. On two meme sentiment classification datasets, we systematically show performance gains from incorporating the spatial position of visual objects, faces, and text clusters extracted from memes. In addition, we also present facial embedding as an impactful enhancement to image representation in a multimodal meme classifier. Finally, we show that incorporating this spatial information allows our fully automated approaches to outperform their corresponding baselines that rely on additional human validation of OCR-extracted text.
https://arxiv.org/abs/2303.01781
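One simple way to incorporate the spatial position of detected objects, faces, and text clusters is to concatenate normalized bounding-box coordinates to each element's embedding before classification; the sketch below shows that fusion, which is my own minimal choice rather than the paper's exact architecture.

```python
import torch

def add_spatial_position(features: torch.Tensor, boxes: torch.Tensor,
                         img_w: int, img_h: int) -> torch.Tensor:
    """Concatenate normalized (x1, y1, x2, y2) box coordinates to each element's features.
    features: (N, D) embeddings of visual objects, faces, or OCR text clusters.
    boxes:    (N, 4) pixel-space bounding boxes."""
    norm = boxes / torch.tensor([img_w, img_h, img_w, img_h], dtype=boxes.dtype)
    return torch.cat([features, norm], dim=-1)       # (N, D + 4)

feats = torch.randn(5, 512)                          # e.g. facial or object embeddings
boxes = torch.tensor([[10., 20., 110., 220.]] * 5)
print(add_spatial_position(feats, boxes, img_w=800, img_h=600).shape)
```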
In this paper, we present StrucTexTv2, an effective document image pre-training framework, by performing masked visual-textual prediction. It consists of two self-supervised pre-training tasks: masked image modeling and masked language modeling, based on text region-level image masking. The proposed method randomly masks some image regions according to the bounding box coordinates of text words. The objectives of our pre-training tasks are to reconstruct the pixels of masked image regions and the corresponding masked tokens simultaneously. Hence the pre-trained encoder can capture more textual semantics than conventional masked image modeling, which usually predicts only the masked image patches. Compared to masked multi-modal modeling methods for document image understanding that rely on both the image and text modalities, StrucTexTv2 models image-only input and can potentially handle more application scenarios free from OCR pre-processing. Extensive experiments on mainstream benchmarks of document image understanding demonstrate the effectiveness of StrucTexTv2. It achieves competitive or even new state-of-the-art performance in various downstream tasks such as image classification, layout analysis, table structure recognition, document OCR, and information extraction under the end-to-end scenario.
https://arxiv.org/abs/2303.00289
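A minimal sketch of text-region-level image masking as described above: a fraction of word bounding boxes is sampled and the corresponding image regions are blanked out before encoding. The masking ratio and fill value are illustrative choices, not the paper's exact settings.

```python
import torch

def mask_text_regions(image: torch.Tensor, word_boxes: torch.Tensor,
                      mask_ratio: float = 0.3) -> tuple[torch.Tensor, torch.Tensor]:
    """Randomly mask a fraction of word regions in a (C, H, W) image.
    word_boxes: (N, 4) integer pixel boxes (x1, y1, x2, y2) from OCR/annotation.
    Returns the masked image and a boolean vector marking which words were masked."""
    masked = image.clone()
    chosen = torch.rand(word_boxes.size(0)) < mask_ratio
    for (x1, y1, x2, y2), m in zip(word_boxes.tolist(), chosen.tolist()):
        if m:
            masked[:, y1:y2, x1:x2] = 0.0             # zero out the masked text region
    return masked, chosen

img = torch.rand(3, 256, 256)
boxes = torch.tensor([[10, 10, 60, 30], [80, 40, 150, 60], [20, 100, 120, 130]])
masked_img, masked_words = mask_text_regions(img, boxes)
print(masked_words)
```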
A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, and visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ tests, which diagnoses the nonverbal reasoning capability of MLLMs.
https://arxiv.org/abs/2302.14045
Past work in natural language processing interpretability focused mainly on popular classification tasks while largely overlooking generation settings, partly due to a lack of dedicated tools. In this work, we introduce Inseq, a Python library to democratize access to interpretability analyses of sequence generation models. Inseq enables intuitive and optimized extraction of models' internal information and feature importance scores for popular decoder-only and encoder-decoder Transformers architectures. We showcase its potential by adopting it to highlight gender biases in machine translation models and locate factual knowledge inside GPT-2. Thanks to its extensible interface supporting cutting-edge techniques such as contrastive feature attribution, Inseq can drive future advances in explainable natural language generation, centralizing good practices and enabling fair and reproducible model evaluations.
https://arxiv.org/abs/2302.13942
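A minimal usage sketch following the library's documented interface; the exact method names and supported attribution methods may vary across Inseq versions, so treat this as an assumption to check against the current documentation.

```python
# Minimal Inseq usage sketch (interface as documented at the time of the paper; verify
# against the installed version).
import inseq

model = inseq.load_model("gpt2", "integrated_gradients")   # wrap a HF model with an attribution method
out = model.attribute("The developer argued with the designer because she")
out.show()                                                  # inspect token-level feature importance
```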
Despite significant advances in deep models for music generation, the use of these techniques remains restricted to expert users. Before being democratized among musicians, generative models must first provide expressive control over the generation, as this conditions the integration of deep generative models in creative workflows. In this paper, we tackle this issue by introducing a deep generative audio model providing expressive and continuous descriptor-based control, while remaining lightweight enough to be embedded in a hardware synthesizer. We enforce the controllability of real-time generation by explicitly removing salient musical features in the latent space using an adversarial confusion criterion. User-specified features are then reintroduced as additional conditioning information, allowing for continuous control of the generation, akin to a synthesizer knob. We assess the performance of our method on a wide variety of sounds including instrumental, percussive and speech recordings while providing both timbre and attributes transfer, allowing new ways of generating sounds.
https://arxiv.org/abs/2302.13542
There has been recent interest in improving optical character recognition (OCR) for endangered languages, particularly because a large number of documents and books in these languages are not in machine-readable formats. The performance of OCR systems is typically evaluated using automatic metrics such as character and word error rates. While error rates are useful for the comparison of different models and systems, they do not measure whether and how the transcriptions produced from OCR tools are useful to downstream users. In this paper, we present a human-centric evaluation of OCR systems, focusing on the Kwak'wala language as a case study. With a user study, we show that utilizing OCR reduces the time spent in the manual transcription of culturally valuable documents -- a task that is often undertaken by endangered language community members and researchers -- by over 50%. Our results demonstrate the potential benefits that OCR tools can have on downstream language documentation and revitalization efforts.
https://arxiv.org/abs/2302.13410
How can a robot efficiently extract a desired object from a shelf when it is fully occluded by other objects? Prior works propose geometric approaches for this problem but do not consider object semantics. Shelves in pharmacies, restaurant kitchens, and grocery stores are often organized such that semantically similar objects are placed close to one another. Can large language models (LLMs) serve as semantic knowledge sources to accelerate robotic mechanical search in semantically arranged environments? With Semantic Spatial Search on Shelves (S^4), we use LLMs to generate affinity matrices, where entries correspond to semantic likelihood of physical proximity between objects. We derive semantic spatial distributions by synthesizing semantics with learned geometric constraints. S^4 incorporates Optical Character Recognition (OCR) and semantic refinement with predictions from ViLD, an open-vocabulary object detection model. Simulation experiments suggest that semantic spatial search reduces the search time relative to pure spatial search by an average of 24% across three domains: pharmacy, kitchen, and office shelves. A manually collected dataset of 100 semantic scenes suggests that OCR and semantic refinement improve object detection accuracy by 35%. Lastly, physical experiments in a pharmacy shelf suggest 47.1% improvement over pure spatial search. Supplementary material can be found at this https URL.
https://arxiv.org/abs/2302.12915
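A toy sketch of how an LLM-derived semantic affinity matrix might be combined with a geometric prior to rank candidate hidden locations of the target; the objects, numbers, and combination rule are illustrative only, not the paper's exact derivation.

```python
import numpy as np

objects = ["ibuprofen", "aspirin", "bandages", "soup can"]   # index 0 is the occluded target

# Hypothetical LLM-derived affinity matrix: entry (i, j) is the judged likelihood
# that object i is shelved physically close to object j.
affinity = np.array([[1.0, 0.9, 0.4, 0.1],
                     [0.9, 1.0, 0.4, 0.1],
                     [0.4, 0.4, 1.0, 0.2],
                     [0.1, 0.1, 0.2, 1.0]])

# Geometric prior from occlusion reasoning: for each visible object, how strongly it
# "anchors" each of three candidate hidden slots behind the shelf front.
geometric = np.array([[0.2, 0.6, 0.2],    # aspirin
                      [0.1, 0.7, 0.2],    # bandages
                      [0.5, 0.2, 0.3]])   # soup can

visible = [1, 2, 3]                                   # indices of the visible objects
semantic_weight = affinity[0, visible]                # affinity of the target to each visible object
scores = semantic_weight @ geometric                  # combine semantics with geometry
print(scores / scores.sum())                          # distribution over candidate hidden slots
```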
In recent years, large strides have been taken in developing machine learning methods for dermatological applications, supported in part by the success of deep learning (DL). To date, diagnosing diseases from images is one of the most explored applications of DL within dermatology. Convolutional neural networks (ConvNets) are the most common DL method in medical imaging due to their training efficiency and accuracy, although they are often described as black boxes because of their limited explainability. One popular way to obtain insight into a ConvNet's decision mechanism is gradient class activation maps (Grad-CAM). A quantitative evaluation of Grad-CAM explainability has recently been made possible by the release of DermXDB, a skin disease diagnosis explainability dataset which enables explainability benchmarking of ConvNet architectures. In this paper, we perform a literature review to identify the most common ConvNet architectures used for this task, and compare their Grad-CAM explanations with the explanation maps provided by DermXDB. We identified 11 architectures: DenseNet121, EfficientNet-B0, InceptionV3, InceptionResNetV2, MobileNet, MobileNetV2, NASNetMobile, ResNet50, ResNet50V2, VGG16, and Xception. We pre-trained all architectures on a clinical skin disease dataset, and fine-tuned them on a DermXDB subset. Validation results on the DermXDB holdout subset show an explainability F1 score of between 0.35-0.46, with Xception displaying the highest explainability performance. NASNetMobile reports the highest characteristic-level explainability sensitivity, despite its mediocre diagnosis performance. These results highlight the importance of choosing the right architecture for the desired application and target market, underline the need for additional explainability datasets, and further confirm the need for explainability benchmarking that relies on quantitative analyses.
https://arxiv.org/abs/2302.12084
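For reference, a minimal Grad-CAM sketch in PyTorch showing the mechanism being benchmarked (global-average-pooled gradients weighting the last convolutional activations); this illustrates the technique generically with a ResNet50 stand-in and is not the paper's evaluation pipeline.

```python
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet50(weights=None).eval()
acts, grads = {}, {}

target_layer = model.layer4                            # last convolutional stage
target_layer.register_forward_hook(lambda m, i, o: acts.update(value=o))
target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(value=go[0]))

image = torch.rand(1, 3, 224, 224)                     # stand-in for a dermatology image
logits = model(image)
logits[0, logits.argmax()].backward()                  # gradient of the predicted class score

weights = grads["value"].mean(dim=(2, 3), keepdim=True)        # global-average-pooled gradients
cam = F.relu((weights * acts["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)       # normalized heat map in [0, 1]
print(cam.shape)
```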