Given the enormous number of instructional videos available online, learning a diverse array of multi-step task models from videos is an appealing goal. We introduce a new pre-trained video model, VideoTaskformer, focused on representing the semantics and structure of instructional videos. We pre-train VideoTaskformer using a simple and effective objective: predicting weakly supervised textual labels for steps that are randomly masked out from an instructional video (masked step modeling). Compared to prior work which learns step representations locally, our approach involves learning them globally, leveraging video of the entire surrounding task as context. From these learned representations, we can verify if an unseen video correctly executes a given task, as well as forecast which steps are likely to be taken after a given step. We introduce two new benchmarks for detecting mistakes in instructional videos, to verify if there is an anomalous step and if steps are executed in the right order. We also introduce a long-term forecasting benchmark, where the goal is to predict long-range future steps from a given step. Our method outperforms previous baselines on these tasks, and we believe the tasks will be a valuable way for the community to measure the quality of step representations. Additionally, we evaluate VideoTaskformer on 3 existing benchmarks -- procedural activity recognition, step classification, and step forecasting -- and demonstrate on each that our method outperforms existing baselines and achieves new state-of-the-art performance.
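As a concrete illustration of the masked step modeling objective described above, here is a minimal sketch (hypothetical module and tensor names, not the authors' implementation): step-segment features from one task video are contextualized by a transformer encoder, a random subset of steps is replaced by a mask token, and the model predicts weakly supervised step-label indices for the masked positions.

```python
# Minimal sketch of a masked-step-modeling objective (hypothetical names, not the
# authors' code): step-segment features for one task video are contextualized by a
# transformer encoder, a random subset of steps is masked, and the model predicts
# weakly supervised step-label indices for the masked positions.
import torch
import torch.nn as nn

class MaskedStepModel(nn.Module):
    def __init__(self, feat_dim=512, hidden=512, num_step_labels=10000, max_steps=32):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, hidden))
        self.pos = nn.Parameter(torch.zeros(1, max_steps, hidden))
        layer = nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(hidden, num_step_labels)

    def forward(self, step_feats, mask):
        # step_feats: (B, S, feat_dim) clip features per step; mask: (B, S) bool, True = masked
        x = self.proj(step_feats)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        x = x + self.pos[:, : x.size(1)]
        x = self.encoder(x)                      # global context over the whole task
        return self.head(x)                      # (B, S, num_step_labels)

B, S, D = 4, 16, 512
feats = torch.randn(B, S, D)
labels = torch.randint(0, 10000, (B, S))         # weak step-label ids
mask = torch.rand(B, S) < 0.25                   # mask ~25% of steps
logits = MaskedStepModel()(feats, mask)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
loss.backward()
```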
https://arxiv.org/abs/2303.13519
The goal of building a benchmark (suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate the evolution of a specific area. Nonetheless, we point out that existing protocols of action recognition could yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which covers a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observation suggests that current state-of-the-art cannot solidly guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark to gain insights on building next-generation spatiotemporal learners. Our dataset, code, and models are released at: this https URL
https://arxiv.org/abs/2303.13505
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.0%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.
https://arxiv.org/abs/2303.13496
CLIP has enabled new and exciting joint vision-language applications, one of which is open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and to label them automatically using natural language. We propose a novel problem, zero-guidance segmentation, and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. Project page: this https URL.
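The over-segment → embed → label → merge pipeline can be sketched roughly as follows, assuming precomputed segment and label-text embeddings (random stand-ins below); the paper's CLIP attention-masking technique for segment-specific embeddings is not reproduced here.

```python
# A rough sketch of the "over-segment -> embed -> label -> merge" idea with
# placeholder embeddings; the paper's CLIP attention-masking for segment-specific
# embeddings is not reproduced here.
import numpy as np

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def label_and_merge(seg_emb, text_emb, label_names, merge_thresh=0.9):
    seg_emb, text_emb = l2n(seg_emb), l2n(text_emb)
    labels = np.argmax(seg_emb @ text_emb.T, axis=1)          # nearest text label per segment
    parent = list(range(len(seg_emb)))                        # union-find for merging

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sim = seg_emb @ seg_emb.T
    for i in range(len(seg_emb)):
        for j in range(i + 1, len(seg_emb)):
            if sim[i, j] > merge_thresh:                      # semantically similar over-segments
                parent[find(j)] = find(i)
    groups = {}
    for i in range(len(seg_emb)):
        groups.setdefault(find(i), []).append(i)
    return [(label_names[labels[members[0]]], members) for members in groups.values()]

rng = np.random.default_rng(0)
segments = rng.normal(size=(12, 512))                         # stand-in for CLIP segment embeddings
texts = rng.normal(size=(5, 512))                             # stand-in for CLIP text embeddings
print(label_and_merge(segments, texts, ["sky", "tree", "person", "painting", "floor"]))
```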
https://arxiv.org/abs/2303.13396
Quality assessment algorithms can be used to estimate the utility of a biometric sample for the purpose of biometric recognition. "Error versus Discard Characteristic" (EDC) plots, and "partial Area Under Curve" (pAUC) values of curves therein, are generally used by researchers to evaluate the predictive performance of such quality assessment algorithms. An EDC curve depends on an error type such as the "False Non Match Rate" (FNMR), a quality assessment algorithm, a biometric recognition system, a set of comparisons each corresponding to a biometric sample pair, and a comparison score threshold corresponding to a starting error. To compute an EDC curve, comparisons are progressively discarded based on the associated samples' lowest quality scores, and the error is computed for the remaining comparisons. Additionally, a discard fraction limit or range must be selected to compute pAUC values, which can then be used to quantitatively rank quality assessment algorithms. This paper discusses and analyses various details for this kind of quality assessment algorithm evaluation, including general EDC properties, interpretability improvements for pAUC values based on a hard lower error limit and a soft upper error limit, the use of relative instead of discrete rankings, stepwise vs. linear curve interpolation, and normalisation of quality scores to a [0, 100] integer range. We also analyse the stability of quantitative quality assessment algorithm rankings based on pAUC values across varying pAUC discard fraction limits and starting errors, concluding that higher pAUC discard fraction limits should be preferred. The analyses are conducted both with synthetic data and with real data for a face image quality assessment scenario, with a focus on general modality-independent conclusions for EDC evaluations.
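A simplified numeric sketch of the EDC/pAUC computation described above (synthetic scores and quality values, with each comparison's quality taken as the minimum of its two samples' quality scores) might look like this:

```python
# A simplified numeric sketch of an EDC curve and its pAUC: comparisons are discarded
# in order of their pairwise minimum quality score, and FNMR is recomputed on the
# remaining mated comparisons at a fixed threshold chosen for a target starting error.
import numpy as np

def edc_fnmr(scores, qualities, threshold, discard_fractions):
    """scores: mated comparison scores; qualities: min quality per comparison."""
    order = np.argsort(qualities)                 # lowest-quality comparisons discarded first
    scores = np.asarray(scores)[order]
    n = len(scores)
    fnmr = []
    for f in discard_fractions:
        kept = scores[int(round(f * n)):]         # discard fraction f of comparisons
        fnmr.append(np.mean(kept < threshold) if len(kept) else 0.0)
    return np.asarray(fnmr)

def pauc(discard_fractions, fnmr, limit=0.2):
    m = discard_fractions <= limit                # partial area up to a discard fraction limit
    return np.trapz(fnmr[m], discard_fractions[m])

rng = np.random.default_rng(0)
q = rng.uniform(0, 100, 5000)                     # comparison quality scores
s = rng.normal(0.5 + 0.004 * q, 0.15)             # scores mildly correlated with quality
thr = np.quantile(s, 0.05)                        # ~5% starting FNMR
fracs = np.linspace(0, 0.5, 51)
curve = edc_fnmr(s, q, thr, fracs)
print("pAUC@0.2 =", pauc(fracs, curve, 0.2))
```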
https://arxiv.org/abs/2303.13294
As one of the major branches of automatic speech recognition, attention-based models greatly improve the feature representation ability of a model. In particular, the multi-head mechanism is employed in the attention, with the aim of learning speech features of more aspects in different attention subspaces. For speech recognition of complex languages, on the one hand, a small head size will lead to an obvious shortage of learnable aspects. On the other hand, to keep the size of the overall feature space unchanged when we increase the number of heads, we need to reduce the dimension of each subspace, which significantly weakens the ability to represent the features of each subspace. Therefore, this paper explores how to use a small attention subspace to represent complete speech features while ensuring many heads. In this work we propose a novel neural network architecture, namely a pyramid multi-branch fusion DCNN with multi-head self-attention. The proposed architecture is inspired by Dilated Convolutional Neural Networks (DCNN): it uses multiple branches with DCNN to extract features of the input speech under different receptive fields. To reduce the number of parameters, every two branches are merged until all the branches are merged into one; its shape is thus like a pyramid rotated 90 degrees. We demonstrate that on Aishell-1, a widely used Mandarin speech dataset, our model achieves a character error rate (CER) of 6.45% on the test sets.
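A schematic sketch of the pyramid-style branch fusion, with assumed layer sizes and plain dilated 1-D convolutions standing in for the full DCNN branches:

```python
# A schematic sketch of pyramid-style branch fusion: several dilated-convolution
# branches extract features under different receptive fields, and adjacent branches
# are merged pairwise until a single branch remains (layer sizes are assumptions).
import torch
import torch.nn as nn

class PyramidFusion(nn.Module):
    def __init__(self, dim=80, num_branches=4):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=2 ** i, dilation=2 ** i)
            for i in range(num_branches)
        )
        # one fusion conv per merge level: 4 -> 2 -> 1 branches
        levels = []
        n = num_branches
        while n > 1:
            levels.append(nn.ModuleList(nn.Conv1d(2 * dim, dim, 1) for _ in range(n // 2)))
            n //= 2
        self.fusions = nn.ModuleList(levels)

    def forward(self, x):                         # x: (B, dim, T) acoustic features
        feats = [torch.relu(b(x)) for b in self.branches]
        for level in self.fusions:                # merge every two branches
            feats = [torch.relu(f(torch.cat(pair, dim=1)))
                     for f, pair in zip(level, zip(feats[0::2], feats[1::2]))]
        return feats[0]

out = PyramidFusion()(torch.randn(2, 80, 100))
print(out.shape)                                  # torch.Size([2, 80, 100])
```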
https://arxiv.org/abs/2303.13243
Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP (Radford et al., 2021), to compensate for insufficient annotations. In spite of promising performance, they generally overlook the valuable prior about the label-to-label correspondence. In this paper, we advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior about the label-to-label correspondence via a semantic prior prompter. We then present a novel Semantic Correspondence Prompt Network (SCPNet), which can thoroughly explore the structured semantic prior. A Prior-Enhanced Self-Supervised Learning method is further introduced to enhance the use of the prior. Comprehensive experiments and analyses on several widely used benchmark datasets show that our method significantly outperforms existing methods on all datasets, well demonstrating the effectiveness and the superiority of our method. Our code will be available at this https URL.
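As a toy illustration of a label-to-label semantic prior, the sketch below builds a correlation matrix from label-name embeddings (random vectors standing in for CLIP text features) and uses it to propagate evidence from observed labels to unannotated ones; this is an assumption-laden stand-in, not the paper's prompter or SCPNet.

```python
# Toy sketch of a label-to-label semantic prior: pairwise similarity of label-name
# embeddings (placeholder random vectors standing in for CLIP text features) forms a
# correlation matrix that propagates evidence from observed labels to missing ones.
import numpy as np

def semantic_prior(label_emb, temperature=0.07):
    e = label_emb / np.linalg.norm(label_emb, axis=1, keepdims=True)
    sim = e @ e.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # no self-propagation
    return np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)

def propagate(partial_scores, prior, alpha=0.5):
    """partial_scores: (num_labels,) with NaN for unannotated labels."""
    observed = np.nan_to_num(partial_scores, nan=0.0)
    propagated = prior @ observed                      # pull scores from correlated labels
    return np.where(np.isnan(partial_scores), alpha * propagated, partial_scores)

rng = np.random.default_rng(0)
label_emb = rng.normal(size=(6, 64))                   # stand-in text embeddings for 6 labels
scores = np.array([1.0, np.nan, 0.0, np.nan, 1.0, np.nan])
print(propagate(scores, semantic_prior(label_emb)))
```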
https://arxiv.org/abs/2303.13223
Few-shot object detection (FSOD) aims to expand an object detector to novel categories given only a few instances for training. The few training samples restrict the performance of the FSOD model. Recent text-to-image generation models have shown promising results in generating high-quality images. How applicable these synthetic images are to FSOD tasks remains under-explored. This work extensively studies how synthetic images generated from state-of-the-art text-to-image generators benefit FSOD tasks. We focus on two perspectives: (1) How to use synthetic data for FSOD? (2) How to find representative samples from the large-scale synthetic dataset? We design a copy-paste-based pipeline for using synthetic data. Specifically, saliency object detection is applied to the original generated image, and the minimum enclosing box is used to crop the main object based on the saliency map. After that, the cropped object is randomly pasted onto an image from the base dataset. We also study the influence of the input text of the text-to-image generator and of the number of synthetic images used. To construct a representative synthetic training dataset, we maximize the diversity of the selected images via a sample-based and a cluster-based method. However, the severe problem of the high false positive (FP) ratio of novel categories in FSOD cannot be solved by using synthetic data. We propose integrating CLIP, a zero-shot recognition model, into the FSOD pipeline, which can filter out 90% of the FPs by defining a threshold on the similarity score between the detected object and the text of the predicted category. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method, with performance gains of up to 21.9% compared to the few-shot baseline.
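The copy-paste step and the CLIP-style similarity-threshold filter can be sketched as follows; the saliency mask and the embedding vectors are random placeholders rather than outputs of an actual saliency detector or CLIP model.

```python
# Sketch of the copy-paste step and a similarity-threshold filter. The saliency mask
# and the image/text embedding vectors are placeholders (random stand-ins), not the
# saliency detector or CLIP model used in the paper.
import numpy as np

def min_enclosing_box(saliency_mask, thresh=0.5):
    ys, xs = np.where(saliency_mask > thresh)
    return xs.min(), ys.min(), xs.max() + 1, ys.max() + 1          # x0, y0, x1, y1

def paste_object(base_img, synth_img, saliency_mask, rng):
    x0, y0, x1, y1 = min_enclosing_box(saliency_mask)
    crop = synth_img[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    H, W = base_img.shape[:2]
    top, left = rng.integers(0, H - h + 1), rng.integers(0, W - w + 1)
    out = base_img.copy()
    out[top:top + h, left:left + w] = crop                          # paste cropped object
    return out, (left, top, left + w, top + h)                      # image + pseudo box

def keep_detection(det_embedding, text_embedding, sim_thresh=0.25):
    a = det_embedding / np.linalg.norm(det_embedding)
    b = text_embedding / np.linalg.norm(text_embedding)
    return float(a @ b) >= sim_thresh                               # CLIP-style FP filter

rng = np.random.default_rng(0)
base = rng.integers(0, 255, (480, 640, 3), dtype=np.uint8)
synth = rng.integers(0, 255, (256, 256, 3), dtype=np.uint8)
mask = np.zeros((256, 256)); mask[60:180, 80:200] = 1.0             # fake saliency map
aug, box = paste_object(base, synth, mask, rng)
print(box, keep_detection(rng.normal(size=512), rng.normal(size=512)))
```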
https://arxiv.org/abs/2303.13221
Aiming to link natural language descriptions to specific regions in a 3D scene represented as 3D point clouds, 3D visual grounding is a very fundamental task for human-robot interaction. Recognition errors can significantly impact the overall accuracy and then degrade the operation of AI systems. Despite their effectiveness, existing methods suffer from low recognition accuracy in cases of multiple adjacent objects with similar appearance. To address this issue, this work intuitively introduces human-robot interaction as a cue to facilitate the development of 3D visual grounding. Specifically, a new task termed Embodied Reference Understanding (ERU) is first designed for this concern. A new dataset called ScanERU is then constructed to evaluate the effectiveness of this idea. Different from existing datasets, our ScanERU is the first to cover semi-synthetic scene integration with textual, real-world visual, and synthetic gestural information. Additionally, this paper formulates a heuristic framework based on attention mechanisms and human body movements to enlighten the research of ERU. Experimental results demonstrate the superiority of the proposed method, especially in the recognition of multiple identical objects. Our code and dataset will be made publicly available.
https://arxiv.org/abs/2303.13186
This work is unique in its use of discrete wavelets built from or derived from Chebyshev polynomials of the second and third kind, from which two effective filters are derived: the Discrete Second Chebyshev Wavelets Transform (DSCWT) and the Filtered Discrete Third Chebyshev Wavelets Transform (FDTCWT). These transforms are used to analyze color images and to remove the noise and impurities that accompany an image, which is needed because of the large amount of data that makes up the image as it is captured. These data are massive and therefore difficult to handle during transmission. To address this issue, image compression is applied without the image losing information from the measurements that were obtained, and the results were satisfactory. Mean Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), Bits Per Pixel (BPP), and Compression Ratio (CR) are used as evaluation measures for this initial stage, while the processing stage trains Convolutional Neural Networks (CNN), namely a Discrete Second Chebyshev Wavelets Convolutional Neural Network (DSCWCNN) and a Discrete Third Chebyshev Wavelets Convolutional Neural Network (DTCWCNN), to create an efficient face recognition algorithm, with the best results achieved in accuracy and in the least amount of time. Two samples of color images that were created or implemented were used. The proposed approach produced fast and good results, as shown in the tables of the paper.
https://arxiv.org/abs/2303.13158
Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learning generalizable object localization by a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA$^+$ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. CORA$^+$ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark.
https://arxiv.org/abs/2303.13076
Transformer-based models have recently made significant achievements in end-to-end (E2E) automatic speech recognition (ASR). With the help of Transformer-based models, it is possible to deploy an E2E ASR system on smart devices. However, these models still have the disadvantage of requiring a large number of model parameters. To overcome this drawback of universal Transformer models for ASR on edge devices, we propose a solution that reuses blocks in Transformer models for small-footprint ASR systems, which meets the objective of accommodating resource limitations without compromising recognition accuracy. Specifically, we design a novel block-reusing strategy for the speech Transformer (BRST) to enhance the effectiveness of the parameters and propose an adapter module (ADM) that can produce a compact and adaptable model, with only a few additional trainable parameters accompanying each reused block. We conducted experiments with the proposed method on the public AISHELL-1 corpus, and the results show that the proposed approach achieves character error rates (CER) of 9.3%/6.63% with only 7.6M/8.3M parameters without and with the ADM, respectively. In addition, we provide a deeper analysis to show the effect of the ADM in the general block-reusing method.
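A schematic sketch of block reuse with per-pass adapters, with sizes and adapter placement chosen as assumptions rather than the exact BRST/ADM design:

```python
# Schematic sketch of block reuse with adapters: one shared Transformer encoder block
# is applied N times, and each pass adds its own lightweight bottleneck adapter so the
# extra trainable parameters per reuse stay small (layout is an assumption).
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual bottleneck

class ReusedEncoder(nn.Module):
    def __init__(self, dim=256, num_reuses=6):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.adapters = nn.ModuleList(Adapter(dim) for _ in range(num_reuses))

    def forward(self, x):                              # x: (B, T, dim) acoustic frames
        for adapter in self.adapters:
            x = adapter(self.shared_block(x))          # same weights reused, per-pass adapter
        return x

model = ReusedEncoder()
n_params = sum(p.numel() for p in model.parameters())
print(model(torch.randn(2, 50, 256)).shape, f"{n_params / 1e6:.2f}M params")
```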
https://arxiv.org/abs/2303.13072
Face recognition models embed a face image into a low-dimensional identity vector containing abstract encodings of identity-specific facial features that allow individuals to be distinguished from one another. We tackle the challenging task of inverting the latent space of pre-trained face recognition models without full model access (i.e. black-box setting). A variety of methods have been proposed in literature for this task, but they have serious shortcomings such as a lack of realistic outputs, long inference times, and strong requirements for the data set and accessibility of the face recognition model. Through an analysis of the black-box inversion problem, we show that the conditional diffusion model loss naturally emerges and that we can effectively sample from the inverse distribution even without an identity-specific loss. Our method, named identity denoising diffusion probabilistic model (ID3PM), leverages the stochastic nature of the denoising diffusion process to produce high-quality, identity-preserving face images with various backgrounds, lighting, poses, and expressions. We demonstrate state-of-the-art performance in terms of identity preservation and diversity both qualitatively and quantitatively. Our method is the first black-box face recognition model inversion method that offers intuitive control over the generation process and does not suffer from any of the common shortcomings from competing methods.
https://arxiv.org/abs/2303.13006
The ever-increasing demand for intuitive interactions in Virtual Reality has triggered a boom in the realm of Facial Expression Recognition (FER). To address the limitations of existing approaches (e.g., narrow receptive fields and homogeneous supervisory signals) and further cement the capacity of FER tools, a novel multifarious supervision-steering Transformer for FER in the wild is proposed in this paper. Referred to as FER-former, our approach features multi-granularity embedding integration, a hybrid self-attention scheme, and heterogeneous domain-steering supervision. Specifically, to dig deep into the merits of combining the features provided by prevailing CNNs and Transformers, a hybrid stem is designed to cascade the two types of learning paradigms simultaneously. Within it, a FER-specific transformer mechanism is devised to characterize conventional hard one-hot label-focused tokens and CLIP-based text-oriented tokens in parallel for final classification. To ease the issue of annotation ambiguity, a heterogeneous domain-steering supervision module is proposed to give image features text-space semantic correlations as well, by supervising the similarity between image features and text features. On top of the collaboration of multifarious token heads, diverse global receptive fields with multi-modal semantic cues are captured, thereby delivering superb learning capability. Extensive experiments on popular benchmarks demonstrate the superiority of the proposed FER-former over the existing state of the art.
https://arxiv.org/abs/2303.12997
Facial expression is a form of communication that can be used to interact with computers or other electronic devices, and the recognition of emotion from faces is an emerging practice with applications in many fields. Many cloud-based vision application programming interfaces are available that recognize emotion from facial images and video. In this article, the performance of two well-known APIs was compared using a public dataset of 980 images of facial emotions. For these experiments, a client program was developed which iterates over the image set, calls the cloud services, and caches the results of the emotion detection for each image. The performance was evaluated for each class of emotion using prediction accuracy. It was found that the prediction accuracy for each emotion varies according to the cloud service being used. Similarly, each service provider presents a strong variation of performance according to the class being analyzed, as is shown in more detail in this article.
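A generic sketch of such an evaluation client is shown below; call_provider is a placeholder stub rather than any real vendor SDK, and the dataset entries are synthetic.

```python
# Generic sketch of the evaluation client: iterate over labeled face images, call a
# cloud emotion API (call_provider below is a placeholder stub, not a real vendor SDK),
# cache responses on disk, and report per-class prediction accuracy.
import json, os, random
from collections import defaultdict

def call_provider(provider, image_path):
    # Placeholder stub: a real client would upload the image to the vendor's endpoint
    # and parse the returned emotion scores.
    return random.choice(["anger", "fear", "happiness", "sadness", "surprise", "neutral"])

def evaluate(provider, dataset, cache_file):
    cache = {}
    if os.path.exists(cache_file):
        with open(cache_file) as f:
            cache = json.load(f)
    correct, total = defaultdict(int), defaultdict(int)
    for image_path, true_emotion in dataset:
        if image_path not in cache:                      # avoid repeated (billed) API calls
            cache[image_path] = call_provider(provider, image_path)
        total[true_emotion] += 1
        correct[true_emotion] += int(cache[image_path] == true_emotion)
    with open(cache_file, "w") as f:
        json.dump(cache, f)
    return {emotion: correct[emotion] / total[emotion] for emotion in total}

dataset = [(f"img_{i:04d}.jpg", random.choice(["happiness", "anger", "neutral"]))
           for i in range(30)]
print(evaluate("provider_a", dataset, "provider_a_cache.json"))
```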
https://arxiv.org/abs/2303.12974
Surgical scene understanding is a key prerequisite for context-aware decision support in the operating room. While deep learning-based approaches have already reached or even surpassed human performance in various fields, the task of surgical action recognition remains a major challenge. With this contribution, we are the first to investigate the concept of self-distillation as a means of addressing class imbalance and potential label ambiguity in surgical video analysis. Our proposed method is a heterogeneous ensemble of three models that use Swin Transformers as the backbone and the concepts of self-distillation and multi-task learning as core design choices. According to ablation studies performed with the CholecT45 challenge data via cross-validation, the biggest performance boost is achieved by the usage of soft labels obtained by self-distillation. External validation of our method on an independent test set was achieved by providing a Docker container of our inference model to the challenge organizers. According to their analysis, our method outperforms all other solutions submitted to the latest challenge in the field. Our approach thus shows the potential of self-distillation to become an important tool in medical image analysis applications.
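A generic self-distillation sketch, in which an EMA copy of the model supplies soft labels that are mixed with the hard-label loss; this illustrates the soft-label idea only and is not the paper's Swin-based ensemble.

```python
# Generic self-distillation sketch: an EMA "teacher" copy of the student supplies soft
# labels, and training mixes hard-label cross-entropy with a KL term on the soft labels.
# This illustrates the soft-label idea only, not the paper's Swin-based ensemble.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
teacher = copy.deepcopy(student).requires_grad_(False)
opt = torch.optim.AdamW(student.parameters(), lr=1e-3)

def step(x, hard_labels, tau=2.0, mix=0.5, ema=0.999):
    logits = student(x)
    with torch.no_grad():
        soft = F.softmax(teacher(x) / tau, dim=-1)               # self-distilled soft labels
    loss = (1 - mix) * F.cross_entropy(logits, hard_labels) \
         + mix * tau ** 2 * F.kl_div(F.log_softmax(logits / tau, dim=-1), soft,
                                     reduction="batchmean")
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():                                        # EMA update of the teacher
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(ema).add_(sp, alpha=1 - ema)
    return loss.item()

x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))
print(step(x, y))
```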
https://arxiv.org/abs/2303.12915
We show that training a multi-headed self-attention-based deep network to predict deleted, information-dense 2-8 Hz speech modulations over a 1.5-second section of a speech utterance is an effective way to make machines learn to extract speech modulations using time-domain contextual information. Our work exhibits that, once trained on large volumes of unlabelled data, the outputs of the self-attention layers vary in time with a modulation peak at 4 Hz. These pre-trained layers can be used to initialize parts of an Automatic Speech Recognition system to reduce its reliance on labeled speech data greatly.
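One plausible way to construct such a masked-modulation prediction target is sketched below: compute an amplitude envelope, band-pass it to 2-8 Hz, and zero out a 1.5-second span to be predicted from the surrounding context (the filter design choices are assumptions).

```python
# Sketch of constructing a masked-modulation target: compute a speech envelope, band-pass
# it to 2-8 Hz, and zero out a 1.5-second span that the network would be trained to
# predict from the surrounding context (filter choices here are assumptions).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_envelope(speech, fs, band=(2.0, 8.0)):
    envelope = np.abs(hilbert(speech))                       # amplitude envelope
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, envelope)                        # 2-8 Hz modulations

def mask_span(signal, fs, start_s, dur_s=1.5):
    masked = signal.copy()
    i0, i1 = int(start_s * fs), int((start_s + dur_s) * fs)
    target = masked[i0:i1].copy()                            # what the model must reconstruct
    masked[i0:i1] = 0.0
    return masked, target

fs = 16000
t = np.arange(0, 4.0, 1 / fs)
speech = np.sin(2 * np.pi * 200 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))  # 4 Hz modulated tone
mod = modulation_envelope(speech, fs)
context, target = mask_span(mod, fs, start_s=1.0)
print(context.shape, target.shape)
```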
https://arxiv.org/abs/2303.12908
Myocardial infarction and heart failure are major cardiovascular diseases that affect millions of people in the US. The morbidity and mortality are highest among patients who develop cardiogenic shock. Early recognition of cardiogenic shock is critical. Prompt implementation of treatment measures can prevent the deleterious spiral of ischemia, low blood pressure, and reduced cardiac output due to cardiogenic shock. However, early identification of cardiogenic shock has been challenging due to human providers' inability to process the enormous amount of data in the cardiac intensive care unit (ICU) and lack of an effective risk stratification tool. We developed a deep learning-based risk stratification tool, called CShock, for patients admitted into the cardiac ICU with acute decompensated heart failure and/or myocardial infarction to predict onset of cardiogenic shock. To develop and validate CShock, we annotated cardiac ICU datasets with physician adjudicated outcomes. CShock achieved an area under the receiver operator characteristic curve (AUROC) of 0.820, which substantially outperformed CardShock (AUROC 0.519), a well-established risk score for cardiogenic shock prognosis. CShock was externally validated in an independent patient cohort and achieved an AUROC of 0.800, demonstrating its generalizability in other cardiac ICUs.
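For reference, the kind of AUROC comparison reported above can be reproduced in spirit on synthetic stand-in data (no relation to the actual cardiac ICU cohort):

```python
# Sketch of an AUROC comparison between two risk scores on synthetic stand-in data
# (no relation to the actual cardiac ICU cohort): binary cardiogenic-shock outcomes
# are scored with scikit-learn's roc_auc_score.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.15, size=2000)                 # 1 = developed cardiogenic shock
informative = rng.normal(loc=y * 1.2, scale=1.0)     # a score correlated with the outcome
weak = rng.normal(loc=y * 0.05, scale=1.0)           # a nearly uninformative score

print("risk score A AUROC:", round(roc_auc_score(y, informative), 3))
print("risk score B AUROC:", round(roc_auc_score(y, weak), 3))
```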
https://arxiv.org/abs/2303.12888
This work focuses on sign language retrieval, a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos not only contain visual signals but also carry abundant semantic meaning on their own, because sign languages are themselves natural languages. Considering this characteristic, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed cross-lingual contrastive learning. Another challenge is raised by the data scarcity issue: sign language datasets are orders of magnitude smaller in scale than those for speech recognition. We alleviate this issue by adapting a domain-agnostic sign encoder pre-trained on large-scale sign videos to the target domain via pseudo-labeling. Our framework, termed domain-aware sign language retrieval via Cross-lingual Contrastive learning, or CiCo for short, outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on the How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on the PHOENIX-2014T dataset. Code and models are available at: this https URL.
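The joint-embedding contrastive objective can be sketched as a symmetric InfoNCE loss between sign-video and text embeddings (random placeholder features below; the paper's fine-grained sign-to-word mapping is not modeled):

```python
# Minimal sketch of the joint-embedding contrastive objective: a symmetric InfoNCE loss
# between sign-video embeddings and text embeddings (placeholder random features; the
# paper's fine-grained sign-to-word mapping is not modeled here).
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                   # (B, B) similarity of all pairs
    targets = torch.arange(len(v))                   # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +     # symmetric T2V / V2T terms
                  F.cross_entropy(logits.T, targets))

video_emb = torch.randn(8, 512, requires_grad=True)  # stand-in sign-video features
text_emb = torch.randn(8, 512, requires_grad=True)   # stand-in sentence features
loss = contrastive_loss(video_emb, text_emb)
loss.backward()
print(float(loss))
```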
https://arxiv.org/abs/2303.12793
LiDAR-based 3D point cloud recognition has benefited various applications. Without specially considering the LiDAR point distribution, most current methods suffer from information disconnection and limited receptive field, especially for the sparse distant points. In this work, we study the varying-sparsity distribution of LiDAR points and present SphereFormer to directly aggregate information from dense close points to the sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It overcomes the disconnection issue and enlarges the receptive field smoothly and dramatically, which significantly boosts the performance of sparse distant points. Moreover, to fit the narrow and long windows, we propose exponential splitting to yield fine-grained position encoding and dynamic feature selection to increase model representation ability. Notably, our method ranks 1st on both nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9% and 74.8% mIoU, respectively. Also, we achieve the 3rd place on nuScenes object detection benchmark with 72.8% NDS and 68.5% mAP. Code is available at this https URL.
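A rough sketch of radial window partitioning: points are binned by spherical angles into narrow, long windows stretching from near to far, with an exponential split along the radius; the bin counts are assumptions.

```python
# Sketch of radial window partitioning: points are binned by spherical angles into
# narrow, long windows that stretch from near to far, and the radius is split
# exponentially so distant (sparse) regions get coarser bins (bin sizes are assumptions).
import numpy as np

def radial_windows(xyz, azim_bins=64, polar_bins=16):
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    r = np.linalg.norm(xyz, axis=1)
    azim = np.arctan2(y, x)                                    # [-pi, pi)
    polar = np.arccos(np.clip(z / np.maximum(r, 1e-6), -1, 1))
    a = ((azim + np.pi) / (2 * np.pi) * azim_bins).astype(int) % azim_bins
    p = np.minimum((polar / np.pi * polar_bins).astype(int), polar_bins - 1)
    return a * polar_bins + p                                  # window id per point

def exponential_radial_bins(r, r_min=1.0, num_bins=8, base=2.0):
    # bin edges grow geometrically: [0, r_min, r_min*base, r_min*base^2, ...]
    edges = np.concatenate([[0.0], r_min * base ** np.arange(num_bins)])
    return np.clip(np.digitize(r, edges) - 1, 0, num_bins - 1)

rng = np.random.default_rng(0)
points = rng.uniform(-50, 50, size=(10000, 3))
win = radial_windows(points)
rad = exponential_radial_bins(np.linalg.norm(points, axis=1))
print("non-empty windows:", len(np.unique(win)), "| radial bin counts:", np.bincount(rad))
```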
https://arxiv.org/abs/2303.12766