The goal of building a benchmark (suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate the evolution of a specific area. Nonetheless, we point out that existing protocols for action recognition could yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which covers a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observations suggest that current state-of-the-art models cannot reliably deliver high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark to gain insights on building next-generation spatiotemporal learners. Our dataset, code, and models are released at: this https URL
https://arxiv.org/abs/2303.13505
Learning dense visual representations without labels is an arduous task, and more so from scene-centric data. We tackle this challenging problem with a Cross-view consistency objective paired with an Online Clustering mechanism (CrOC) to discover and segment the semantics of the views. In the absence of hand-crafted priors, the resulting method is more generalizable and does not require a cumbersome pre-processing step. More importantly, the clustering algorithm operates jointly on the features of both views, thereby elegantly bypassing the issue of content not represented in both views and the ambiguous matching of objects from one crop to the other. We demonstrate excellent performance on linear and unsupervised segmentation transfer tasks on various datasets, and similarly for video object segmentation. Our code and pre-trained models are publicly available at this https URL.
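The mechanism worth sketching here is the joint clustering of the features from both augmented views, with consistency enforced only for clusters that appear in both. The minimal sketch below uses a naive k-means and mean-pooled cluster prototypes; the cluster count, the clustering routine, and all names are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def joint_kmeans(feats, k=6, iters=10):
    """Naive k-means over the concatenated patch features of both views."""
    centroids = feats[torch.randperm(feats.size(0))[:k]].clone()
    for _ in range(iters):
        assign = torch.cdist(feats, centroids).argmin(dim=1)          # (N,)
        for c in range(k):
            mask = assign == c
            if mask.any():
                centroids[c] = feats[mask].mean(dim=0)
    return assign

def croc_style_loss(patches_v1, patches_v2, k=6):
    """Cluster both views together, then pull matched cluster prototypes together."""
    f1 = F.normalize(patches_v1, dim=-1)                              # (N1, D) patch tokens, view 1
    f2 = F.normalize(patches_v2, dim=-1)                              # (N2, D) patch tokens, view 2
    assign = joint_kmeans(torch.cat([f1, f2], dim=0), k)
    a1, a2 = assign[: f1.size(0)], assign[f1.size(0):]
    loss, used = 0.0, 0
    for c in range(k):
        m1, m2 = a1 == c, a2 == c
        if m1.any() and m2.any():                                     # keep clusters visible in both views
            p1 = F.normalize(f1[m1].mean(dim=0), dim=-1)
            p2 = F.normalize(f2[m2].mean(dim=0), dim=-1)
            loss = loss + (1 - (p1 * p2).sum())                       # cosine consistency
            used += 1
    return loss / max(used, 1)

# toy usage: 196 patch tokens per view, 64-dim features
loss = croc_style_loss(torch.randn(196, 64), torch.randn(196, 64))
```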
https://arxiv.org/abs/2303.13245
Renal transplantation is the most effective solution for end-stage renal disease. Arising from complex causes, a substantial risk of chronic transplant dysfunction persists and may lead to graft loss. Medical imaging plays a substantial role in renal transplant monitoring in clinical practice. However, graft supervision is multi-disciplinary, notably joining nephrology, urology, and radiology, and identifying robust biomarkers from such high-dimensional and complex data for prognosis is challenging. In this work, taking inspiration from the recent success of Large Language Models (LLMs), we propose MEDIMP -- Medical Images and Prompts -- a model that learns meaningful multi-modal representations of renal transplant Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE MRI) by incorporating structural clinicobiological data after translating them into text prompts. MEDIMP is based on contrastive learning from joint text-image paired embeddings to perform this challenging task. Moreover, we propose a framework that generates medical prompts using automatic textual data augmentations from LLMs. Our goal is to learn meaningful manifolds of renal transplant DCE MRI that are relevant to the prognosis of the transplant or patient status (2, 3, and 4 years after the transplant), fully exploiting the available multi-modal data in the most efficient way. Extensive experiments and comparisons with other renal transplant representation learning methods with limited data prove the effectiveness of MEDIMP in a relevant clinical setting, giving new directions toward medical prompts. Our code is available at this https URL.
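A hedged sketch of the contrastive objective described above: a symmetric InfoNCE loss over matched (DCE MRI volume, clinical prompt) embeddings, plus a toy prompt template standing in for the translation of clinicobiological variables into text. The encoders, variable names, and prompt wording are assumptions for illustration, not the MEDIMP implementation.

```python
import torch
import torch.nn.functional as F

def paired_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (MRI volume, clinical prompt) pairs in a batch."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(img.size(0))                   # the i-th image matches the i-th prompt
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def make_prompt(creatinine, gfr, months_post_tx):
    """Hypothetical prompt built from structural clinicobiological variables."""
    return (f"DCE MRI of a renal transplant {months_post_tx} months after surgery, "
            f"serum creatinine {creatinine} mg/dL, eGFR {gfr} mL/min.")

# toy usage with random embeddings standing in for the image and text encoders
loss = paired_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```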
https://arxiv.org/abs/2303.12445
Sequential video understanding, as an emerging video understanding task, has drawn considerable attention from researchers because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding, where accurate timestamp-level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and to the whole video, respectively. To model the correspondence between text and video, we propose a multiple-granularity loss, where a video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces matching between each action and its description. As frame-sentence correspondence is not available, we exploit the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondences and supervise network training with these pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of the proposed approach. Code is available at this https URL
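The multiple-granularity loss can be sketched as a coarse video-paragraph InfoNCE term plus a fine frame-sentence term supervised by pseudo labels derived from temporal order. The sketch below assumes the simplest pseudo-labeling rule (split the frames evenly across the ordered sentences) and an arbitrary weighting between the two terms; neither is claimed to match the paper exactly.

```python
import torch
import torch.nn.functional as F

def video_paragraph_loss(video_emb, script_emb, t=0.07):
    """Coarse term: match each whole video with its complete script within the batch."""
    v = F.normalize(video_emb, dim=-1)
    s = F.normalize(script_emb, dim=-1)
    logits = v @ s.t() / t
    targets = torch.arange(v.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def pseudo_frame_sentence_loss(frame_emb, sent_emb, t=0.07):
    """Fine term: because actions occur in order, split the T frames evenly across the
    K action sentences to build pseudo frame-to-sentence labels."""
    T, K = frame_emb.size(0), sent_emb.size(0)
    pseudo = torch.clamp(torch.arange(T) * K // T, max=K - 1)   # frame index -> sentence index
    logits = F.normalize(frame_emb, dim=-1) @ F.normalize(sent_emb, dim=-1).t() / t
    return F.cross_entropy(logits, pseudo)

# toy usage: 4 videos with 128-d embeddings; one video of 32 frames and a 5-sentence script
coarse = video_paragraph_loss(torch.randn(4, 128), torch.randn(4, 128))
fine = pseudo_frame_sentence_loss(torch.randn(32, 128), torch.randn(5, 128))
total = coarse + 0.5 * fine   # the relative weighting is an assumption
```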
https://arxiv.org/abs/2303.12370
Teaching dexterity to multi-fingered robots has been a longstanding challenge in robotics. Most prominent work in this area focuses on learning controllers or policies that either operate on visual observations or on state estimates derived from vision. However, such methods perform poorly on fine-grained manipulation tasks that require reasoning about contact forces or about objects occluded by the hand itself. In this work, we present T-Dex, a new approach for tactile-based dexterity that operates in two phases. In the first phase, we collect 2.5 hours of play data, which is used to train self-supervised tactile encoders. This is necessary to bring high-dimensional tactile readings down to a lower-dimensional embedding. In the second phase, given a handful of demonstrations for a dexterous task, we learn non-parametric policies that combine the tactile observations with visual ones. Across five challenging dexterous tasks, we show that our tactile-based dexterity models outperform purely vision- and torque-based models by an average of 1.7x. Finally, we provide a detailed analysis of the factors critical to T-Dex, including the importance of play data, architectures, and representation learning.
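A minimal sketch of a non-parametric policy of the kind described: retrieve the k nearest demonstration frames in a combined tactile-visual embedding space and return a distance-weighted average of their actions. The weighting scheme, dimensions, and function names are illustrative assumptions.

```python
import torch

def nonparametric_policy(query_tactile, query_visual, demo_tactile, demo_visual,
                         demo_actions, k=3, tactile_weight=0.5):
    """k-nearest-neighbour lookup over demonstration embeddings; the commanded action is
    the distance-weighted average of the neighbours' recorded actions."""
    q = torch.cat([tactile_weight * query_tactile, (1 - tactile_weight) * query_visual])
    d = torch.cat([tactile_weight * demo_tactile, (1 - tactile_weight) * demo_visual], dim=1)
    dist = torch.cdist(q[None], d)[0]                     # (N,) distances to all demo frames
    vals, idx = torch.topk(dist, k, largest=False)        # k closest demonstration frames
    w = torch.softmax(-vals, dim=0)                       # closer neighbours weigh more
    return (w[:, None] * demo_actions[idx]).sum(dim=0)

# toy usage: 500 demo frames, 64-d tactile and visual embeddings, a 12-dof hand action
action = nonparametric_policy(torch.randn(64), torch.randn(64),
                              torch.randn(500, 64), torch.randn(500, 64),
                              torch.randn(500, 12))
```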
https://arxiv.org/abs/2303.12076
Masked Autoencoders (MAEs) learn self-supervised representations by randomly masking input image patches and minimizing a reconstruction loss. Alternatively, contrastive self-supervised methods encourage two versions of the same input to have a similar representation, while pulling apart the representations of different inputs. We propose ViC-MAE, a general method that combines MAE and contrastive learning by pooling the local feature representations learned under the MAE reconstruction objective and leveraging this global representation under a contrastive objective across video frames. We show that visual representations learned under ViC-MAE generalize well to both video classification and image classification tasks. Using a ViT-B/16 backbone pre-trained on the Moments in Time (MiT) dataset, we obtain state-of-the-art transfer learning from video to images on ImageNet-1k, improving absolute top-1 accuracy by 1.58% over recent prior work. Moreover, our method maintains a competitive transfer-learning performance of 81.50% top-1 accuracy on the Kinetics-400 video classification benchmark. In addition, we show that despite its simplicity, ViC-MAE yields improved results compared to combining MAE pre-training with previously proposed contrastive objectives such as VicReg and SimSiam.
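A hedged sketch of how the two objectives can be combined: an MAE-style reconstruction loss on the masked patches of one frame, plus an InfoNCE term over pooled patch features of two frames from the same video. The mean pooling and equal loss weighting are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def vic_mae_style_losses(patch_feats_f1, recon_f1, target_f1, patch_feats_f2, t=0.1):
    """Reconstruction on frame 1 plus a contrastive term that treats the pooled
    representations of two frames from the same video as a positive pair."""
    recon_loss = F.mse_loss(recon_f1, target_f1)                  # masked-patch reconstruction
    g1 = F.normalize(patch_feats_f1.mean(dim=1), dim=-1)          # (B, D) pooled local features
    g2 = F.normalize(patch_feats_f2.mean(dim=1), dim=-1)
    logits = g1 @ g2.t() / t
    targets = torch.arange(g1.size(0))                            # other videos act as negatives
    return recon_loss + F.cross_entropy(logits, targets)          # equal weighting is an assumption

# toy usage: batch of 4 videos, 196 patch tokens of 768-d, 49 masked patches to reconstruct
loss = vic_mae_style_losses(torch.randn(4, 196, 768), torch.randn(4, 49, 768),
                            torch.randn(4, 49, 768), torch.randn(4, 196, 768))
```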
https://arxiv.org/abs/2303.12001
Recent years have seen the development of descriptor generation based on representation learning of extremely diverse molecules, especially approaches that apply natural language processing (NLP) models to SMILES, a literal representation of molecular structure. However, little research has been done on how these models understand chemical structure. To address this, we investigated the relationship between the learning progress on SMILES and chemical structure using a representative NLP model, the Transformer. The results suggest that while the Transformer learns partial structures of molecules quickly, it requires extended training to understand overall structures. Consistently, the accuracy of molecular property predictions using descriptors generated from models at different learning steps was similar from the beginning to the end of training. Furthermore, we found that the Transformer requires particularly long training to learn chirality and sometimes stagnates with low translation accuracy due to misunderstanding of enantiomers. These findings are expected to deepen understanding of NLP models in chemistry.
https://arxiv.org/abs/2303.11593
One-to-one (o2o) label assignment plays a key role in transformer-based end-to-end detection, and it has recently been introduced in fully convolutional detectors for end-to-end dense detection. However, o2o can degrade feature learning efficiency due to the limited number of positive samples. Though extra positive samples are introduced to mitigate this issue in recent DETRs, the computation of self- and cross-attention in the decoder limits its practical application to dense and fully convolutional detectors. In this work, we propose a simple yet effective one-to-few (o2f) label assignment strategy for end-to-end dense detection. Apart from defining one positive and many negative anchors for each object, we define several soft anchors, which serve as positive and negative samples simultaneously. The positive and negative weights of these soft anchors are dynamically adjusted during training so that they contribute more to "representation learning" in the early training stage, and more to "duplicated prediction removal" in the later stage. A detector trained in this way can not only learn a strong feature representation but also perform end-to-end dense detection. Experiments on the COCO and CrowdHuman datasets demonstrate the effectiveness of the o2f scheme. Code is available at this https URL.
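A toy sketch of the dynamic weighting idea for the soft anchors: their positive weight starts high and decays over training, shifting their role from representation learning early on to duplicated-prediction removal later. The linear schedule and endpoint values are assumptions, not the paper's settings.

```python
def soft_anchor_weights(epoch, total_epochs, w_max=0.8, w_min=0.2):
    """Return (positive weight, negative weight) for the soft anchors at a given epoch.
    Early epochs: mostly positive (feature learning). Late epochs: mostly negative
    (suppressing duplicated predictions). Illustrative linear decay."""
    progress = epoch / max(total_epochs - 1, 1)
    w_pos = w_max - (w_max - w_min) * progress
    return w_pos, 1.0 - w_pos

for epoch in (0, 6, 11):
    print(epoch, soft_anchor_weights(epoch, total_epochs=12))
```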
https://arxiv.org/abs/2303.11567
Deep learning in general domains has constantly been extended to domain-specific tasks requiring the recognition of fine-grained characteristics. However, real-world applications for fine-grained tasks suffer from two challenges: a high reliance on expert knowledge for annotation, and the necessity of a versatile model for various downstream tasks in a specific domain (e.g., prediction of categories, bounding boxes, or pixel-wise annotations). Fortunately, recent self-supervised learning (SSL) is a promising approach to pretraining a model without annotations, serving as an effective initialization for any downstream task. Since SSL does not rely on the presence of annotations, it generally utilizes a large-scale unlabeled dataset, referred to as an open-set. In this sense, we introduce a novel Open-Set Self-Supervised Learning problem under the assumption that a large-scale unlabeled open-set is available, in addition to the fine-grained target dataset, during the pretraining phase. In our problem setup, it is crucial to consider the distribution mismatch between the open-set and the target dataset. Hence, we propose the SimCore algorithm to sample a coreset, the subset of the open-set that has minimum distance to the target dataset in the latent space. We demonstrate that SimCore significantly improves representation learning performance through extensive experimental settings, including eleven fine-grained datasets and seven open-sets in various downstream tasks.
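A minimal sketch of the coreset sampling step as described: embed both sets, measure each open-set sample's distance to its nearest target sample, and keep the closest samples up to a budget. The cosine-normalized embeddings and Euclidean distance are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sample_coreset(openset_emb, target_emb, budget):
    """Select the open-set samples whose nearest target embedding is closest, i.e. the
    subset with minimum distance to the target dataset in the latent space."""
    o = F.normalize(openset_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    min_dist = torch.cdist(o, t).min(dim=1).values        # distance to the nearest target sample
    return torch.topk(min_dist, budget, largest=False).indices

# toy usage: 10k open-set embeddings, 1k target embeddings, keep 1% of the open-set
coreset_idx = sample_coreset(torch.randn(10000, 128), torch.randn(1000, 128), budget=100)
```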
https://arxiv.org/abs/2303.11101
Supervised deep learning methods have achieved considerable success in medical image analysis, owing to the availability of large-scale and well-annotated datasets. However, creating such datasets for whole slide images (WSIs) in histopathology is a challenging task due to their gigapixel size. In recent years, self-supervised learning (SSL) has emerged as an alternative solution to reduce the annotation overheads in WSIs, as it does not require labels for training. These SSL approaches, however, are not designed for handling multi-resolution WSIs, which limits their performance in learning discriminative image features. In this paper, we propose a Dual-branch SSL Framework for WSI tumour segmentation (DSF-WSI) that can effectively learn image features from multi-resolution WSIs. DSF-WSI connects two branches and jointly learns from low- and high-resolution WSIs in a self-supervised manner. Moreover, we introduce a novel Context-Target Fusion Module (CTFM) and a masked jigsaw pretext task to align the learnt multi-resolution features. Furthermore, we design a Dense SimSiam Learning (DSL) strategy to maximise the similarity of different views of WSIs, enabling the learnt representations to be more efficient and discriminative. We evaluated our method on breast and liver cancer segmentation tasks using two public datasets. The experimental results demonstrate that DSF-WSI can effectively extract robust and efficient representations, which we validated through subsequent fine-tuning and semi-supervised settings. Our proposed method achieved better accuracy than other state-of-the-art approaches. Code is available at this https URL.
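A hedged sketch of a dense SimSiam-style term of the kind the DSL strategy suggests: negative cosine similarity between the dense feature maps of the two branches, applied location by location with a stop-gradient on the projection side. The location-wise formulation and symmetric weighting are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def dense_simsiam_loss(proj_a, pred_a, proj_b, pred_b):
    """Symmetric negative cosine similarity between dense feature maps of two WSI views."""
    def d(p, z):
        p = F.normalize(p.flatten(2), dim=1)              # (B, C, H*W), predictor output
        z = F.normalize(z.flatten(2), dim=1).detach()     # stop-gradient on the projection
        return -(p * z).sum(dim=1).mean()                 # per-location cosine, averaged
    return 0.5 * d(pred_a, proj_b) + 0.5 * d(pred_b, proj_a)

# toy usage: 256-channel 14x14 feature maps from the low- and high-resolution branches
loss = dense_simsiam_loss(torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14),
                          torch.randn(2, 256, 14, 14), torch.randn(2, 256, 14, 14))
```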
https://arxiv.org/abs/2303.11019
In this paper, we investigate representation learning for low-resource keyword spotting (KWS). The main challenges of KWS are limited labeled data and limited available device resources. To address these challenges, we explore representation learning for KWS via self-supervised contrastive learning and self-training with a pretrained model. First, local-global contrastive siamese networks (LGCSiam) are designed to learn similar utterance-level representations for similar audio samples through the proposed local-global contrastive loss, without requiring ground-truth labels. Second, a self-supervised pretrained Wav2Vec 2.0 model is applied as a constraint module (WVC) to force the KWS model to learn frame-level acoustic representations. With the LGCSiam and WVC modules, the proposed small-footprint KWS model can be pretrained with unlabeled data. Experiments on the Speech Commands dataset show that the self-training WVC module and the self-supervised LGCSiam module significantly improve accuracy, especially when training on a small labeled dataset.
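A minimal sketch of a local-global contrastive objective consistent with the description above: the embedding of a short local crop is pulled towards the utterance-level (global) embedding of the same recording and pushed away from other utterances in the batch. This InfoNCE form is an assumption, not necessarily the exact LGCSiam loss.

```python
import torch
import torch.nn.functional as F

def local_global_contrastive(local_emb, global_emb, t=0.1):
    """InfoNCE between local-crop and full-utterance embeddings of the same recording."""
    l = F.normalize(local_emb, dim=-1)                    # (B, D) embeddings of short crops
    g = F.normalize(global_emb, dim=-1)                   # (B, D) embeddings of full utterances
    logits = l @ g.t() / t
    targets = torch.arange(l.size(0))                     # crop i belongs to utterance i
    return F.cross_entropy(logits, targets)

loss = local_global_contrastive(torch.randn(16, 128), torch.randn(16, 128))
```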
https://arxiv.org/abs/2303.10912
Multifold observations are common across data modalities: e.g., a 3D shape can be represented by multi-view images, and an image can be described with different captions. Existing cross-modal contrastive representation learning (XM-CLR) methods such as CLIP are not fully suitable for multifold data, as they only consider one positive pair and treat other pairs as negatives when computing the contrastive loss. In this paper, we propose MXM-CLR, a unified framework for contrastive learning of multifold cross-modal representations. MXM-CLR explicitly models and learns the relationships between multifold observations of instances from different modalities for more comprehensive representation learning. The key to MXM-CLR is a novel multifold-aware hybrid loss which considers multiple positive observations when computing the hard and soft relationships for cross-modal data pairs. We conduct quantitative and qualitative comparisons with SOTA baselines on cross-modal retrieval tasks on the Text2Shape and Flickr30K datasets. We also perform extensive evaluations on the adaptability and generalizability of MXM-CLR, as well as ablation studies on the loss design and the effect of batch size. The results show the superiority of MXM-CLR in learning better representations for multifold data. The code is available at this https URL.
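One common way to handle several positives per anchor is a soft multi-positive contrastive loss in which every observation of the same instance contributes to the numerator; the sketch below is that generic form, and MXM-CLR's hybrid hard/soft formulation may differ in its details.

```python
import torch
import torch.nn.functional as F

def multifold_contrastive(anchor_emb, other_emb, pos_mask, t=0.07):
    """Contrastive loss with multiple positives: pos_mask[i, j] = 1 when the j-th
    observation in the other modality describes the same instance as anchor i
    (e.g. one shape versus its many rendered views or captions)."""
    a = F.normalize(anchor_emb, dim=-1)
    b = F.normalize(other_emb, dim=-1)
    logp = F.log_softmax(a @ b.t() / t, dim=1)            # (Na, Nb)
    pos = pos_mask.float()
    return -(logp * pos).sum(dim=1).div(pos.sum(dim=1).clamp(min=1)).mean()

# toy usage: 4 shapes, 8 captions, captions 2k and 2k+1 both describe shape k
mask = torch.zeros(4, 8)
for k in range(4):
    mask[k, 2 * k] = mask[k, 2 * k + 1] = 1
loss = multifold_contrastive(torch.randn(4, 256), torch.randn(8, 256), mask)
```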
https://arxiv.org/abs/2303.10839
Deep neural networks have achieved promising results in automatic image captioning due to their effective representation learning and context-based content generation capabilities. As a prominent type of deep feature used in many recent image captioning methods, the well-known bottom-up features provide a detailed representation of the different objects in an image compared with feature maps extracted directly from the raw image. However, the lack of high-level semantic information about the relationships between these objects is an important drawback of bottom-up features, despite their expensive and resource-demanding extraction procedure. To take advantage of visual relationships in caption generation, this paper proposes a deep neural network architecture for image captioning that fuses the visual relationship information extracted from an image's scene graph with the spatial feature maps of the image. A multi-modal reward function is then introduced for deep reinforcement learning of the proposed network, using a combination of language and vision similarities in a common embedding space. The results of extensive experimentation on the MSCOCO dataset show the effectiveness of using visual relationships in the proposed captioning method. Moreover, the results clearly indicate that the proposed multi-modal reward in deep reinforcement learning leads to better model optimization, outperforming several state-of-the-art image captioning algorithms while using light and easy-to-extract image features. A detailed experimental study of the components constituting the proposed method is also presented.
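A hedged sketch of a multi-modal reward of the kind described: a weighted mix of the sampled caption's similarity to the reference caption (language) and to the image (vision), both measured in a common embedding space and used as the return in reinforcement learning. The mixing weight and embedding dimensionality are assumptions.

```python
import torch
import torch.nn.functional as F

def multimodal_reward(caption_emb, reference_emb, image_emb, alpha=0.5):
    """Reward a sampled caption by mixing language similarity (to the reference caption)
    and vision similarity (to the image) in a shared embedding space."""
    lang_sim = F.cosine_similarity(caption_emb, reference_emb, dim=-1)
    vis_sim = F.cosine_similarity(caption_emb, image_emb, dim=-1)
    return alpha * lang_sim + (1 - alpha) * vis_sim

# toy usage with 512-d joint-space embeddings for a batch of 4 sampled captions
r = multimodal_reward(torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
```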
https://arxiv.org/abs/2303.10766
Audio events have a hierarchical architecture in both time and frequency and can be grouped together to construct more abstract semantic audio classes. In this work, we develop a multiscale audio spectrogram Transformer (MAST) that employs hierarchical representation learning for efficient audio classification. Specifically, MAST employs one-dimensional (and two-dimensional) pooling operators along the time (and frequency) domains at different stages, and progressively reduces the number of tokens while increasing the feature dimensions. MAST significantly outperforms AST (Gong et al., 2021) by 22.2%, 4.4% and 4.7% on Kinetics-Sounds, Epic-Kitchens-100 and VGGSound in terms of top-1 accuracy without external training data. On the downloaded AudioSet dataset, which has over 20% missing audio, MAST also achieves slightly better accuracy than AST. In addition, MAST is 5x more efficient in terms of multiply-accumulates (MACs), with a 42% reduction in the number of parameters compared to AST. Through clustering metrics and visualizations, we demonstrate that the proposed MAST can learn semantically more separable feature representations from audio signals.
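A minimal sketch of one hierarchical stage as described: pooling along the token (time) axis halves the number of tokens while a projection widens the feature dimension. The kernel size and channel widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PoolingStage(nn.Module):
    """One multiscale stage: pool tokens along time, then project to a wider feature
    dimension, so the token count shrinks as the embedding grows."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=2, stride=2)
        self.proj = nn.Linear(dim_in, dim_out)

    def forward(self, tokens):                    # tokens: (B, T, C)
        x = self.pool(tokens.transpose(1, 2))     # pool over the token axis -> (B, C, T/2)
        return self.proj(x.transpose(1, 2))       # -> (B, T/2, dim_out)

# toy usage: 512 spectrogram tokens, 96 -> 192 channels after one stage
out = PoolingStage(96, 192)(torch.randn(2, 512, 96))
print(out.shape)                                  # torch.Size([2, 256, 192])
```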
https://arxiv.org/abs/2303.10757
This paper presents our Facial Action Units (AUs) recognition submission to the fifth Affective Behavior Analysis in-the-wild Competition (ABAW). Our approach consists of three main modules: (i) a pre-trained facial representation encoder, which produces a strong facial representation from each input face image in the input sequence; (ii) an AU-specific feature generator, which learns a set of AU features from each facial representation; and (iii) a spatio-temporal graph learning module, which constructs a spatio-temporal graph representation. This graph representation describes the AUs contained in all frames and predicts the occurrence of each AU based on both the modeled spatial information within the corresponding face and the learned temporal dynamics among frames. The experimental results show that our approach outperformed the baseline, and that the spatio-temporal graph representation learning allowed the model to generate the best results among all ablation systems.
https://arxiv.org/abs/2303.10644
Automatic Robotic Assembly Sequence Planning (RASP) can significantly improve productivity and resilience in modern manufacturing, along with the growing need for greater product customization. One of the main challenges in realizing such automation resides in efficiently finding solutions from a growing number of potential sequences for increasingly complex assemblies. Moreover, costly feasibility checks are always required for the robotic system. To address this, we propose a holistic graphical approach comprising a graph representation called the Assembly Graph for product assemblies and a policy architecture, the Graph Assembly Processing Network (dubbed GRACE), for assembly sequence generation. We then use GRACE to extract meaningful information from the graph input and predict assembly sequences in a step-by-step manner. In experiments, we show that our approach can predict feasible assembly sequences across product variants of aluminum profiles based on data collected in simulation of a dual-armed robotic system. We further demonstrate that our method is capable of detecting infeasible assemblies, substantially alleviating the undesirable impact of false predictions and hence facilitating real-world deployment in the near future. Code and training data will be open-sourced.
https://arxiv.org/abs/2303.10135
Barrett's Esophagus (BE) is the only precursor known to Esophageal Adenocarcinoma (EAC), a type of esophageal cancer with poor prognosis upon diagnosis. Therefore, diagnosing BE is crucial in preventing and treating esophageal cancer. While supervised machine learning supports BE diagnosis, high interobserver variability in histopathological training data limits these methods. Unsupervised representation learning via Variational Autoencoders (VAEs) shows promise, as they map input data to a lower-dimensional manifold with only useful features, characterizing BE progression for improved downstream tasks and insights. However, the VAE's Euclidean latent space distorts point relationships, hindering disease progression modeling. Geometric VAEs provide additional geometric structure to the latent space, with RHVAE assuming a Riemannian manifold and $\mathcal{S}$-VAE a hyperspherical manifold. Our study shows that $\mathcal{S}$-VAE outperforms vanilla VAE with better reconstruction losses, representation classification accuracies, and higher-quality generated images and interpolations in lower-dimensional settings. By disentangling rotation information from the latent space, we improve results further using a group-based architecture. Additionally, we take initial steps towards $\mathcal{S}$-AE, a novel autoencoder model generating qualitative images without a variational framework, but retaining benefits of autoencoders such as stability and reconstruction quality.
https://arxiv.org/abs/2303.12711
Autoencoding, which aims to reconstruct the input images through a bottleneck latent representation, is one of the classic feature representation learning strategies. It has been shown effective as an auxiliary task for semi-supervised learning but has become less popular as more sophisticated methods have been proposed in recent years. In this paper, we revisit the idea of using image reconstruction as the auxiliary task and incorporate it into a modern semi-supervised semantic segmentation framework. Surprisingly, we discover that such an old idea in semi-supervised learning can produce results competitive with state-of-the-art semantic segmentation algorithms. By visualizing the intermediate layer activations of the image reconstruction module, we show that the feature map channels can correlate well with semantic concepts, which explains why joint training with the reconstruction task is helpful for the segmentation task. Motivated by this observation, we further propose a modification to the image reconstruction task, aiming to further disentangle the object clues from the background patterns. Experimental evaluation shows that using reconstruction as an auxiliary loss leads to consistent improvements across various datasets and methods. The proposed method can further lead to significant improvement in object-centric segmentation tasks.
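A minimal sketch of the joint objective: a supervised segmentation loss on the labelled subset of the batch plus an image-reconstruction loss on every image, labelled or not. The MSE reconstruction term and the auxiliary loss weight are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def semi_sup_step(seg_logits, labels, recon, images, has_label, recon_weight=0.5):
    """Segmentation loss on labelled images + reconstruction loss on all images."""
    recon_loss = F.mse_loss(recon, images)
    if has_label.any():
        seg_loss = F.cross_entropy(seg_logits[has_label], labels[has_label], ignore_index=255)
    else:
        seg_loss = torch.zeros(())
    return seg_loss + recon_weight * recon_loss

# toy usage: batch of 4 images (first 2 labelled), 21 classes, 64x64 resolution
loss = semi_sup_step(torch.randn(4, 21, 64, 64), torch.randint(0, 21, (4, 64, 64)),
                     torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64),
                     torch.tensor([True, True, False, False]))
```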
https://arxiv.org/abs/2303.09794
Graph Neural Networks (GNNs) are the de facto solution to structural data learning. However, they are susceptible to low-quality and unreliable structures, which are the norm rather than the exception in real-world graphs. Existing graph structure learning (GSL) frameworks still lack robustness and interpretability. This paper proposes a general GSL framework, SE-GSL, built on structural entropy and the graph hierarchy abstracted in an encoding tree. In particular, we exploit the one-dimensional structural entropy to maximize the embedded information content when auxiliary neighbourhood attributes are fused to enhance the original graph. A new scheme for constructing optimal encoding trees is proposed to minimize the uncertainty and noise in the graph whilst assuring proper community partitioning in the hierarchical abstraction. We present a novel sample-based mechanism for restoring the graph structure via the node structural entropy distribution, which increases the connectivity among nodes with larger uncertainty in lower-level communities. SE-GSL is compatible with various GNN models and enhances robustness towards noisy and heterophilous structures. Extensive experiments show significant improvements in the effectiveness and robustness of structure learning and node representation learning.
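For reference, the one-dimensional structural entropy exploited above is commonly computed from node degrees as H^1(G) = -sum_i (d_i / 2m) log2(d_i / 2m); the small sketch below works under that assumption and is not tied to the SE-GSL codebase.

```python
import math

def one_dim_structural_entropy(adj):
    """H^1(G) = -sum_i (d_i / 2m) * log2(d_i / 2m), where d_i is the (weighted) degree
    of node i and 2m is the total degree of the undirected graph."""
    degrees = [sum(row) for row in adj]
    total = sum(degrees)                              # equals 2m for an undirected graph
    return -sum((d / total) * math.log2(d / total) for d in degrees if d > 0)

# toy usage: a 4-node path graph
path = [[0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0]]
print(one_dim_structural_entropy(path))              # ~1.918 bits
```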
https://arxiv.org/abs/2303.09778
Data augmentation has become a crucial component in training state-of-the-art visual representation models. However, handcrafting combinations of transformations that lead to improved performance is a laborious task, and can result in visually unrealistic samples. To overcome these limitations, recent works have explored the use of generative models as learnable data augmentation tools, showing promising results in narrow application domains, e.g., few-shot learning and low-data medical imaging. In this paper, we introduce a data augmentation module, called DA_IC-GAN, which leverages instance-conditioned GAN generations and can be used off-the-shelf in conjunction with most state-of-the-art training recipes. We showcase the benefits of DA_IC-GAN by plugging it out-of-the-box into the supervised training of ResNets and DeiT models on the ImageNet dataset, achieving accuracy boosts of up to 1-2 percentage points with the highest-capacity models. Moreover, the learnt representations are shown to be more robust than the baselines when transferred to a handful of out-of-distribution datasets, and exhibit increased invariance to variations of instance and viewpoint. We additionally couple DA_IC-GAN with a self-supervised training recipe and show that we can also achieve an improvement of 1 percentage point in accuracy in some settings. With this work, we strengthen the evidence on the potential of learnable data augmentations to improve visual representation learning, paving the road towards non-handcrafted augmentations in model training.
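A hedged sketch of how such an augmentation module could be plugged into a training loop: with some probability, each real image is replaced by a sample from a pretrained instance-conditioned generator conditioned on that image's features. The replacement probability, noise dimension, and generator interface are assumptions, and the dummy generator only stands in for a real IC-GAN.

```python
import torch

def ic_gan_augment(images, features, generator, p=0.5):
    """With probability p, swap each real image for a generated sample conditioned on
    that image's instance features; otherwise keep the real image."""
    noise = torch.randn(images.size(0), 128)
    fake = generator(noise, features)
    keep_real = (torch.rand(images.size(0)) > p).view(-1, 1, 1, 1).float()
    return keep_real * images + (1 - keep_real) * fake

# toy usage with a dummy generator that ignores the conditioning content
dummy_gen = lambda z, h: torch.randn(z.size(0), 3, 64, 64)
batch = ic_gan_augment(torch.randn(8, 3, 64, 64), torch.randn(8, 2048), dummy_gen)
```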
https://arxiv.org/abs/2303.09677