The fifth Affective Behavior Analysis in-the-wild (ABAW) competition includes several challenges, such as the Valence-Arousal Estimation Challenge, the Expression Classification Challenge, the Action Unit Detection Challenge, and the Emotional Reaction Intensity Estimation Challenge. In this paper we address only the Expression Classification Challenge, using fully supervised, semi-supervised and noisy-label approaches. Our noise-aware model outperforms the baseline model by 10.46%, our semi-supervised model outperforms it by 9.38%, and our fully supervised model outperforms it by 9.34%.
The increasing intensity and frequency of floods is one of the many consequences of our changing climate. In this work, we explore ML techniques that improve the flood detection module of an operational early flood warning system. Our method exploits an unlabelled dataset of paired multi-spectral and Synthetic Aperture Radar (SAR) imagery to reduce the labelling requirements of a purely supervised learning method. Prior works have used unlabelled data by creating weak labels from it. However, our experiments show that a model trained this way still ends up learning the mistakes in those weak labels. Motivated by knowledge distillation and semi-supervised learning, we explore the use of a teacher to train a student with the help of a small hand-labelled dataset and a large unlabelled dataset. Unlike the conventional self-distillation setup, we propose a cross-modal distillation framework that transfers supervision from a teacher trained on a richer modality (multi-spectral images) to a student model trained on SAR imagery. The trained models are then tested on the Sen1Floods11 dataset. Our model outperforms the Sen1Floods11 baseline model trained on weakly labelled SAR imagery by an absolute margin of 6.53% Intersection-over-Union (IoU) on the test split.
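The teacher-student objective described above can be illustrated with a minimal sketch. It assumes a per-pixel binary flood/no-flood setting, binary cross-entropy for both terms, and a hypothetical `weight` that balances them; the abstract does not state the actual loss, so the function names and the weighting are illustrative only.

```python
import math

def bce(p, target, eps=1e-7):
    """Binary cross-entropy between a predicted probability and a target in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

def student_loss(student_labelled, hand_labels,
                 student_unlabelled, teacher_unlabelled, weight=1.0):
    """Supervised loss on hand-labelled pixels plus a distillation loss on
    unlabelled pixels (assumed form, not taken from the paper).

    student_*: flood probabilities predicted by the SAR student
    hand_labels: 0/1 labels for the small hand-labelled set
    teacher_unlabelled: soft flood probabilities from the multi-spectral
        teacher on the paired unlabelled scenes
    """
    supervised = sum(bce(p, y) for p, y in zip(student_labelled, hand_labels))
    supervised /= len(hand_labels)
    distilled = sum(bce(p, t) for p, t in zip(student_unlabelled, teacher_unlabelled))
    distilled /= len(teacher_unlabelled)
    return supervised + weight * distilled

# The loss is lower when the student agrees with the teacher on unlabelled pixels.
agree = student_loss([0.9], [1], [0.8], [0.8])
disagree = student_loss([0.9], [1], [0.2], [0.8])
print(agree < disagree)
```

Because the teacher sees the paired multi-spectral image while the student sees only SAR, the distillation term is where supervision crosses modalities in this sketch.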
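The growing intensity and frequency of floods is one of the many consequences of our changing climate. In this study, we explore machine learning techniques to improve the flood detection module of an operational early flood warning system. We use an unlabelled dataset of paired multi-spectral and Synthetic Aperture Radar (SAR) images to reduce the labelling requirements of a purely supervised learning method. Previous work has used unlabelled data by creating weak labels from it. However, our experiments show that such a model still ends up learning the labelling errors in those weak labels. Motivated by knowledge distillation and semi-supervised learning, we explore using a teacher to help train a student with a small hand-labelled dataset and a large unlabelled dataset. Unlike the conventional self-distillation setup, we propose a cross-modal distillation framework that transfers supervision from a teacher trained on a richer modality (multi-spectral images) to a student model trained on SAR images. The trained models are then tested on the Sen1Floods11 dataset. Our model outperforms the Sen1Floods11 baseline model trained on weakly labelled SAR images by 6.53% IoU.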
Semi-supervised learning (SSL) provides an effective means of leveraging unlabelled data to improve a model's performance. Even though the domain has received a considerable amount of attention in the past years, most methods share the common drawback of being unsafe. By safeness we mean the quality of not degrading a fully supervised model when including unlabelled data. Our starting point is to notice that the estimate of the risk that most discriminative SSL methods minimise is biased, even asymptotically. This bias makes these techniques untrustworthy without a proper validation set, but we propose a simple way of removing the bias. Our debiasing approach is straightforward to implement and applicable to most deep SSL methods. We provide simple theoretical guarantees on the safeness of these modified methods, without having to rely on the strong assumptions on the data distribution that SSL theory usually requires. We evaluate debiased versions of different existing SSL methods and show that debiasing can compete with classic deep SSL techniques in various classic settings and even performs well when traditional SSL fails.
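One way such a bias can be removed is sketched below. The abstract does not give the estimator, so this is an assumed form: the unsupervised regulariser (for example, prediction entropy) is evaluated on the labelled inputs as well and subtracted with the same weight. If labelled and unlabelled inputs follow the same distribution, the two regulariser terms cancel in expectation and only the supervised risk remains. The function and parameter names are hypothetical.

```python
def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)

def debiased_ssl_risk(sup_losses, reg_on_labelled, reg_on_unlabelled, lam):
    """Assumed debiased SSL risk estimate (illustrative, not the paper's code).

    sup_losses: supervised loss for each labelled sample
    reg_on_labelled: unsupervised regulariser evaluated on each labelled input
    reg_on_unlabelled: the same regulariser on each unlabelled input
    lam: weight of the unsupervised term
    """
    biased = mean(sup_losses) + lam * mean(reg_on_unlabelled)
    correction = lam * mean(reg_on_labelled)
    return biased - correction

# When the regulariser has the same average on both sets, the estimate
# reduces to the plain supervised risk.
print(debiased_ssl_risk([1.0, 3.0], [0.5, 0.5], [0.5, 0.5, 0.5], lam=2.0))
```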
In this paper, we propose a Neural Architecture Search strategy based on self-supervision and semi-supervised learning for the task of semantic segmentation. Our approach builds an optimized neural network (NN) model for this task by jointly solving a jigsaw pretext task discovered with self-supervised learning over unlabeled training data and exploiting the structure of the unlabeled data with semi-supervised learning. The search for the architecture of the NN model is performed by dynamic routing using a gradient descent algorithm. Experiments on the Cityscapes and PASCAL VOC 2012 datasets demonstrate that the discovered neural network is more efficient than a state-of-the-art hand-crafted NN model, with four times fewer floating-point operations.
Computer-aided diagnostics often requires analysis of a region of interest (ROI) within a radiology scan, and the ROI may be an organ or a suborgan. Although deep learning algorithms have the ability to outperform other methods, they rely on the availability of a large amount of annotated data. Motivated by the need to address this limitation, an approach to localisation and detection of multiple organs based on supervised and semi-supervised learning is presented here. It draws upon previous work by the authors on localising the thoracic and lumbar spine region in CT images. The method generates six bounding boxes of organs of interest, which are then fused into a single bounding box. The results of experiments on localisation of the spleen and the left and right kidneys in CT images using supervised and semi-supervised learning (SSL) demonstrate the ability to address data limitations with a much smaller data set and fewer annotations than other state-of-the-art methods. The SSL performance was evaluated using three different mixes of labelled and unlabelled data (i.e. 30:70, 35:65, 40:60) for each of the lumbar spine, spleen, and left and right kidneys. The results indicate that SSL provides a workable alternative, especially in medical imaging, where it is difficult to obtain annotated data.
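The fusion step can be illustrated with a short sketch. The abstract does not say how the six candidate boxes are combined, so the coordinate-wise median used here is purely an assumption, chosen because it is robust to a single outlying box. Boxes are written as `(x_min, y_min, x_max, y_max)` for brevity; the same function works for tuples with more coordinates.

```python
from statistics import median

def fuse_boxes(boxes):
    """Fuse candidate bounding boxes by the coordinate-wise median
    (an assumed fusion rule, not the one used in the paper)."""
    if not boxes:
        raise ValueError("need at least one box")
    # zip(*boxes) groups all x_min values, all y_min values, and so on.
    return tuple(median(coords) for coords in zip(*boxes))

candidates = [
    (10, 20, 50, 60),
    (12, 22, 52, 62),
    (11, 21, 51, 61),
    (9, 19, 49, 59),
    (10, 20, 50, 60),
    (40, 5, 90, 30),   # outlier: barely moves the fused box
]
print(fuse_boxes(candidates))
```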
The way humans grasp objects is challenging for collaborative robots (cobots) to reproduce efficiently, intelligently and optimally. To streamline the process, we use deep learning techniques to help robots learn to generate and execute appropriate grasps quickly. We developed a Generative Inception Neural Network (GI-NNet) model capable of generating antipodal robotic grasps on seen as well as unseen objects. It is trained on the Cornell Grasping Dataset (CGD) and attains 98.87% grasp pose accuracy in detecting both regular and irregular shaped objects from RGB-Depth (RGB-D) images, while requiring only one third of the trainable network parameters of existing approaches. However, to attain this level of performance the model requires 90% of the available labelled CGD data for training, leaving only 10% for testing, which makes it vulnerable to poor generalization. Furthermore, obtaining a sufficiently large, high-quality labelled dataset is becoming increasingly difficult as networks grow ever larger. To address these issues, we attach our model as a decoder to a semi-supervised learning based architecture known as the Vector Quantized Variational Auto Encoder (VQVAE), which works efficiently when trained with both the available labelled data and unlabelled data. The proposed model, which we name Representation based GI-NNet (RGI-NNet), has been trained on CGD with various splits of labelled data, from as little as 10% up to 50%, in each case together with latent embeddings generated by the VQVAE. The grasp pose accuracy of RGI-NNet varies between 92.13% and 95.6%, which is far better than several existing models trained with only labelled data. To verify the performance of both the GI-NNet and RGI-NNet models, we use the Anukul (Baxter) hardware cobot.
This paper addresses semi-supervised semantic segmentation by exploiting a small set of images with pixel-level annotations (strong supervisions) and a large set of images with only image-level annotations (weak supervisions). Most existing approaches aim to generate accurate pixel-level labels from weak supervisions. However, we observe that those generated labels inevitably still contain noise. Motivated by this observation, we present a novel perspective and formulate this task as a problem of learning with pixel-level label noise. Existing noisy label methods, however, mainly target image-level tasks and cannot capture the relationship between neighboring labels in one image. Therefore, we propose a graph-based label noise detection and correction framework to deal with pixel-level noisy labels. In particular, for the pixel-level noisy labels generated from weak supervisions by Class Activation Map (CAM), we train a clean segmentation model with strong supervisions to detect the clean labels among these noisy labels according to the cross-entropy loss. Then, we adopt a superpixel-based graph to represent the relations of spatial adjacency and semantic similarity between pixels in one image. Finally, we correct the noisy labels using a Graph Attention Network (GAT) supervised by the detected clean labels. We conduct comprehensive experiments on the PASCAL VOC 2012, PASCAL-Context and MS-COCO datasets. The experimental results show that our proposed semi-supervised method achieves state-of-the-art performance and in some cases even outperforms fully supervised models on the PASCAL VOC 2012 and MS-COCO datasets.
Few-shot learning aims to generalize to unseen classes that appear during testing but are unavailable during training. Prototypical networks perform few-shot metric learning by constructing a class prototype in the form of a mean vector of the embedded support points within a class. The performance of prototypical networks degrades drastically in extreme few-shot scenarios (such as one-shot), mainly because variations within the clusters are neglected while constructing prototypes. In this paper, we propose to replace the typical prototypical loss function with an Episodic Triplet Mining (ETM) technique. Conventional triplet selection leads to overfitting because all possible combinations are used during training. We incorporate episodic training for mining the semi-hard positive and the semi-hard negative triplets to overcome the overfitting. We also propose an adaptation that makes use of unlabeled training samples for better modeling. Experiments on two different audio processing tasks, namely speaker recognition and audio event detection, show improved performance and hence the efficacy of ETM over the prototypical loss function and other meta-learning frameworks. Further, we show improved performance when unlabeled training samples are used.
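To make the idea of mining within an episode concrete, here is a minimal sketch. It uses the common semi-hard negative criterion, where the negative is farther from the anchor than the positive but still inside the margin; the abstract does not define ETM's exact selection rules, so the criterion, the Euclidean distance, and the `margin` value are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def semi_hard_triplets(embeddings, labels, margin=0.5):
    """Return (anchor, positive, negative) index triplets from one episode
    that satisfy d(a, p) < d(a, n) < d(a, p) + margin (assumed criterion)."""
    triplets = []
    n = len(embeddings)
    for a in range(n):
        for p in range(n):
            if p == a or labels[p] != labels[a]:
                continue  # positive must be a different sample of the same class
            d_ap = euclidean(embeddings[a], embeddings[p])
            for neg in range(n):
                if labels[neg] == labels[a]:
                    continue  # negative must come from another class
                d_an = euclidean(embeddings[a], embeddings[neg])
                if d_ap < d_an < d_ap + margin:
                    triplets.append((a, p, neg))
    return triplets

episode = [(0.0, 0.0), (1.0, 0.0), (1.2, 0.0), (5.0, 0.0)]
episode_labels = [0, 0, 1, 1]
print(semi_hard_triplets(episode, episode_labels))
```

Restricting the search to the samples of a single episode keeps the number of candidate triplets small, in contrast to enumerating every combination in the training set.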
In this paper, we present a semi-supervised deep quick learning framework for instance detection and pixel-wise semantic segmentation of images in a dense clutter of items. The framework can quickly and incrementally learn novel items in an online manner by acquiring data in real time and generating the corresponding ground truths on its own. To learn various combinations of items, it can synthesize cluttered scenes in real time. The overall approach is based on the tutor-child analogy, in which a deep network (tutor) is pretrained for class-agnostic object detection and generates labeled data for another deep network (child). The child utilizes a customized convolutional neural network head for the purpose of quick learning. The proposed framework has four key components: semi-supervised labeling, occlusion-aware clutter synthesis, a customized convolutional neural network head, and instance detection. The initial version of this framework was implemented during our participation in the Amazon Robotics Challenge (ARC), 2017. Our system was ranked 3rd, 4th and 5th worldwide in the pick, stow-pick and stow tasks, respectively. The proposed framework is an improved version of the ARC17 system, to which novel features such as instance detection and online learning have been added.
We propose a novel weakly supervised method to improve the boundaries of 3D segmented nuclei by utilizing an over-segmented image. This is motivated by the observation that current state-of-the-art deep learning methods do not produce accurate boundaries when the training data is weakly annotated. To this end, a 3D U-Net is trained to obtain the centroids of the nuclei and is integrated with a simple linear iterative clustering (SLIC) supervoxel algorithm that provides better adherence to cluster boundaries. To track these segmented nuclei, our algorithm utilizes the relative nuclei locations depicting the processes of nuclei division and apoptosis. The proposed algorithmic pipeline achieves better segmentation performance than the state-of-the-art method in the Cell Tracking Challenge (CTC) 2019 and performance comparable to state-of-the-art methods in IEEE ISBI CTC2020, while utilizing very little pixel-wise annotated data. Detailed experimental results are provided, and the source code is available on GitHub.
In this paper, we propose an original object detection methodology applied to the Global Wheat Head Detection (GWHD) Dataset. We explore two major object detection architectures, FasterRCNN and EfficientDet, in order to design a novel and robust wheat head detection model. We focus on optimizing the performance of our proposed final architectures. Furthermore, we carry out an extensive exploratory data analysis and adapt the most suitable data augmentation techniques to our context. We use semi-supervised learning to boost previous supervised object detection models. Moreover, we put considerable effort into ensembling to achieve higher performance. Finally, we use specific post-processing techniques to optimize our wheat head detection results. Our results were submitted to a research challenge launched on the GWHD Dataset, which is led by nine research institutes from seven countries. Our proposed method was ranked within the top 6% in this challenge.
We present an approach that combines two well-known fields, deep learning and semi-supervised learning (SSL), to tackle the land cover identification problem. The proposed methodology demonstrates how the performance of deep learning models is affected when SSL approaches are used as performance functions during training. The results obtained on pixel-level segmentation tasks over orthoimages suggest that SSL-enhanced loss functions can benefit model performance.
The performance of optical character recognition (OCR) systems has improved significantly in the deep learning era. This is especially true for handwritten text recognition (HTR), where each author has a unique style, unlike printed text, where the variation is smaller by design. That said, deep learning based HTR is limited, as in every other task, by the number of training examples. Gathering data is a challenging and costly task, and even more so is the labeling task that follows, on which we focus here. One possible approach to reduce the burden of data annotation is semi-supervised learning. Semi-supervised methods use some unlabeled samples in addition to labeled data to improve performance compared to fully supervised ones. Consequently, such methods may adapt to unseen images at test time. We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images that are versatile in both style and lexicon. ScrabbleGAN relies on a novel generative model that can generate images of words of arbitrary length. We show how to operate our approach in a semi-supervised manner, enjoying the aforementioned benefits, such as a performance boost over state-of-the-art supervised HTR. Furthermore, our generator can manipulate the resulting text style. This allows us to change, for instance, whether the text is cursive or how thin the pen stroke is.
In this paper, we explore various approaches for semi-supervised learning in an end-to-end automatic speech recognition (ASR) framework. The first step in our approach involves training a seed model on the limited amount of labelled data. Additional unlabelled speech data is employed through a data selection mechanism to obtain the best hypothesized output, which is then used to retrain the seed model. However, the uncertainties of the model may not be well captured with a single hypothesis. In contrast to this technique, we apply a dropout mechanism to capture the uncertainty by obtaining multiple hypothesized text transcripts of a speech recording. We assume that the diversity of automatically generated transcripts for an utterance will implicitly increase the reliability of the model. Finally, the data selection process is also applied to these hypothesized transcripts to reduce the uncertainty. Experiments on the freely available TEDLIUM corpus and Adobe's proprietary internal dataset show that the proposed approach significantly reduces ASR errors compared to the baseline model.
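A rough sketch of how data selection over several dropout hypotheses might work is given below. The abstract does not specify the selection criterion, so this example assumes one: utterances are kept when their hypotheses agree closely, measured by the mean pairwise word-level edit distance. The function names and the `threshold` value are hypothetical, and the transcripts are assumed to have already been produced by decoding the same recording several times with dropout enabled.

```python
def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def disagreement(hypotheses):
    """Mean length-normalised pairwise edit distance across hypotheses."""
    words = [h.split() for h in hypotheses]
    pairs = [(a, b) for i, a in enumerate(words) for b in words[i + 1:]]
    if not pairs:
        return 0.0
    total = sum(word_edit_distance(a, b) / max(len(a), len(b), 1) for a, b in pairs)
    return total / len(pairs)

def select_confident(utterances, threshold=0.2):
    """Keep utterance ids whose dropout hypotheses mostly agree (assumed rule)."""
    return [utt for utt, hyps in utterances.items() if disagreement(hyps) <= threshold]

decoded = {
    "utt1": ["the cat sat", "the cat sat", "the cat sat"],
    "utt2": ["hello world", "yellow word", "hollow whirl"],
}
print(select_confident(decoded))
```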
We introduce a novel deep neural network architecture that links visual regions to corresponding textual segments, including phrases and words. To accomplish this task, our architecture makes use of the rich semantic information available in a joint embedding space of multi-modal data. From this joint embedding space, we extract the associative localization maps that develop naturally, without explicitly providing supervision for the localization task during training. The joint space is learned using a bidirectional ranking objective that is optimized using an $N$-Pair loss formulation. This training mechanism demonstrates the idea that localization information is learned inherently while optimizing a Bidirectional Retrieval objective. The model's retrieval and localization performance is evaluated on the MSCOCO and Flickr30K Entities datasets. This architecture outperforms state-of-the-art results in the semi-supervised phrase localization setting.
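As an illustration of a bidirectional ranking objective built from an $N$-pair loss, the sketch below works on a precomputed similarity matrix in which entry `sim[i][j]` scores image `i` against text `j` and the diagonal holds the matching pairs. This is the standard $N$-pair form applied in both retrieval directions; how the paper computes similarities or weights the two directions is not stated in the abstract, so those details are assumptions.

```python
import math

def n_pair_loss(sim):
    """N-pair loss over a square similarity matrix, one retrieval direction.
    Row i treats sim[i][i] as the positive and the rest of the row as negatives."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        negatives = sum(math.exp(sim[i][j] - sim[i][i]) for j in range(n) if j != i)
        total += math.log(1.0 + negatives)
    return total / n

def bidirectional_n_pair_loss(sim):
    """Sum of image-to-text and text-to-image losses (equal weighting assumed)."""
    transposed = [list(column) for column in zip(*sim)]
    return n_pair_loss(sim) + n_pair_loss(transposed)

well_separated = [[10.0, 0.0], [0.0, 10.0]]
uninformative = [[0.0, 0.0], [0.0, 0.0]]
print(bidirectional_n_pair_loss(well_separated) < bidirectional_n_pair_loss(uninformative))
```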
This paper presents a novel framework for predicting shot location and type in tennis. Inspired by recent neuroscience discoveries, we incorporate neural memory modules to model the episodic and semantic memory components of a tennis player. We propose a Semi-Supervised Generative Adversarial Network architecture that couples these memory models with the automatic feature learning power of deep neural networks, and we demonstrate methodologies for learning player-level behavioural patterns with the proposed framework. We evaluate the effectiveness of the proposed model on tennis tracking data from the 2012 Australian Tennis Open and present applications of the proposed method in discovering how players adapt their style depending on the match context.
Conventional methods for visual assessment of civil infrastructure have certain limitations, such as the subjectivity of the collected data, long inspection times, and high labor costs. Although some new technologies currently in practice, such as robotic techniques, can collect objective, quantified data, the inspector's own expertise is still critical in many instances, since these technologies are not designed to work interactively with a human inspector. This study aims to create a smart, human-centered method that offers significant contributions to infrastructure inspection, maintenance, management practice, and safety for bridge owners. With a smart Mixed Reality framework that can be integrated into a wearable holographic headset device, a bridge inspector can, for example, automatically analyze a defect such as a crack that he or she sees on an element and display its dimension information in real time along with the condition state. Such systems can potentially decrease the time and cost of infrastructure inspections by accelerating essential tasks of the inspector, such as defect measurement, condition assessment and data processing to management systems. The human-centered artificial intelligence will help the inspector collect more quantified and objective data while incorporating the inspector's professional judgement. This study explains in detail the described system and the related methodologies for implementing attention-guided semi-supervised deep learning in mixed reality technology, which interacts with the human inspector during assessment. In this way, the inspector and the AI collaborate and communicate for improved visual inspection.
The level of PD-L1 expression in immunohistochemistry (IHC) assays is a key biomarker for the identification of Non-Small-Cell-Lung-Cancer (NSCLC) patients who may respond to anti-PD-1/PD-L1 treatments. The quantification of PD-L1 expression currently includes the visual estimation of a Tumor Cell (TC) score by a pathologist and consists of evaluating the ratio of PD-L1 positive and PD-L1 negative tumor cells. Known challenges, such as differences in positivity estimation around clinically relevant cut-offs and sub-optimal sample quality, make visual scoring tedious and subjective, yielding scoring variability between pathologists. In this work, we propose a novel deep learning solution that enables the first automated and objective scoring of PD-L1 expression in late-stage NSCLC needle biopsies. To account for the low amount of tissue available in biopsy images and to restrict the amount of manual annotation necessary for training, we explore the use of semi-supervised approaches against standard fully supervised methods. We consolidate the manual annotations used for training, as well as the visual TC scores used for quantitative evaluation, with multiple pathologists. Concordance measures computed on a set of slides unseen during training provide evidence that our automatic scoring method matches visual scoring on the considered dataset while ensuring repeatability and objectivity.
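For readers unfamiliar with the TC score, the small sketch below shows one way it can be computed from cell counts. It assumes the score is reported as the percentage of PD-L1 positive tumor cells among all counted tumor cells, which is a common convention but is not spelled out in the abstract; the function name and the example counts are hypothetical.

```python
def tumor_cell_score(positive_cells, negative_cells):
    """Percentage of PD-L1 positive tumor cells among all counted tumor cells
    (assumed definition of the TC score)."""
    total = positive_cells + negative_cells
    if total == 0:
        raise ValueError("no tumor cells counted")
    return 100.0 * positive_cells / total

# Hypothetical counts, e.g. from a cell detector run on one slide.
print(tumor_cell_score(30, 70))
```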
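The level of PD-L1 expression in immunohistochemistry (IHC) assays is a key biomarker for identifying non-small-cell lung cancer (NSCLC) patients who may respond to anti-PD-1/PD-L1 treatments. Quantification of PD-L1 expression currently involves a pathologist's visual estimate of a Tumor Cell (TC) score, which consists of evaluating the ratio of PD-L1 positive and PD-L1 negative tumor cells. Known challenges, such as differences in positivity estimation around clinically relevant cut-offs and sub-optimal sample quality, make visual scoring tedious and subjective, leading to scoring variability between pathologists. In this work, we propose a novel deep learning solution that enables, for the first time, automated and objective scoring of PD-L1 expression in late-stage NSCLC needle biopsies. To account for the small amount of tissue available in biopsy images and to limit the number of manual annotations needed for training, we explore the use of semi-supervised approaches against standard fully supervised methods. We consolidate the manual annotations used for training, as well as the visual TC scores used for quantitative evaluation, with multiple pathologists. Concordance measures computed on a set of slides unseen during training provide evidence that our automatic scoring method matches visual scoring on the considered dataset while ensuring repeatability and objectivity.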