Image quality assessment (IQA) plays a critical role in optimizing radiation dose and developing novel medical imaging techniques in computed tomography (CT). Traditional IQA methods relying on hand-crafted features have limitations in summarizing the subjective perceptual experience of image quality. Recent deep learning-based approaches have demonstrated strong modeling capabilities and potential for medical IQA, but challenges remain regarding model generalization and perceptual accuracy. In this work, we propose a multi-scale distributions regression approach to predict quality scores by constraining the output distribution, thereby improving model generalization. Furthermore, we design a dual-branch alignment network to enhance feature extraction capabilities. Additionally, semi-supervised learning is introduced by utilizing pseudo-labels for unlabeled data to guide model training. Extensive qualitative experiments demonstrate the effectiveness of our proposed method for advancing the state-of-the-art in deep learning-based medical IQA. Code is available at: this https URL.
Intentionally luring readers to click on a particular content by exploiting their curiosity defines a title as clickbait. Although several studies focused on detecting clickbait titles in English articles, low resource language like Bangla has not been given adequate attention. To tackle clickbait titles in Bangla, we have constructed the first Bangla clickbait detection dataset containing 15,056 labeled news articles and 65,406 unlabelled news articles extracted from clickbait dense news sites. Each article has been labeled by three expert linguists and includes an article's title, body, and other metadata. By incorporating labeled and unlabelled data, we finetune a pretrained Bangla transformer model in an adversarial fashion using Semi Supervised Generative Adversarial Networks (SS GANs). The proposed model acts as a good baseline for this dataset, outperforming traditional neural network models (LSTM, GRU, CNN) and linguistic feature based models. We expect that this dataset and the detailed analysis and comparison of these clickbait detection models will provide a fundamental basis for future research into detecting clickbait titles in Bengali articles. We have released the corresponding code and dataset.
Surgical instrument segmentation is recognised as a key enabler to provide advanced surgical assistance and improve computer assisted interventions. In this work, we propose SegMatch, a semi supervised learning method to reduce the need for expensive annotation for laparoscopic and robotic surgical images. SegMatch builds on FixMatch, a widespread semi supervised classification pipeline combining consistency regularization and pseudo labelling, and adapts it for the purpose of segmentation. In our proposed SegMatch, the unlabelled images are weakly augmented and fed into the segmentation model to generate a pseudo-label to enforce the unsupervised loss against the output of the model for the adversarial augmented image on the pixels with a high confidence score. Our adaptation for segmentation tasks includes carefully considering the equivariance and invariance properties of the augmentation functions we rely on. To increase the relevance of our augmentations, we depart from using only handcrafted augmentations and introduce a trainable adversarial augmentation strategy. Our algorithm was evaluated on the MICCAI Instrument Segmentation Challenge datasets Robust-MIS 2019 and EndoVis 2017. Our results demonstrate that adding unlabelled data for training purposes allows us to surpass the performance of fully supervised approaches which are limited by the availability of training data in these challenges. SegMatch also outperforms a range of state-of-the-art semi-supervised learning semantic segmentation models in different labelled to unlabelled data ratios.
surgical instrument segmentation 被认为是提供高级手术辅助和提高计算机辅助干预的关键工具。在这项工作中,我们提出了 segMatch 半监督学习方法,以降低对Laparoscopic 和机器人手术图像的昂贵标记的依赖性。SegMatch 建立在 fixMatch 普遍使用的半监督分类管道,结合一致性 Regularization 和伪标签,并适应于分割任务。在我们提出的 segMatch 中,未标记的图像进行弱增强,并输入到分割模型,生成伪标签,以强制模型输出对具有高信心值像素的无监督损失与模型的输出。我们对分割任务的调整包括仔细考虑我们依赖的增强函数的等温性和不变性性质。为了增加我们的增强函数的相关性,我们离开了仅使用手工增强和引入训练可增强的对抗增强策略。我们的方法在 MICCAI 设备分割挑战数据集 Robust-MIS 2019 和EndoVis 2017 中进行评估。我们的结果表明,为训练目的添加未标记数据可以使我们超过这些挑战中完全监督方法的性能限制。此外, segMatch 在不同类型的标记数据比例下表现出比当前最先进的半监督学习语义分割模型更好的性能。
We propose a novel semi-supervised active learning (SSAL) framework for monocular 3D object detection with LiDAR guidance (MonoLiG), which leverages all modalities of collected data during model development. We utilize LiDAR to guide the data selection and training of monocular 3D detectors without introducing any overhead in the inference phase. During training, we leverage the LiDAR teacher, monocular student cross-modal framework from semi-supervised learning to distill information from unlabeled data as pseudo-labels. To handle the differences in sensor characteristics, we propose a data noise-based weighting mechanism to reduce the effect of propagating noise from LiDAR modality to monocular. For selecting which samples to label to improve the model performance, we propose a sensor consistency-based selection score that is also coherent with the training objective. Extensive experimental results on KITTI and Waymo datasets verify the effectiveness of our proposed framework. In particular, our selection strategy consistently outperforms state-of-the-art active learning baselines, yielding up to 17% better saving rate in labeling costs. Our training strategy attains the top place in KITTI 3D and birds-eye-view (BEV) monocular object detection official benchmarks by improving the BEV Average Precision (AP) by 2.02.
我们提出了一种新的半监督主动学习(SSAL)框架,以单眼3D物体检测为例,利用LiDAR guidance(单眼LiG),在模型开发过程中利用所有数据模式。我们利用LiDAR指导选择和训练单眼3D探测器,在推理阶段没有引入任何额外的负担。在训练过程中,我们利用半监督学习的LiDAR教师和单眼学生跨modal框架,从半监督学习中提取信息,将其作为伪标签进行舍入。为了处理传感器特性的差异,我们提出了一种数据噪声加权机制,以减少LiDAR模式向单眼模式的传播噪声的影响。为了选择哪些样本进行标注以改善模型性能,我们提出了一种传感器一致性选择得分,也与训练目标相一致。在KITTI和Waymo数据集上的广泛实验结果验证了我们提出的框架的有效性。特别是,我们的选择策略 consistently outperforms state-of-the-art主动学习基准线,在标签成本方面提高了17%的节省率。我们的训练策略通过提高KITTI3D和 birds-eye-view(BEV)单眼物体检测官方基准线的AP值,达到了最高排名。
We approached the goal of applying meta-learning to self-supervised masked autoencoders for spatiotemporal learning in three steps. Broadly, we seek to understand the impact of applying meta-learning to existing state-of-the-art representation learning architectures. Thus, we test spatiotemporal learning through: a meta-learning architecture only, a representation learning architecture only, and an architecture applying representation learning alongside a meta learning architecture. We utilize the Memory Augmented Neural Network (MANN) architecture to apply meta-learning to our framework. Specifically, we first experiment with applying a pre-trained MAE and fine-tuning on our small-scale spatiotemporal dataset for video reconstruction tasks. Next, we experiment with training an MAE encoder and applying a classification head for action classification tasks. Finally, we experiment with applying a pre-trained MAE and fine-tune with MANN backbone for action classification tasks.
The fifth Affective Behavior Analysis in-the-wild (ABAW) competition has multiple challenges such as Valence-Arousal Estimation Challenge, Expression Classification Challenge, Action Unit Detection Challenge, Emotional Reaction Intensity Estimation Challenge. In this paper we have dealt only expression classification challenge using multiple approaches such as fully supervised, semi-supervised and noisy label approach. Our approach using noise aware model has performed better than baseline model by 10.46% and semi supervised model has performed better than baseline model by 9.38% and the fully supervised model has performed better than the baseline by 9.34%
The increasing intensity and frequency of floods is one of the many consequences of our changing climate. In this work, we explore ML techniques that improve the flood detection module of an operational early flood warning system. Our method exploits an unlabelled dataset of paired multi-spectral and Synthetic Aperture Radar (SAR) imagery to reduce the labeling requirements of a purely supervised learning method. Prior works have used unlabelled data by creating weak labels out of them. However, from our experiments we noticed that such a model still ends up learning the label mistakes in those weak labels. Motivated by knowledge distillation and semi supervised learning, we explore the use of a teacher to train a student with the help of a small hand labelled dataset and a large unlabelled dataset. Unlike the conventional self distillation setup, we propose a cross modal distillation framework that transfers supervision from a teacher trained on richer modality (multi-spectral images) to a student model trained on SAR imagery. The trained models are then tested on the Sen1Floods11 dataset. Our model outperforms the Sen1Floods11 baseline model trained on the weak labeled SAR imagery by an absolute margin of 6.53% Intersection-over-Union (IoU) on the test split.
洪水的强度和频率的增加是我们气候变化的许多后果之一。在这个研究中,我们探讨了机器学习技术,以提高 operational early flood warning system 中的洪水检测模块。我们利用一个未标记的配对多光谱和合成孔径雷达图像的未命名数据集,以降低纯粹的监督学习方法的标记要求。以前的工作已经使用未标记数据,从它们中创建弱标签。然而,从我们的实验中我们发现,这样的模型仍然最终学习这些弱标签的标记错误。基于知识蒸馏和半监督学习的动机,我们探讨了使用一名教师帮助训练学生的方法,使用一个小手标注的数据集和一个大型未标注的数据集。与传统的自蒸馏setup不同,我们提出了一种跨modal蒸馏框架,将监督从训练丰富的modality(多光谱图像)转移到训练SAR图像的学生模型中。训练模型后,在Sen1Floods11数据集上进行了测试。我们的模型在弱标签SAR图像上的标记错误训练 Sen1Floods11 基线模型的相对误差6.53%的IoU上表现出色。
Semi supervised learning (SSL) provides an effective means of leveraging unlabelled data to improve a model's performance. Even though the domain has received a considerable amount of attention in the past years, most methods present the common drawback of being unsafe. By safeness we mean the quality of not degrading a fully supervised model when including unlabelled data. Our starting point is to notice that the estimate of the risk that most discriminative SSL methods minimise is biased, even asymptotically. This bias makes these techniques untrustable without a proper validation set, but we propose a simple way of removing the bias. Our debiasing approach is straightforward to implement, and applicable to most deep SSL methods. We provide simple theoretical guarantees on the safeness of these modified methods, without having to rely on the strong assumptions on the data distribution that SSL theory usually requires. We evaluate debiased versions of different existing SSL methods and show that debiasing can compete with classic deep SSL techniques in various classic settings and even performs well when traditional SSL fails.
In this paper, we propose a Neural Architecture Search strategy based on self supervision and semi-supervised learning for the task of semantic segmentation. Our approach builds an optimized neural network (NN) model for this task by jointly solving a jigsaw pretext task discovered with self-supervised learning over unlabeled training data, and, exploiting the structure of the unlabeled data with semi-supervised learning. The search of the architecture of the NN model is performed by dynamic routing using a gradient descent algorithm. Experiments on the Cityscapes and PASCAL VOC 2012 datasets demonstrate that the discovered neural network is more efficient than a state-of-the-art hand-crafted NN model with four times less floating operations.
Computer aided diagnostics often requires analysis of a region of interest (ROI) within a radiology scan, and the ROI may be an organ or a suborgan. Although deep learning algorithms have the ability to outperform other methods, they rely on the availability of a large amount of annotated data. Motivated by the need to address this limitation, an approach to localisation and detection of multiple organs based on supervised and semi-supervised learning is presented here. It draws upon previous work by the authors on localising the thoracic and lumbar spine region in CT images. The method generates six bounding boxes of organs of interest, which are then fused to a single bounding box. The results of experiments on localisation of the Spleen, Left and Right Kidneys in CT Images using supervised and semi supervised learning (SSL) demonstrate the ability to address data limitations with a much smaller data set and fewer annotations, compared to other state-of-the-art methods. The SSL performance was evaluated using three different mixes of labelled and unlabelled data (i.e.30:70,35:65,40:60) for each of lumbar spine, spleen left and right kidneys respectively. The results indicate that SSL provides a workable alternative especially in medical imaging where it is difficult to obtain annotated data.
Our way of grasping objects is challenging for efficient, intelligent and optimal grasp by COBOTs. To streamline the process, here we use deep learning techniques to help robots learn to generate and execute appropriate grasps quickly. We developed a Generative Inception Neural Network (GI-NNet) model, capable of generating antipodal robotic grasps on seen as well as unseen objects. It is trained on Cornell Grasping Dataset (CGD) and attained 98.87% grasp pose accuracy for detecting both regular and irregular shaped objects from RGB-Depth (RGB-D) images while requiring only one third of the network trainable parameters as compared to the existing approaches. However, to attain this level of performance the model requires the entire 90% of the available labelled data of CGD keeping only 10% labelled data for testing which makes it vulnerable to poor generalization. Furthermore, getting sufficient and quality labelled dataset is becoming increasingly difficult keeping in pace with the requirement of gigantic networks. To address these issues, we attach our model as a decoder with a semi-supervised learning based architecture known as Vector Quantized Variational Auto Encoder (VQVAE), which works efficiently when trained both with the available labelled and unlabelled data. The proposed model, which we name as Representation based GI-NNet (RGI-NNet), has been trained with various splits of label data on CGD with as minimum as 10% labelled dataset together with latent embedding generated from VQVAE up to 50% labelled data with latent embedding obtained from VQVAE. The performance level, in terms of grasp pose accuracy of RGI-NNet, varies between 92.13% to 95.6% which is far better than several existing models trained with only labelled dataset. For the performance verification of both GI-NNet and RGI-NNet models, we use Anukul (Baxter) hardware cobot.
This paper addresses semi-supervised semantic segmentation by exploiting a small set of images with pixel-level annotations (strong supervisions) and a large set of images with only image-level annotations (weak supervisions). Most existing approaches aim to generate accurate pixel-level labels from weak supervisions. However, we observe that those generated labels still inevitably contain noisy labels. Motivated by this observation, we present a novel perspective and formulate this task as a problem of learning with pixel-level label noise. Existing noisy label methods, nevertheless, mainly aim at image-level tasks, which can not capture the relationship between neighboring labels in one image. Therefore, we propose a graph based label noise detection and correction framework to deal with pixel-level noisy labels. In particular, for the generated pixel-level noisy labels from weak supervisions by Class Activation Map (CAM), we train a clean segmentation model with strong supervisions to detect the clean labels from these noisy labels according to the cross-entropy loss. Then, we adopt a superpixel-based graph to represent the relations of spatial adjacency and semantic similarity between pixels in one image. Finally we correct the noisy labels using a Graph Attention Network (GAT) supervised by detected clean labels. We comprehensively conduct experiments on PASCAL VOC 2012, PASCAL-Context and MS-COCO datasets. The experimental results show that our proposed semi supervised method achieves the state-of-the-art performances and even outperforms the fully-supervised models on PASCAL VOC 2012 and MS-COCO datasets in some cases.
Few-shot learning aims to generalize unseen classes that appear during testing but are unavailable during training. Prototypical networks incorporate few-shot metric learning, by constructing a class prototype in the form of a mean vector of the embedded support points within a class. The performance of prototypical networks in extreme few-shot scenarios (like one-shot) degrades drastically, mainly due to the desuetude of variations within the clusters while constructing prototypes. In this paper, we propose to replace the typical prototypical loss function with an Episodic Triplet Mining (ETM) technique. The conventional triplet selection leads to overfitting, because of all possible combinations being used during training. We incorporate episodic training for mining the semi hard positive and the semi hard negative triplets to overcome the overfitting. We also propose an adaptation to make use of unlabeled training samples for better modeling. Experimenting on two different audio processing tasks, namely speaker recognition and audio event detection; show improved performances and hence the efficacy of ETM over the prototypical loss function and other meta-learning frameworks. Further, we show improved performances when unlabeled training samples are used.
In this paper, we present a semi supervised deep quick learning framework for instance detection and pixel-wise semantic segmentation of images in a dense clutter of items. The framework can quickly and incrementally learn novel items in an online manner by real-time data acquisition and generating corresponding ground truths on its own. To learn various combinations of items, it can synthesize cluttered scenes, in real time. The overall approach is based on the tutor-child analogy in which a deep network (tutor) is pretrained for class-agnostic object detection which generates labeled data for another deep network (child). The child utilizes a customized convolutional neural network head for the purpose of quick learning. There are broadly four key components of the proposed framework semi supervised labeling, occlusion aware clutter synthesis, a customized convolutional neural network head, and instance detection. The initial version of this framework was implemented during our participation in Amazon Robotics Challenge (ARC), 2017. Our system was ranked 3rd, 4th and 5th worldwide in pick, stow-pick and stow task respectively. The proposed framework is an improved version over ARC17 where novel features such as instance detection and online learning has been added.
We propose a novel weakly supervised method to improve the boundary of the 3D segmented nuclei utilizing an over-segmented image. This is motivated by the observation that current state-of-the-art deep learning methods do not result in accurate boundaries when the training data is weakly annotated. Towards this, a 3D U-Net is trained to get the centroid of the nuclei and integrated with a simple linear iterative clustering (SLIC) supervoxel algorithm that provides better adherence to cluster boundaries. To track these segmented nuclei, our algorithm utilizes the relative nuclei location depicting the processes of nuclei division and apoptosis. The proposed algorithmic pipeline achieves better segmentation performance compared to the state-of-the-art method in Cell Tracking Challenge (CTC) 2019 and comparable performance to state-of-the-art methods in IEEE ISBI CTC2020 while utilizing very few pixel-wise annotated data. Detailed experimental results are provided, and the source code is available on GitHub.
In this paper, we propose an original object detection methodology applied to Global Wheat Head Detection (GWHD) Dataset. We have been through two major architectures of object detection which are FasterRCNN and EfficientDet, in order to design a novel and robust wheat head detection model. We emphasize on optimizing the performance of our proposed final architectures. Furthermore, we have been through an extensive exploratory data analysis and adapted best data augmentation techniques to our context. We use semi supervised learning to boost previous supervised models of object detection. Moreover, we put much effort on ensemble to achieve higher performance. Finally we use specific post-processing techniques to optimize our wheat head detection results. Our results have been submitted to solve a research challenge launched on the GWHD Dataset which is led by nine research institutes from seven countries. Our proposed method was ranked within the top 6% in the above mentioned challenge.
A combinatory approach of two well-known fields: deep learning and semi supervised learning is presented, to tackle the land cover identification problem. The proposed methodology demonstrates the impact on the performance of deep learning models, when SSL approaches are used as performance functions during training. Obtained results, at pixel level segmentation tasks over orthoimages, suggest that SSL enhanced loss functions can be beneficial in models' performance.
Optical character recognition (OCR) systems performance have improved significantly in the deep learning era. This is especially true for handwritten text recognition (HTR), where each author has a unique style, unlike printed text, where the variation is smaller by design. That said, deep learning based HTR is limited, as in every other task, by the number of training examples. Gathering data is a challenging and costly task, and even more so, the labeling task that follows, of which we focus here. One possible approach to reduce the burden of data annotation is semi-supervised learning. Semi supervised methods use, in addition to labeled data, some unlabeled samples to improve performance, compared to fully supervised ones. Consequently, such methods may adapt to unseen images during test time. We present ScrabbleGAN, a semi-supervised approach to synthesize handwritten text images that are versatile both in style and lexicon. ScrabbleGAN relies on a novel generative model which can generate images of words with an arbitrary length. We show how to operate our approach in a semi-supervised manner, enjoying the aforementioned benefits such as performance boost over state of the art supervised HTR. Furthermore, our generator can manipulate the resulting text style. This allows us to change, for instance, whether the text is cursive, or how thin is the pen stroke.
In this paper, we explore various approaches for semi supervised learning in an end to end automatic speech recognition (ASR) framework. The first step in our approach involves training a seed model on the limited amount of labelled data. Additional unlabelled speech data is employed through a data selection mechanism to obtain the best hypothesized output, further used to retrain the seed model. However, uncertainties of the model may not be well captured with a single hypothesis. As opposed to this technique, we apply a dropout mechanism to capture the uncertainty by obtaining multiple hypothesized text transcripts of an speech recording. We assume that the diversity of automatically generated transcripts for an utterance will implicitly increase the reliability of the model. Finally, the data selection process is also applied on these hypothesized transcripts to reduce the uncertainty. Experiments on freely available TEDLIUM corpus and proprietary Adobe's internal dataset show that the proposed approach significantly reduces ASR errors, compared to the baseline model.
We introduce a novel deep neural network architecture that links visual regions to corresponding textual segments including phrases and words. To accomplish this task, our architecture makes use of the rich semantic information available in a joint embedding space of multi-modal data. From this joint embedding space, we extract the associative localization maps that develop naturally, without explicitly providing supervision during training for the localization task. The joint space is learned using a bidirectional ranking objective that is optimized using a $N$-Pair loss formulation. This training mechanism demonstrates the idea that localization information is learned inherently while optimizing a Bidirectional Retrieval objective. The model's retrieval and localization performance is evaluated on MSCOCO and Flickr30K Entities datasets. This architecture outperforms the state of the art results in the semi-supervised phrase localization setting.