This paper proposes an artificial-intelligence-based classification method for diagnosing diabetic retinopathy (DR) from retinal images. Its core is a new data augmentation method, GreenBen, which first extracts the green-channel grayscale image from the retinal image and then applies Ben enhancement. Because diabetic macular edema (DME) is a complication closely related to DR, the paper builds a joint DR and DME classification framework based on multi-task learning and an attention module, and uses GreenBen to augment its data, reducing the variation among DR images and improving classification accuracy. We conducted extensive experiments on three publicly available datasets, and our method achieved the best results. Whether the backbone is ResNet50 or Swin Transformer, and whether DR is classified alone or jointly with DME, GreenBen delivered stable and significant improvements in DR classification over other data augmentation methods, with an accuracy increase of 10%.
https://arxiv.org/abs/2410.09444
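The GreenBen idea from the abstract above (green channel first, then Ben enhancement) can be sketched in a few lines. This is only an illustration under stated assumptions: "Ben enhancement" is taken here to mean the commonly used local-average subtraction, gain * (image - Gaussian-blurred image) + offset; the gain of 4, the offset of 128, the sigma value, the (R, G, B) pixel order, and all function names are hypothetical choices, not details confirmed by the paper.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma and normalised to sum to 1."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(img, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [
        [sum(k * row[min(max(x + j - radius, 0), width - 1)]
             for j, k in enumerate(kernel))
         for x in range(width)]
        for row in img
    ]

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_blur(img, sigma):
    """Separable Gaussian blur: blur the rows, then the columns."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))

def green_ben(rgb, sigma=2.0, gain=4.0, offset=128.0):
    """rgb is a 2-D list of (R, G, B) tuples; returns a 2-D list of 0..255 ints."""
    # Step 1: keep only the green channel as a grayscale image.
    green = [[float(px[1]) for px in row] for row in rgb]
    # Step 2 (assumed form of Ben enhancement): subtract the local average
    # and re-centre the result around mid-gray.
    blurred = gaussian_blur(green, sigma)
    return [
        [min(255, max(0, round(gain * (g - b) + offset)))
         for g, b in zip(g_row, b_row)]
        for g_row, b_row in zip(green, blurred)
    ]
```

With this formulation a perfectly uniform image maps to mid-gray everywhere, while pixels brighter than their neighbourhood are pushed above mid-gray, which is the contrast-normalising effect the augmentation relies on.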
Pretrained language models like BERT and T5 serve as crucial backbone encoders for dense retrieval. However, these models often exhibit limited generalization capabilities and face challenges in improving in-domain accuracy. Recent research has explored using large language models (LLMs) as retrievers, achieving SOTA performance across various tasks. Despite these advancements, the specific benefits of LLMs over traditional retrievers, and the impact of different LLM configurations (such as parameter size, pretraining duration, and alignment process) on retrieval tasks, remain unclear. In this work, we conduct a comprehensive empirical study on a wide range of retrieval tasks, covering in-domain accuracy, data efficiency, zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning. We evaluate over 15 different backbone LLMs and non-LLMs. Our findings reveal that larger models and extensive pretraining consistently enhance in-domain accuracy and data efficiency. Additionally, larger models demonstrate significant potential in zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning. These results underscore the advantages of LLMs as versatile and effective backbone encoders in dense retrieval, providing valuable insights for future research and development in this field.
https://arxiv.org/abs/2408.12194
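For readers unfamiliar with dense retrieval, the role of the backbone encoder in the abstract above can be illustrated with a toy ranking step. The sketch assumes the encoder has already mapped the query and each passage to a vector; the vectors, passage ids, and function names below are made up for illustration, and cosine similarity is one common scoring choice rather than the one the study necessarily uses.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_passages(query_vec, passage_vecs, top_k=2):
    """Return the ids of the top_k passages most similar to the query."""
    scored = [(cosine(query_vec, vec), pid) for pid, vec in passage_vecs.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in scored[:top_k]]

# Hypothetical embeddings standing in for encoder outputs.
passages = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1]}
top = rank_passages([1.0, 0.0], passages)
```

Swapping the backbone (BERT, T5, or an LLM) changes only how the vectors are produced; the ranking step stays the same.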
This paper describes the winning solution of all 5 tasks for the Amazon KDD Cup 2024 Multi-Task Online Shopping Challenge for LLMs. The challenge was to build a useful assistant that answers questions in the domain of online shopping. The competition contained 57 diverse tasks, covering 5 different task types (e.g. multiple choice) across 4 different tracks (e.g. multilingual). Our solution is a single model per track. We fine-tune Qwen2-72B-Instruct on our own training dataset. As the competition released only 96 example questions, we developed our own training dataset by processing multiple public datasets or using large language models for data augmentation and synthetic data generation. We apply wise-ft to account for distribution shifts and ensemble multiple LoRA adapters in one model. We employ logits processors to constrain the model output to the tokens relevant for each task. AWQ 4-bit quantization and vLLM are used during inference to predict the test dataset within the time constraints of 20 to 140 minutes, depending on the track. Our solution achieved first place in each individual track and first place overall in Amazon's KDD Cup 2024.
https://arxiv.org/abs/2408.04658
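Two of the techniques named in the abstract above are simple enough to sketch. wise-ft interpolates in weight space between the pretrained and the fine-tuned parameters, and a logits processor can restrict generation to task-relevant tokens by masking everything else. The code below is a minimal illustration on plain Python lists; the function names, the mixing coefficient alpha, and the toy token ids are assumptions, not the authors' implementation.

```python
def wise_ft(pretrained, finetuned, alpha=0.5):
    """Weight-space interpolation: (1 - alpha) * pretrained + alpha * finetuned.

    Both arguments map parameter names to flat lists of floats.
    """
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [(1 - alpha) * p + alpha * f
               for p, f in zip(pretrained[name], finetuned[name])]
        for name in pretrained
    }

def constrain_logits(logits, allowed_token_ids):
    """Set the logit of every token outside allowed_token_ids to -inf."""
    allowed = set(allowed_token_ids)
    return [x if i in allowed else float("-inf") for i, x in enumerate(logits)]

# Toy example: token 1 has the highest raw score but is not an allowed answer.
merged = wise_ft({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.25)
masked = constrain_logits([1.0, 5.0, 3.0], allowed_token_ids=[0, 2])
best_token = max(range(len(masked)), key=masked.__getitem__)
```

For a multiple-choice task, the allowed ids would typically be the tokens of the answer options, so the model cannot emit anything outside the expected answer format.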
Parkinson's disease is easy to diagnose when it is advanced, but very difficult to diagnose in its early stages. Early diagnosis is essential for treating the symptoms. The disease affects daily activities and reduces the quality of life of both patients and their families, and it is the second most prevalent neurodegenerative disorder after Alzheimer's disease in people over the age of 60. Most current studies on the prediction of Parkinson's severity are carried out in advanced stages of the disease. This work analyzes a set of variables that can be easily extracted from voice analysis, making it a highly non-intrusive technique. A method based on different deep learning techniques is proposed with two purposes: on the one hand, to find out whether a person has severe or non-severe Parkinson's disease, and on the other hand, to determine by means of regression techniques the degree of evolution of the disease in a given patient. The UPDRS (Unified Parkinson's Disease Rating Scale) is used, taking into account both the motor and total labels. The best results are obtained with a mixed multi-layer perceptron (MLP) that classifies and regresses at the same time and takes as input the most important features of the data, extracted with an autoencoder. A success rate of 99.15% is achieved in predicting whether a person suffers from severe or non-severe Parkinson's disease. For predicting the degree of disease involvement, an MSE (Mean Squared Error) of 0.15 is obtained. Using a full deep learning pipeline for data preprocessing and classification has proven to be very promising in the field of Parkinson's disease, outperforming state-of-the-art proposals.
https://arxiv.org/abs/2402.05491
Source-free test-time adaptation for medical image segmentation aims to enhance the adaptability of segmentation models to diverse and previously unseen test sets of the target domain, which contributes to the generalizability and robustness of medical image segmentation models without access to the source domain. Ensuring consistency between target edges and paired inputs is crucial for test-time adaptation. To improve the performance of test-time domain adaptation, we propose a multi-task consistency-guided source-free test-time domain adaptation method for medical image segmentation, which ensures the consistency of the local boundary predictions and the global prototype representation. Specifically, we introduce a local boundary consistency constraint that explores the relationship between the tissue region segmentation and tissue boundary localization tasks. Additionally, we propose a global feature consistency constraint to enhance intra-class compactness. We conduct extensive experiments on the segmentation of benchmark fundus images. Compared to direct prediction by the source domain model, the segmentation Dice score is improved by 6.27% and 0.96% on the RIM-ONE-r3 and Drishti GS datasets, respectively. Additionally, the experimental results demonstrate that our proposed method outperforms existing competitive domain adaptation segmentation algorithms.
https://arxiv.org/abs/2310.11766
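The Dice score used as the segmentation metric in the abstract above is defined as 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a ground-truth mask B. The following minimal sketch computes it on flattened binary masks; the function name and the convention of returning 1.0 when both masks are empty are illustrative choices.

```python
def dice_score(pred, target):
    """Dice coefficient of two flattened binary masks (lists of 0/1)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# One overlapping pixel, two predicted pixels, one ground-truth pixel.
score = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

In this toy case the score is 2 * 1 / (2 + 1), that is, two thirds.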
Logical reasoning is fundamental for humans yet presents a substantial challenge in the domain of Artificial Intelligence. Initially, researchers used Knowledge Representation and Reasoning (KR) systems that did not scale and required non-trivial manual effort. Recently, large language models (LLMs) have demonstrated the ability to overcome various limitations of formal KR systems. Consequently, there is a growing interest in using LLMs for logical reasoning via natural language. This work strives to understand the proficiency of LLMs in logical reasoning by offering a brief review of the latest progress in this area, with a focus on the logical reasoning datasets, tasks, and the methods adopted to utilize LLMs for reasoning. To offer a thorough analysis, we have compiled a benchmark titled LogiGLUE. It includes 24 varied datasets encompassing deductive, abductive, and inductive reasoning. We have standardized these datasets into Seq2Seq tasks to facilitate straightforward training and evaluation for future research. Using LogiGLUE as a foundation, we have trained an instruction fine-tuned language model, resulting in LogiT5. We study single-task training, multi-task training, and a chain-of-thought knowledge distillation fine-tuning technique to assess the performance of the model across the different logical reasoning categories. Through this comprehensive process, we aim to shed light on the capabilities of LLMs and potential pathways for enhancing their logical reasoning proficiency, paving the way for more advanced and nuanced developments in this critical field.
https://arxiv.org/abs/2310.00836
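Standardizing heterogeneous reasoning datasets into Seq2Seq tasks, as the abstract above describes, usually means flattening each example into one input string and one target string. The sketch below shows one plausible way to do this; the field names (context, question, options, answer) and the output template are hypothetical and are not taken from LogiGLUE itself.

```python
def to_seq2seq(example):
    """Flatten one reasoning example into an (input, target) text pair.

    `example` is a dict with hypothetical keys: "context", "question",
    an optional list "options", and "answer".
    """
    parts = [
        f"context: {example['context']}",
        f"question: {example['question']}",
    ]
    options = example.get("options")
    if options:
        # Multiple-choice datasets list their candidate answers in the input.
        parts.append("options: " + " | ".join(options))
    return {"input": " ".join(parts), "target": str(example["answer"])}

pair = to_seq2seq({
    "context": "All birds can fly. Tweety is a bird.",
    "question": "Can Tweety fly?",
    "options": ["yes", "no"],
    "answer": "yes",
})
```

Once every dataset is in this shape, the same training and evaluation loop can be reused for single-task and multi-task runs.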
Instruction tuning of language models has demonstrated the ability to enhance model generalization to unseen tasks via in-context learning using a few examples. However, typical supervised learning still requires a plethora of downstream training data for finetuning. In real-world situations, the data available for finetuning is often scarce, falling somewhere between few-shot inference and fully supervised finetuning. In this work, we demonstrate the sample efficiency of instruction-tuned models over various tasks by estimating the minimal downstream training data they require to perform transfer learning and match the performance of state-of-the-art (SOTA) supervised models. We conduct experiments on 119 tasks from Super Natural Instructions (SuperNI) in both the single-task learning (STL) and multi-task learning (MTL) settings. Our findings reveal that, in the STL setting, instruction-tuned models equipped with 25% of the downstream training data surpass the SOTA performance on the downstream tasks. In the MTL setting, an instruction-tuned model trained on only 6% of the downstream training data achieves SOTA, while using 100% of the training data results in an improvement of 3.69 percentage points (ROUGE-L 74.68) over the previous SOTA. We analyze T5 vs. Tk-Instruct by developing several baselines to demonstrate that instruction tuning aids in increasing both sample efficiency and transfer learning. Additionally, we observe a consistent ~4% performance increase in both settings when pre-finetuning is performed with instructions. Finally, we conduct a categorical study and find that, contrary to previous results, tasks in the question rewriting and title generation categories suffer from instruction tuning.
https://arxiv.org/abs/2306.05539
3D reconstruction is a useful tool for surgical planning and guidance. However, the lack of available medical data stunts research and development in this field, as supervised deep learning methods for accurate disparity estimation rely heavily on large datasets containing ground truth information. Alternative approaches to supervision have been explored, such as self-supervision, which can reduce or remove entirely the need for ground truth. However, no proposed alternatives have demonstrated performance capabilities close to what would be expected from a supervised setup. This work aims to alleviate this issue. In this paper, we investigate the learning of structured light projections to enhance the development of direct disparity estimation networks. We show for the first time that it is possible to accurately learn the projection of structured light on a scene, implicitly learning disparity. Secondly, we explore the use of a multi-task learning (MTL) framework for the joint training of structured light and disparity. We present results which show that MTL with structured light improves disparity training without increasing the number of model parameters. Our MTL setup outperformed the single-task learning (STL) network in every validation test. Notably, in the medical generalisation test, the STL error was 1.4 times worse than that of the best MTL performance. The benefit of using MTL is emphasised when the training data is limited. A dataset containing stereoscopic images, disparity maps and structured light projections on medical phantoms and ex vivo tissue was created for evaluation together with virtual scenes. This dataset will be made publicly available in the future.
https://arxiv.org/abs/2301.08140
Perceiving the surrounding environment is essential for enabling autonomous or assisted driving functionalities. Common tasks in this domain include detecting road users, determining lane boundaries, and classifying driving conditions. Over the last few years, a large variety of powerful deep learning models have been proposed to address individual tasks of camera-based automotive perception with astonishing performance. However, the limited capabilities of in-vehicle embedded computing platforms cannot cope with the computational effort required to run a heavy model for each individual task. In this work, we present CERBERUS (CEnteR Based End-to-end peRception Using a Single model), a lightweight model that leverages a multi-task learning approach to enable the execution of multiple perception tasks at the cost of a single inference. The code will be made publicly available.
https://arxiv.org/abs/2210.00756
The count of mitotic cells is a key feature in tumor diagnosis. However, due to the variability of mitotic cell morphology, detecting mitotic cells in tumor tissues is a highly challenging task. At the same time, although advanced deep learning methods have achieved great success in cell detection, their performance is often unsatisfactory when tested on data from another domain (e.g. different tumor types and different scanners). Therefore, it is necessary to develop algorithms that detect mitotic cells robustly under domain shift. Our work adds a foreground detection task and a tumor classification task to the baseline (RetinaNet), and uses data augmentation to improve the domain generalization performance of our model. We achieve state-of-the-art performance (F1 score: 0.5809) on the challenging preliminary test dataset.
https://arxiv.org/abs/2208.12657
Massively multilingual Transformer-based language models have been observed to be surprisingly effective at zero-shot transfer across languages, though the performance varies from language to language depending on the pivot language(s) used for fine-tuning. In this work, we build upon some of the existing techniques for predicting the zero-shot performance on a task by modeling it as a multi-task learning problem. We jointly train predictive models for different tasks, which helps us build more accurate predictors for tasks where we have test data in very few languages to measure the actual performance of the model. Our approach also allows us to perform a much more robust feature selection and identify a common set of features that influence zero-shot performance across a variety of tasks.
https://arxiv.org/abs/2205.06130
Owing to the collection of big data and the development of deep learning, research on predicting human emotions in the wild is being actively conducted. We designed a multi-task model using the ABAW dataset to predict valence-arousal, expression, and action units from audio data and face images captured in the real world. We trained the model from incomplete labels by applying the knowledge distillation technique. The teacher model was trained in a supervised manner, and the student model was trained using the output of the teacher model as a soft label. As a result, we achieved a score of 2.40 on the validation dataset of the Multi-Task Learning task.
https://arxiv.org/abs/2203.13072
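The teacher-student setup in the abstract above, where the student is trained on the teacher's output as a soft label, is commonly implemented as a cross-entropy between the two softened distributions. The following is a minimal sketch of that loss for a single example; the temperature value and function names are assumptions, and the paper may combine this term with other losses.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student against the teacher's soft label."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

teacher_logits = [2.0, 0.0, -1.0]
loss_agree = soft_label_loss([2.0, 0.0, -1.0], teacher_logits)
loss_disagree = soft_label_loss([-1.0, 0.0, 2.0], teacher_logits)
```

Because the loss needs only the teacher's prediction, it can be evaluated on samples whose ground-truth label for a given task is missing, which is what makes it useful for incompletely labeled data.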
Retail item data contains many different forms of text, such as the title of an item, the description of an item, the item name, and reviews. It is of interest to identify the item name in the other forms of text using a named entity tagger. However, the title of an item and its description are syntactically different (but semantically similar), in that the title is not necessarily a well-formed sentence while the description is made up of well-formed sentences. In this work, we use a triplet loss to contrast the embeddings of the item title with the description to establish a proof of concept. We find that using the triplet loss in a multi-task NER algorithm improves both precision and recall by a small percentage. While the improvement is small, we think it is a step in the right direction of using various forms of text in a multi-task algorithm. In addition to precision and recall, the multi-task triplet loss method is also found to significantly improve the exact match accuracy, i.e. the accuracy of tagging the entire set of tokens in the text with the correct tags.
https://arxiv.org/abs/2109.13736
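The triplet loss mentioned in the abstract above has the standard form max(0, d(anchor, positive) - d(anchor, negative) + margin). The sketch below assumes, purely for illustration, that the anchor is a title embedding, the positive is the embedding of the same item's description, and the negative is the description of a different item; the abstract does not spell out this assignment, and the Euclidean distance and margin value are likewise assumptions.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss with Euclidean distance."""
    d_pos = math.dist(anchor, positive)   # title vs. its own description (assumed)
    d_neg = math.dist(anchor, negative)   # title vs. another item's description (assumed)
    return max(0.0, d_pos - d_neg + margin)

title = [0.0, 0.0]
own_description = [0.0, 1.0]
# Far-away negative: the margin is already satisfied, so the loss is zero.
easy = triplet_loss(title, own_description, [3.0, 4.0])
# Nearby negative: the margin is violated, so the loss is positive.
hard = triplet_loss(title, own_description, [0.0, 1.5])
```

In a multi-task setup this term would be added to the NER tagging loss, so the shared encoder is pushed to place a title close to its matching description.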
Dynamic balancing under uncertain disturbances is important for a humanoid robot and requires a strong capability to coordinate whole-body redundancy while executing multiple tasks. Whole-body control (WBC) based on hierarchical optimization has been widely accepted and used in torque-controlled robots. A good hierarchy is a prerequisite for WBC and can be predefined according to prior knowledge. However, given the computational complexity of WBC, real-time computation can be problematic in physical applications. For robots with proprioceptive actuation, joint friction in the gear reducers also degrades torque-tracking performance. In this paper, a reasonable hierarchy of tasks and constraints is first customized for robot dynamic balancing. A real-time WBC is then implemented with computationally efficient WBC software. The method runs on UBTMaster, a modular master control system characterized by real-time communication and powerful computing capability. After the joint friction is accounted for through model identification, extensive experiments on various balancing scenarios are conducted on Walker3, a humanoid with proprioceptive actuation. The robot shows outstanding balance performance even under external impulses and when its two feet are independently subjected to inclination and shift disturbances. The results demonstrate that, with a strict hierarchy, real-time computation, and careful handling of joint friction, a robot with proprioceptive actuation can manage dynamic physical interactions with unstructured environments well.
https://arxiv.org/abs/2108.03826
Progress in Computer Aided Diagnosis (CADx) for Wireless Capsule Endoscopy (WCE) is thwarted by the lack of data. The shortage of data that richly represent healthy and abnormal conditions leads to isolated analyses of pathologies that cannot handle realistic multi-pathology scenarios. In this work, we explore how to learn more for free from limited data by solving a WCE multicentric, multi-pathology classification problem. Learning more here means learning more than full supervision would allow with the same data. This is done by combining self-supervision with full supervision under multi-task learning. Additionally, we draw inspiration from the Human Visual System (HVS) in designing self-supervision tasks and investigate whether seemingly ineffectual signals within the data itself can be exploited to gain performance and, if so, which signals are better than others. Further, we present our analysis of the high-level features as a stepping stone towards more robust multi-pathology CADx in WCE.
https://arxiv.org/abs/2106.16162
The promising performance of Deep Neural Networks (DNNs) in text classification has attracted researchers to use them for fraud review detection. However, the lack of trusted labeled data has limited the performance of current solutions in detecting fraud reviews. The Generative Adversarial Network (GAN), as a semi-supervised method, has been shown to be effective for data augmentation. State-of-the-art solutions use GANs to overcome the data scarcity problem. However, they fail to incorporate behavioral clues in fraud generation. Additionally, state-of-the-art approaches overlook possible bot-generated reviews in the dataset. Finally, they also suffer from a common limitation in the scalability and stability of the GAN, which slows down the training procedure. In this work, we propose ScoreGAN for fraud review detection, which makes use of both review text and review rating scores in the generation and detection process. Scores are incorporated into the loss function through Information Gain Maximization (IGM) for three reasons. First, it allows the generator to produce score-correlated reviews based on the scores it is given. Second, the generated reviews are used to train the discriminator, so the discriminator can correctly label possible bot-generated reviews through joint representations learned from the concatenation of the Global Vectors for Word Representation (GloVe) extracted from the text and the score. Finally, it can be used to improve the stability and scalability of the GAN. Results show that the proposed framework outperformed the existing state-of-the-art framework, FakeGAN, in terms of AP by 7% and 5% on the Yelp and TripAdvisor datasets, respectively.
https://arxiv.org/abs/2006.06561
We are interested in learning models of non-stationary environments, which can be framed as a multi-task learning problem. Model-free reinforcement learning algorithms can achieve good asymptotic performance in multi-task learning, but at the cost of extensive sampling, because they learn from scratch. While model-based approaches are among the most data-efficient learning algorithms, they still struggle with complex tasks and model uncertainties. Meta-reinforcement learning addresses the efficiency and generalization challenges of multi-task learning by quickly leveraging the meta-prior policy for a new task. In this paper, we propose a meta-reinforcement learning approach that learns the dynamic model of a non-stationary environment, to be used later for meta-policy optimization. Owing to the sample efficiency of model-based learning methods, we are able to train the meta-model of the non-stationary environment and the meta-policy simultaneously until the dynamic model converges. The meta-learned dynamic model of the environment then generates simulated data for meta-policy optimization. Our experiment demonstrates that the proposed method can meta-learn the policy in a non-stationary environment with the data efficiency of model-based learning approaches while achieving the high asymptotic performance of model-free meta-reinforcement learning.
https://arxiv.org/abs/2011.10714
Deep speaker embeddings have become the leading method for encoding speaker identity in speaker recognition tasks. The embedding space should ideally capture the variations between all possible speakers, encoding the multiple aspects that make up speaker identity. In this work, using speaker age as an auxiliary variable in US Supreme Court recordings and speaker nationality with VoxCeleb, we show that leveraging additional speaker attribute information in a multi-task learning setting can increase deep speaker embedding performance on verification and diarization tasks, achieving relative improvements of 17.8% in DER and 8.9% in EER for Supreme Court audio compared to omitting the auxiliary task. Experimental code has been made publicly available.
https://arxiv.org/abs/2010.14269
Deep neural networks (DNNs) can forget the knowledge about earlier tasks when learning new tasks, which is known as catastrophic forgetting. While recent continual learning methods are capable of alleviating the catastrophic forgetting problem on toy-sized datasets, some issues remain to be tackled when applying them to real-world problems. Recently, fast mask-based learning methods (e.g. Piggyback; Mallya et al., 2018) have been proposed to address these issues by quickly learning only a binary element-wise mask while keeping the backbone model fixed. However, the binary mask has limited modeling capacity for new tasks. A more recent work (Hung et al., 2019) proposes a compress-grow-based method (CPG) that achieves better accuracy on new tasks by partially training the backbone model, but at an order-of-magnitude higher training cost, which makes it infeasible to deploy in popular state-of-the-art edge and mobile learning settings. The primary goal of this work is to simultaneously achieve fast and high-accuracy multi-task adaptation in the continual learning setting. Thus motivated, we propose a new training method called kernel-wise Soft Mask (KSM), which learns a kernel-wise hybrid binary and real-valued soft mask for each task while using the same backbone model. Such a soft mask can be viewed as a superposition of a binary mask and a properly scaled real-valued tensor, which offers a richer representation capability without requiring low-level kernel support, meeting the objective of low hardware overhead. We validate KSM on multiple benchmark datasets against recent state-of-the-art methods (e.g. Piggyback, PackNet, CPG), and it shows good improvement in both accuracy and training cost.
https://arxiv.org/abs/2009.05668
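The phrase "a superposition of a binary mask and a properly scaled real-value tensor" in the abstract above can be made concrete with a small sketch. Everything beyond that phrase is an assumption: the scale value, the one-scalar-per-kernel layout, and the function names are illustrative, and the paper's actual parameterisation and training procedure may differ.

```python
def soft_mask(binary_mask, real_values, scale=0.1):
    """Per-kernel soft mask: binary part plus a scaled real-valued part."""
    return [b + scale * r for b, r in zip(binary_mask, real_values)]

def apply_kernel_mask(kernels, mask):
    """Multiply every weight of kernel k by mask[k].

    The backbone kernels are left untouched; a new list is returned, so
    each task can keep its own mask over the same shared backbone.
    """
    return [[m * w for w in kernel] for kernel, m in zip(kernels, mask)]

backbone = [[1.0, 2.0], [3.0, 4.0]]          # two toy kernels, flattened
task_mask = soft_mask([1, 0], [0.5, -1.0], scale=0.5)
task_kernels = apply_kernel_mask(backbone, task_mask)
```

A purely binary mask could only keep or drop the second kernel; the real-valued part lets the task rescale it instead, which is the extra capacity the abstract refers to.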
The attributes of object contours are of great significance for the instance segmentation task. However, most popular deep neural networks pay little attention to target edge information. Inspired by the human annotation process used when building instance segmentation datasets, we propose Mask Point RCNN, which aims to increase the network's attention to target edge information and strengthens information propagation between multiple tasks by using features of different attributes. Specifically, we add an auxiliary task to Mask RCNN that uses keypoint detection to construct the target edge contour, and we enhance the sensitivity of the network to object edges through multi-task learning and feature fusion. These improvements are easy to implement and add only a small amount of computing overhead. In extensive evaluations on the Cityscapes dataset, our approach outperforms vanilla Mask RCNN by 5.4 on the validation subset and 5.0 on the test subset.
https://arxiv.org/abs/2008.00460