How to make a pizza: Learning a compositional layer-based GAN model

2019-06-06 23:22:31
Dim P. Papadopoulos, Youssef Tamaazousti, Ferda Ofli, Ingmar Weber, Antonio Torralba

Abstract

A food recipe is an ordered set of instructions for preparing a particular dish. From a visual perspective, every instruction step can be seen as a way to change the visual appearance of the dish by adding extra objects (e.g., adding an ingredient) or changing the appearance of the existing ones (e.g., cooking the dish). In this paper, we aim to teach a machine how to make a pizza by building a generative model that mirrors this step-by-step procedure. To do so, we learn composable module operations that can either add or remove a particular ingredient. Each operator is designed as a Generative Adversarial Network (GAN). Given only weak image-level supervision, the operators are trained to generate a visual layer that needs to be added to or removed from the existing image. The proposed model is able to decompose an image into an ordered sequence of layers by sequentially applying the corresponding removing modules in the right order. Experimental results on synthetic and real pizza images demonstrate that our proposed model is able to: (1) segment pizza toppings in a weakly supervised fashion, (2) remove them by revealing what is occluded underneath them (i.e., inpainting), and (3) infer the ordering of the toppings without any depth ordering supervision. Code, data, and models are available online.
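
The layer-based composition described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: in the paper each operator is a GAN that generates the layer from the image, whereas here every layer is assumed to be a hand-made appearance image plus a mask, and the names `add_layer`, `compose`, and `visible_masks` are hypothetical.

```python
import numpy as np

def add_layer(image, appearance, mask):
    """Composite one topping layer: mask * appearance + (1 - mask) * image."""
    m = mask[..., None]  # broadcast the (H, W) mask over the colour channels
    return m * appearance + (1.0 - m) * image

def compose(base, layers):
    """Apply layers in order; later layers occlude earlier ones."""
    image = base.copy()
    for appearance, mask in layers:
        image = add_layer(image, appearance, mask)
    return image

def visible_masks(layers):
    """Part of each layer's mask that stays visible under the layers above it."""
    visible = []
    occluded = np.zeros_like(layers[0][1])
    for _, mask in reversed(layers):  # walk from the top layer down
        visible.append(mask * (1.0 - occluded))
        occluded = np.maximum(occluded, mask)
    return visible[::-1]

# Toy 4x4 "pizza": a grey base and two overlapping toppings (assumed data).
base = np.full((4, 4, 3), 0.5)

pepperoni_mask = np.zeros((4, 4))
pepperoni_mask[0:2, 0:2] = 1.0
pepperoni = (np.broadcast_to([1.0, 0.0, 0.0], (4, 4, 3)), pepperoni_mask)

olive_mask = np.zeros((4, 4))
olive_mask[1:3, 1:3] = 1.0
olive = (np.broadcast_to([0.0, 1.0, 0.0], (4, 4, 3)), olive_mask)

layers = [pepperoni, olive]             # olive is added last, so it is on top
full = compose(base, layers)            # image with both toppings
without_olive = compose(base, layers[:-1])  # "removing" the top layer
segmentation = visible_masks(layers)    # per-topping visible regions
```

In this toy setting, removing the top layer amounts to recompositing without it, which reveals the pepperoni pixel that the olive occluded. The model in the paper has to learn that inpainting and the layer order from image-level labels alone.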

URL

https://arxiv.org/abs/1906.02839

PDF

https://arxiv.org/pdf/1906.02839.pdf

