Designing complex 3D scenes has been a tedious, manual process requiring domain expertise. Emerging text-to-3D generative models show great promise for making this task more intuitive, but existing approaches are limited to object-level generation. We introduce \textbf{locally conditioned diffusion} as an approach to compositional scene diffusion, providing control over semantic parts using text prompts and bounding boxes while ensuring seamless transitions between these parts. We demonstrate a score distillation sampling--based text-to-3D synthesis pipeline that enables compositional 3D scene generation at a higher fidelity than relevant baselines.
https://arxiv.org/abs/2303.12218
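A minimal sketch of how such locally conditioned denoising could look, assuming a generic pretrained denoiser and text encoder (the names `denoiser` and `encode_prompt` are placeholders, not the authors' API):

```python
import torch

def locally_conditioned_step(x_t, t, prompts, masks, denoiser, encode_prompt):
    """x_t: (B, C, H, W) noisy sample; prompts: list of strings; masks: list of
    (B, 1, H, W) binary masks that partition the canvas, one per prompt.
    `denoiser` and `encode_prompt` stand in for a pretrained diffusion model."""
    assert len(prompts) == len(masks)
    eps = torch.zeros_like(x_t)
    for prompt, mask in zip(prompts, masks):
        cond = encode_prompt(prompt)        # text embedding for this region
        eps_i = denoiser(x_t, t, cond)      # per-prompt noise prediction
        eps = eps + mask * eps_i            # keep the prediction only inside its region
    return eps                              # fed into the usual DDPM/DDIM update
```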
Image synthesis driven by computer graphics has recently achieved remarkable realism, yet synthetic image data generated this way exhibits a significant domain gap with respect to real-world data. This is especially true in autonomous driving scenarios, where this gap is a critical obstacle to using synthetic data for training neural networks. We propose a method based on a domain-invariant scene representation to directly synthesize traffic scene imagery without rendering. Specifically, we rely on synthetic scene graphs as our internal representation and introduce an unsupervised neural network architecture for realistic traffic scene synthesis. We enhance synthetic scene graphs with spatial information about the scene and demonstrate the effectiveness of our approach through scene manipulation.
https://arxiv.org/abs/2303.08473
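A rough illustration of a scene graph enriched with spatial attributes; the classes and fields below are hypothetical, not the paper's schema:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneNode:
    category: str                                  # e.g. "car", "pedestrian", "road"
    bbox: Tuple[float, float, float, float]        # normalized (x, y, w, h); layout is assumed
    attributes: dict = field(default_factory=dict)

@dataclass
class SceneEdge:
    src: int                                       # index into SceneGraph.nodes
    dst: int
    relation: str                                  # e.g. "left_of", "behind", "on"

@dataclass
class SceneGraph:
    nodes: List[SceneNode]
    edges: List[SceneEdge]

graph = SceneGraph(
    nodes=[SceneNode("road", (0.0, 0.5, 1.0, 0.5)),
           SceneNode("car", (0.4, 0.6, 0.2, 0.15))],
    edges=[SceneEdge(src=1, dst=0, relation="on")],
)
```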
We consider the task of generating realistic 3D shapes, which is useful for a variety of applications such as automatic scene generation and physical simulation. Compared to other 3D representations like voxels and point clouds, meshes are more desirable in practice, because (1) they enable easy and arbitrary manipulation of shapes for relighting and simulation, and (2) they can fully leverage the power of modern graphics pipelines which are mostly optimized for meshes. Previous scalable methods for generating meshes typically rely on sub-optimal post-processing, and they tend to produce overly-smooth or noisy surfaces without fine-grained geometric details. To overcome these shortcomings, we take advantage of the graph structure of meshes and use a simple yet very effective generative modeling method to generate 3D meshes. Specifically, we represent meshes with deformable tetrahedral grids, and then train a diffusion model on this direct parametrization. We demonstrate the effectiveness of our model on multiple generative tasks.
https://arxiv.org/abs/2303.08133
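A hedged sketch of training diffusion directly on a grid parametrization of a mesh; the resolution, channel layout, and noise schedule are illustrative choices, not the paper's configuration:

```python
import torch

res = 32                                  # illustrative grid resolution
num_vertices = res ** 3

# 4 channels per vertex: signed distance + 3D vertex displacement
x0 = torch.randn(num_vertices, 4)         # stand-in for a clean training sample

def q_sample(x0, t, alphas_cumprod):
    """Standard forward diffusion applied directly to the grid parametrization."""
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].sqrt()
    s = (1 - alphas_cumprod[t]).sqrt()
    return a * x0 + s * noise, noise

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x_t, noise = q_sample(x0, t=500, alphas_cumprod=alphas_cumprod)
# A denoising network would be trained to predict `noise` from (x_t, t); the final
# mesh is then extracted from the denoised grid (e.g. via marching tetrahedra).
```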
Indoor scene synthesis involves automatically picking and placing furniture appropriately on a floor plan, so that the scene looks realistic and is functionally plausible. Such scenes can serve as a home for immersive 3D experiences, or be used to train embodied agents. Existing methods for this task rely on labeled categories of furniture, e.g. bed, chair or table, to generate contextually relevant combinations of furniture. Whether heuristic or learned, these methods ignore instance-level attributes of objects such as color and style, and as a result may produce visually less coherent scenes. In this paper, we introduce an auto-regressive scene model which can output instance-level predictions, making use of a general-purpose image embedding based on CLIP. This allows us to learn visual correspondences such as matching color and style, and produce more plausible and aesthetically pleasing scenes. Evaluated on the 3D-FRONT dataset, our model achieves SOTA results in scene generation and improves auto-completion metrics by over 50%. Moreover, our embedding-based approach enables zero-shot text-guided scene generation and editing, which easily generalizes to furniture not seen at training time.
https://arxiv.org/abs/2303.03565
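A small sketch of the embedding-based idea using OpenAI's CLIP text encoder to rank candidate furniture against a free-form query (the paper uses image embeddings; the captions and query here are made up):

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

candidates = ["a dark green velvet armchair",          # hypothetical catalog captions
              "a white plastic office chair",
              "a mid-century walnut dining chair"]
query = "a cozy reading corner with warm, natural materials"

with torch.no_grad():
    cand = model.encode_text(clip.tokenize(candidates).to(device))
    q = model.encode_text(clip.tokenize([query]).to(device))
    cand = cand / cand.norm(dim=-1, keepdim=True)
    q = q / q.norm(dim=-1, keepdim=True)
    scores = (q @ cand.T).squeeze(0)                   # cosine similarity per candidate

print(candidates[scores.argmax().item()])
```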
3D indoor scenes are widely used in computer graphics, with applications ranging from interior design to gaming to virtual and augmented reality. They also contain rich information, including room layout as well as furniture type, geometry, and placement. High-quality 3D indoor scenes are in high demand, yet designing them manually requires expertise and is time-consuming. Existing research addresses only partial problems: some works learn to generate room layouts, while others focus on generating the detailed structure and geometry of individual furniture objects. However, these partial steps are related and should be addressed together for optimal synthesis. We propose SCENEHGN, a hierarchical graph network for 3D indoor scenes that takes into account the full hierarchy from the room level to the object level and finally to the object part level. Our method is therefore, for the first time, able to directly generate plausible 3D room content, including furniture objects with fine-grained geometry, together with their layout. To address the challenge, we introduce functional regions as intermediate proxies between the room and object levels to make learning more manageable. To ensure plausibility, our graph-based representation incorporates both vertical edges connecting child nodes with parent nodes across levels, and horizontal edges encoding relationships between nodes at the same level. Extensive experiments demonstrate that our method produces superior generation results, even when comparing the results of individual steps against alternative methods that address only those steps. We also demonstrate that our method is effective for various applications such as part-level room editing, room interpolation, and room generation from arbitrary room boundaries.
https://arxiv.org/abs/2302.10237
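An illustrative data structure for the room/region/object/part hierarchy with vertical and horizontal edges; names and levels are assumptions, not SCENEHGN's exact schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    level: str                       # "room" | "region" | "object" | "part"
    label: str                       # e.g. "bedroom", "sleeping_area", "bed", "headboard"
    parent: Optional["Node"] = None  # vertical edge (child -> parent)
    children: List["Node"] = field(default_factory=list)   # vertical edges (parent -> child)
    neighbors: List["Node"] = field(default_factory=list)  # horizontal, same-level edges

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

room = Node("room", "bedroom")
area = room.add_child(Node("region", "sleeping_area"))
bed = area.add_child(Node("object", "bed"))
stand = area.add_child(Node("object", "nightstand"))
bed.neighbors.append(stand)                  # horizontal edge at the object level
bed.add_child(Node("part", "headboard"))
```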
In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes, which synthesizes large-scale 3D landscapes from random noise. Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage the knowledge from 2D images. Our framework starts from an efficient bird's-eye-view (BEV) representation generated from simplex noise, which consists of a height field and a semantic field. The height field represents the surface elevation of 3D scenes, while the semantic field provides detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Furthermore, we propose a novel generative neural hash grid to parameterize the latent space given 3D positions and the scene semantics, which aims to encode generalizable features across scenes. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and superiority over state-of-the-art methods in generating vivid yet diverse unbounded 3D worlds.
https://arxiv.org/abs/2302.01330
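A toy version of the BEV representation: a height field from fractal noise (standing in for the simplex noise used in the paper) thresholded into a semantic field; all thresholds and labels are illustrative:

```python
import numpy as np

def fractal_noise(size, octaves=4, seed=0):
    """Multi-octave value noise as a simple stand-in for simplex noise."""
    rng = np.random.default_rng(seed)
    out = np.zeros((size, size))
    for o in range(octaves):
        res = 2 ** (o + 2)
        coarse = rng.standard_normal((res, res))
        idx = np.linspace(0, res - 1, size)
        xi, yi = np.meshgrid(idx, idx)
        x0, y0 = np.floor(xi).astype(int), np.floor(yi).astype(int)
        x1, y1 = np.minimum(x0 + 1, res - 1), np.minimum(y0 + 1, res - 1)
        fx, fy = xi - x0, yi - y0
        up = (coarse[y0, x0] * (1 - fx) * (1 - fy) + coarse[y0, x1] * fx * (1 - fy)
              + coarse[y1, x0] * (1 - fx) * fy + coarse[y1, x1] * fx * fy)
        out += up / (2 ** o)                 # bilinearly upsampled octave, halved amplitude
    return out

height = fractal_noise(256)                  # surface elevation over the BEV grid
semantic = np.select([height < -0.3, height < 0.4, height >= 0.4],
                     [0, 1, 2])              # 0: water, 1: grassland, 2: rock/snow (made up)
```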
We propose a method for text-driven perpetual view generation -- synthesizing long videos of arbitrary scenes solely from an input text describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To achieve 3D consistency, i.e., generating videos that depict geometrically-plausible scenes, we deploy online test-time training to encourage the predicted depth map of the current frame to be geometrically consistent with the synthesized scene; the depth maps are used to construct a unified mesh representation of the scene, which is updated throughout the generation and is used for rendering. In contrast to previous works, which are applicable only to limited domains (e.g., landscapes), our framework generates diverse scenes, such as walkthroughs in spaceships, caves, or ice castles. Project page: this https URL
https://arxiv.org/abs/2302.01133
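A hedged sketch of the geometric bookkeeping behind such pipelines: unprojecting each frame's predicted depth into world space and accumulating it (the paper maintains a unified mesh; a point set is used here for brevity):

```python
import numpy as np

def unproject(depth, K, cam_to_world):
    """depth: (H, W); K: 3x3 intrinsics; cam_to_world: 4x4 pose. Returns (H*W, 3) points."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)   # homogeneous pixels
    rays = (np.linalg.inv(K) @ pix.T).T                               # camera-space rays
    pts_cam = rays * depth.reshape(-1, 1)                             # scale by depth
    pts_h = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (cam_to_world @ pts_h.T).T[:, :3]                          # world-space points

world_points = []   # grows as frames are generated, e.g.
# world_points.append(unproject(predicted_depth, K, pose_t))          # names are hypothetical
```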
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions. Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency by querying a Text-to-Video (T2V) diffusion-based model. The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment. MAV3D does not require any 3D or 4D data and the T2V model is trained only on Text-Image pairs and unlabeled videos. We demonstrate the effectiveness of our approach using comprehensive quantitative and qualitative experiments and show an improvement over previously established internal baselines. To the best of our knowledge, our method is the first to generate 3D dynamic scenes given a text description.
https://arxiv.org/abs/2301.11280
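A minimal stand-in for a 4D radiance field: an MLP mapping (x, y, z, t) to density and color, which is the kind of representation a T2V-guided score distillation loop would optimize; the layer sizes are arbitrary:

```python
import torch
import torch.nn as nn

class Tiny4DField(nn.Module):
    """Maps a space-time sample (x, y, z, t) to density and color."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),            # (density, r, g, b)
        )

    def forward(self, xyzt):
        out = self.net(xyzt)
        sigma = torch.relu(out[..., :1])     # non-negative density
        rgb = torch.sigmoid(out[..., 1:])    # colors in [0, 1]
        return sigma, rgb

field = Tiny4DField()
sigma, rgb = field(torch.rand(1024, 4))      # random space-time samples
```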
Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor-intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement into a "scanner" of the 3D world. Intuitively, human movement indicates the free space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts that are consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research at this https URL.
https://arxiv.org/abs/2212.04360
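A toy example of how motion can be turned into scene evidence: marking free space where the body moved and contact where the pelvis stays low; the joint indexing, thresholds, and grid size are assumptions:

```python
import numpy as np

def motion_to_maps(joints, grid=64, extent=6.0, sit_height=0.6):
    """joints: (T, J, 3) world-space joint positions in meters, y-up (assumed layout)."""
    free = np.zeros((grid, grid), dtype=bool)
    contact = np.zeros((grid, grid), dtype=bool)
    to_cell = lambda p: int(np.clip((p / extent + 0.5) * grid, 0, grid - 1))
    for t in range(joints.shape[0]):
        pelvis = joints[t, 0]                    # assume joint 0 is the pelvis
        cx, cz = to_cell(pelvis[0]), to_cell(pelvis[2])
        free[cz, cx] = True                      # the body passed through this floor cell
        if pelvis[1] < sit_height:               # a low pelvis suggests sitting support here
            contact[cz, cx] = True
    return free, contact

free, contact = motion_to_maps(np.random.rand(100, 22, 3) * 2.0)
# A scene model can then be constrained to keep `free` cells empty and to place
# sittable furniture under `contact` cells.
```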
For low-level computer vision and image processing ML tasks, training on large datasets is critical for generalization. However, the standard practice of relying on real-world images primarily from the Internet comes with image quality, scalability, and privacy issues, especially in commercial contexts. To address this, we have developed a procedural synthetic data generation pipeline and dataset tailored to low-level vision tasks. Our Unreal engine-based synthetic data pipeline populates large scenes algorithmically with a combination of random 3D objects, materials, and geometric transformations. Then, we calibrate the camera noise profiles to synthesize the noisy images. From this pipeline, we generated a fully synthetic image denoising dataset (FSID) which consists of 175,000 noisy/clean image pairs. We then trained and validated a CNN-based denoising model, and demonstrated that the model trained on this synthetic data alone can achieve competitive denoising results when evaluated on real-world noisy images captured with smartphone cameras.
https://arxiv.org/abs/2212.03961
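A common way to synthesize calibrated-looking sensor noise is a shot (Poisson) plus read (Gaussian) model; the sketch below uses illustrative gain and read-noise values, not the paper's calibrated profiles:

```python
import numpy as np

def add_sensor_noise(clean, gain=0.01, read_std=0.002, rng=None):
    """clean: float image in [0, 1], assumed linear (pre-gamma). Gain/read values are made up."""
    rng = rng or np.random.default_rng()
    photons = rng.poisson(clean / gain)              # photon shot noise
    noisy = photons * gain + rng.normal(0.0, read_std, clean.shape)   # plus read noise
    return np.clip(noisy, 0.0, 1.0)

clean = np.random.rand(256, 256, 3).astype(np.float32)
noisy = add_sensor_noise(clean)                      # one noisy/clean training pair
```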
Zero-shot learning (ZSL) on 3D point cloud data is a relatively underexplored problem compared to its 2D image counterpart. 3D data brings new challenges for ZSL due to the unavailability of robust pre-trained feature extraction models. To address this problem, we propose a prompt-guided 3D scene generation and supervision method that augments 3D data to train the network better, exploring the complex interplay of seen and unseen objects. First, we merge point clouds of two 3D models in certain ways described by a prompt. The prompt acts like the annotation describing each 3D scene. Later, we perform contrastive learning to train our proposed architecture in an end-to-end manner. We argue that 3D scenes can relate objects more efficiently than single objects because popular language models (like BERT) achieve high performance when objects appear in context. Our proposed prompt-guided scene generation method encapsulates data augmentation and prompt-based annotation/captioning to improve 3D ZSL performance. We achieve state-of-the-art ZSL and generalized ZSL performance on synthetic (ModelNet40, ModelNet10) and real-scanned (ScanObjectNN) 3D object datasets.
https://arxiv.org/abs/2209.14690
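A toy version of prompt-guided scene composition: merging two object point clouds according to a spatial template and emitting the prompt as the annotation; the offsets and templates are made up:

```python
import numpy as np

def compose_scene(pc_a, pc_b, name_a, name_b, relation="beside"):
    """pc_a, pc_b: (N, 3) point clouds centered at the origin, y-up (assumed)."""
    if relation == "beside":
        offset = np.array([1.5, 0.0, 0.0])
    elif relation == "on_top_of":
        offset = np.array([0.0, pc_a[:, 1].max() - pc_b[:, 1].min(), 0.0])
    else:
        raise ValueError(relation)
    scene = np.concatenate([pc_a, pc_b + offset], axis=0)
    prompt = f"a {name_b} {relation.replace('_', ' ')} a {name_a}"   # doubles as the annotation
    return scene, prompt

scene, prompt = compose_scene(np.random.rand(1024, 3) - 0.5,
                              np.random.rand(1024, 3) - 0.5,
                              "table", "lamp", relation="on_top_of")
```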
We introduce GAUDI, a generative model capable of capturing the distribution of complex and realistic 3D scenes that can be rendered immersively from a moving camera. We tackle this challenging problem with a scalable yet powerful approach, where we first optimize a latent representation that disentangles radiance fields and camera poses. This latent representation is then used to learn a generative model that enables both unconditional and conditional generation of 3D scenes. Our model generalizes previous works that focus on single objects by removing the assumption that the camera pose distribution can be shared across samples. We show that GAUDI obtains state-of-the-art performance in the unconditional generative setting across multiple datasets and allows for conditional generation of 3D scenes given conditioning variables like sparse image observations or text that describes the scene.
https://arxiv.org/abs/2207.13751
We present a method that achieves state-of-the-art results on challenging (few-shot) layout-to-image generation tasks by accurately modeling the textures, structures, and relationships contained in a complex scene. After compressing RGB images into patch tokens, we propose the Transformer with Focal Attention (TwFA) for exploring object-to-object, object-to-patch, and patch-to-patch dependencies. Compared to existing CNN-based and Transformer-based generation models, which entangle modeling at the pixel&patch level and the object&patch level respectively, the proposed focal attention predicts the current patch token by focusing only on the highly related tokens specified by the spatial layout, thereby achieving disambiguation during training. Furthermore, the proposed TwFA largely increases data efficiency during training; we therefore propose the first few-shot complex scene generation strategy based on the well-trained TwFA. Comprehensive experiments show the superiority of our method, which significantly improves both quantitative metrics and qualitative visual realism with respect to state-of-the-art CNN-based and transformer-based methods. Code is available at this https URL.
https://arxiv.org/abs/2206.00923
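A hedged sketch of a layout-driven attention mask in the spirit of focal attention: each patch token may attend only to tokens inside the same layout box (plus itself); the grid size and boxes are illustrative:

```python
import torch

def focal_mask(layout_boxes, grid=8):
    """layout_boxes: list of (x0, y0, x1, y1) in grid units. Returns an (N, N) boolean
    mask over N = grid*grid patch tokens, where True means 'may attend'."""
    n = grid * grid
    obj_id = torch.full((grid, grid), -1, dtype=torch.long)
    for i, (x0, y0, x1, y1) in enumerate(layout_boxes):
        obj_id[y0:y1, x0:x1] = i                     # assign each patch to an object
    ids = obj_id.flatten()
    mask = (ids[:, None] == ids[None, :]) & (ids[:, None] >= 0)
    mask |= torch.eye(n, dtype=torch.bool)           # every token attends to itself
    return mask

mask = focal_mask([(0, 0, 4, 4), (4, 4, 8, 8)])
# In attention: scores.masked_fill(~mask, float('-inf')) before the softmax.
```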
We present RLSS: a reinforcement learning algorithm for sequential scene generation, based on employing the proximal policy optimization (PPO) algorithm for generative problems. In particular, we consider how to effectively reduce the action space by including a greedy search algorithm in the learning process. Our experiments demonstrate that our method converges for a relatively large number of actions and learns to generate scenes with predefined design objectives. The approach places objects iteratively in the virtual scene: in each step, the network chooses which object to place and selects a position that results in maximal reward. A high reward is assigned if the last action resulted in the desired properties, whereas constraint violations are penalized. We demonstrate the capability of our method to generate plausible and diverse scenes efficiently by solving indoor planning problems and generating Angry Birds levels.
https://arxiv.org/abs/2206.02544
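A toy version of the sequential placement loop: score a pre-sampled candidate action set, reward satisfied objectives, and penalize constraint violations; the reward terms and overlap test are illustrative, not RLSS's actual design:

```python
import random

def violates_constraints(scene, pos):
    """Crude overlap test: reject positions too close to an already placed object."""
    return any(abs(pos[0] - p[0]) < 0.5 and abs(pos[1] - p[1]) < 0.5 for _, p in scene)

def step_reward(scene, obj, pos, target_counts):
    if violates_constraints(scene, pos):
        return -1.0                                  # penalize constraint violations
    placed = sum(1 for o, _ in scene if o == obj)
    return 1.0 if placed < target_counts.get(obj, 0) else 0.0   # reward unmet objectives

scene, target = [], {"chair": 4, "table": 1}
for _ in range(20):
    # pre-sampled candidate actions stand in for a pruned action space
    candidates = [(o, (random.uniform(0, 5), random.uniform(0, 5)))
                  for o in target for _ in range(8)]
    obj, pos = max(candidates, key=lambda a: step_reward(scene, a[0], a[1], target))
    if step_reward(scene, obj, pos, target) > 0:
        scene.append((obj, pos))
```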
Autonomous cyber-physical systems must often operate under uncertainties like sensor degradation and shifts in operating conditions, which increases their operational risk. Dynamic assurance of these systems requires designing runtime safety components like out-of-distribution detectors and risk estimators, which in turn require labeled data from the system's different operating modes, covering scenes with adverse operating conditions and sensor and actuator faults. Collecting real-world data for these scenes can be expensive and sometimes infeasible, so scenario description languages with samplers like random and grid search are used to generate synthetic data from simulators, replicating these real-world scenes. However, we point out three limitations of these conventional samplers. First, they are passive samplers, which do not use feedback from previous results in the sampling process. Second, the variables to be sampled may have constraints that are often not included. Third, they do not balance the tradeoff between exploration and exploitation, which we hypothesize is necessary for better search-space coverage. We present a scene generation approach with two samplers, Random Neighborhood Search (RNS) and Guided Bayesian Optimization (GBO), which extend conventional random search and Bayesian optimization to address these limitations. To guide the samplers, we use a risk-based metric that evaluates how risky the scene was for the system. We demonstrate our approach using an autonomous vehicle example in CARLA simulation. To evaluate our samplers, we compared them against the baselines of random search, grid search, and Halton sequence search. Our RNS and GBO samplers found a higher percentage of high-risk scenes (83% and 92%) than the grid, random, and Halton samplers (56%, 66%, and 71%, respectively).
https://arxiv.org/abs/2202.13510
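A hedged sketch of a risk-guided active sampler in the spirit of RNS: draw new scene parameters in a shrinking neighborhood around the riskiest scene found so far (the parameter bounds and the `run_carla_and_score` risk function are hypothetical):

```python
import random

def sample_uniform(bounds):
    return {k: random.uniform(lo, hi) for k, (lo, hi) in bounds.items()}

def sample_neighborhood(center, bounds, radius):
    out = {}
    for k, (lo, hi) in bounds.items():
        span = (hi - lo) * radius
        out[k] = min(hi, max(lo, center[k] + random.uniform(-span, span)))
    return out

def rns(risk_fn, bounds, budget=100, radius=0.3, decay=0.97):
    best = sample_uniform(bounds)
    best_risk = risk_fn(best)
    for _ in range(budget - 1):
        cand = sample_neighborhood(best, bounds, radius)
        r = risk_fn(cand)
        if r > best_risk:
            best, best_risk = cand, r                # exploit the riskier region
        radius *= decay                              # gradually narrow the search
    return best, best_risk

bounds = {"rain": (0.0, 1.0), "fog": (0.0, 1.0), "sensor_noise": (0.0, 0.2)}
# best_scene, risk = rns(run_carla_and_score, bounds)   # risk function is hypothetical
```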
What does human pose tell us about a scene? We propose a task to answer this question: given human pose as input, hallucinate a compatible scene. Subtle cues captured by human pose -- action semantics, environment affordances, object interactions -- provide surprising insight into which scenes are compatible. We present a large-scale generative adversarial network for pose-conditioned scene generation. We significantly scale the size and complexity of training data, curating a massive meta-dataset containing over 19 million frames of humans in everyday environments. We double the capacity of our model with respect to StyleGAN2 to handle such complex data, and design a pose conditioning mechanism that drives our model to learn the nuanced relationship between pose and scene. We leverage our trained model for various applications: hallucinating pose-compatible scene(s) with or without humans, visualizing incompatible scenes and poses, placing a person from one generated image into another scene, and animating pose. Our model produces diverse samples and outperforms pose-conditioned StyleGAN2 and Pix2Pix baselines in terms of accurate human placement (percent of correct keypoints) and image quality (Frechet inception distance).
https://arxiv.org/abs/2112.06909
Vehicle trajectory prediction is nowadays a fundamental pillar of self-driving cars, and both industry and the research community have acknowledged the need for such a pillar by running public benchmarks. While state-of-the-art methods are impressive on these benchmarks, i.e., they produce no off-road predictions, their generalization to cities outside the benchmarks is unknown. In this work, we show that those methods do not generalize to new scenes. We present a novel method that automatically generates realistic scenes causing state-of-the-art models to go off-road. We frame the problem through the lens of adversarial scene generation and propose a simple yet effective generative model based on atomic scene generation functions along with physical constraints. Our experiments show that more than $60\%$ of the existing scenes from current benchmarks can be modified in a way that makes prediction methods fail (predict off-road). We further show that (i) the generated scenes are realistic since they do exist in the real world, and (ii) they can be used to improve the robustness of existing models by 30-40%. Code is available at this https URL.
https://arxiv.org/abs/2112.03909
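An illustrative "atomic" scene modification: bending a lane centerline with a bounded Gaussian bump so the perturbed scene stays physically plausible; the bump shape and offset limit are assumptions, not the paper's exact functions:

```python
import numpy as np

def bend_lane(centerline, amplitude, center_idx, width=10, max_offset=1.5):
    """centerline: (N, 2) polyline. Applies a lateral Gaussian bump around center_idx,
    with the amplitude clipped to keep the perturbation physically plausible."""
    amplitude = float(np.clip(amplitude, -max_offset, max_offset))
    n = len(centerline)
    bump = amplitude * np.exp(-((np.arange(n) - center_idx) ** 2) / (2 * width ** 2))
    tangents = np.gradient(centerline, axis=0)
    normals = np.stack([-tangents[:, 1], tangents[:, 0]], axis=1)
    normals /= np.linalg.norm(normals, axis=1, keepdims=True) + 1e-8
    return centerline + bump[:, None] * normals      # shift each point along its normal

lane = np.stack([np.linspace(0, 50, 100), np.zeros(100)], axis=1)
attacked = bend_lane(lane, amplitude=1.2, center_idx=50)
# An adversarial loop would search over (amplitude, center_idx) for scenes where the
# trajectory predictor's output ends up off-road.
```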
With the rapid advances in generative adversarial networks (GANs), the visual quality of synthesized scenes keeps improving, including for complex urban scenes with applications to automated driving. In this work, we address a continual scene generation setup in which GANs are trained on a stream of distinct domains; ideally, the learned models should eventually be able to generate new scenes in all seen domains. This setup reflects the real-life scenario where data are continuously acquired in different places at different times. In such a continual setup, we aim for learning with zero forgetting, i.e., with no degradation in synthesis quality over earlier domains due to catastrophic forgetting. To this end, we introduce a novel framework that not only (i) enables seamless knowledge transfer in continual training but also (ii) guarantees zero forgetting with a small overhead cost. While being more memory efficient thanks to continual learning, our model obtains better synthesis quality than the brute-force solution that trains one full model per domain. In particular, under extremely low-data regimes, our approach outperforms the brute-force one by a large margin.
https://arxiv.org/abs/2112.03252
We introduce the GANformer2 model, an iterative object-oriented transformer explored for the task of generative modeling. The network incorporates strong and explicit structural priors to reflect the compositional nature of visual scenes, and synthesizes images through a sequential process. It operates in two stages: a fast and lightweight planning phase, where we draft a high-level scene layout, followed by an attention-based execution phase, where the layout is refined, evolving into a rich and detailed picture. Our model moves away from conventional black-box GAN architectures that feature a flat and monolithic latent space towards a transparent design that encourages efficiency, controllability and interpretability. We demonstrate GANformer2's strengths and qualities through a careful evaluation over a range of datasets, from multi-object CLEVR scenes to the challenging COCO images, showing it successfully achieves state-of-the-art performance in terms of visual quality, diversity and consistency. Further experiments demonstrate the model's disentanglement and provide deeper insight into its generative process, as it proceeds step-by-step from a rough initial sketch, to a detailed layout that accounts for objects' depths and dependencies, and up to the final high-resolution depiction of vibrant and intricate real-world scenes. See this https URL for model implementation.
https://arxiv.org/abs/2111.08960
Learning-based methods for training embodied agents typically require a large number of high-quality scenes that contain realistic layouts and support meaningful interactions. However, current simulators for Embodied AI (EAI) challenges only provide simulated indoor scenes with a limited number of layouts. This paper presents Luminous, the first research framework that employs state-of-the-art indoor scene synthesis algorithms to generate large-scale simulated scenes for Embodied AI challenges. Further, we automatically and quantitatively evaluate the quality of generated indoor scenes via their ability to support complex household tasks. Luminous incorporates a novel scene generation algorithm (Constrained Stochastic Scene Generation (CSSG)), which achieves competitive performance with human-designed scenes. Within Luminous, the EAI task executor, task instruction generation module, and video rendering toolkit can collectively generate a massive multimodal dataset of new scenes for the training and evaluation of Embodied AI agents. Extensive experimental results demonstrate the effectiveness of the data generated by Luminous, enabling the comprehensive assessment of embodied agents on generalization and robustness.
https://arxiv.org/abs/2111.05527
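A toy constrained stochastic placement routine via rejection sampling, loosely in the spirit of CSSG; the room dimensions, object footprints, and clearance are illustrative:

```python
import random

def place(room_w, room_d, objects, clearance=0.3, tries=200):
    """objects: list of (name, (width, depth)). Axis-aligned, rotation ignored for brevity."""
    placed = []
    for name, (w, d) in objects:
        for _ in range(tries):
            x = random.uniform(w / 2, room_w - w / 2)       # keep the object inside the room
            y = random.uniform(d / 2, room_d - d / 2)
            ok = all(abs(x - px) > (w + pw) / 2 + clearance or
                     abs(y - py) > (d + pd) / 2 + clearance
                     for _, px, py, pw, pd in placed)       # no overlap, minimum clearance
            if ok:
                placed.append((name, x, y, w, d))
                break
    return placed

layout = place(4.0, 5.0, [("bed", (1.6, 2.0)), ("wardrobe", (1.2, 0.6)),
                          ("desk", (1.2, 0.6)), ("chair", (0.5, 0.5))])
```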