We introduce SAOR, a novel approach for estimating the 3D shape, texture, and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons, SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model, without requiring any 3D object shape priors. To prevent ill-posed solutions, we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is aided by a new silhouette-based sampling mechanism that enhances viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time, given a single-view image, it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work.
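A minimal sketch of what a cross-instance consistency loss of this flavor could look like (our illustration, not SAOR's actual formulation): two same-category instances are encoded into disentangled deformation and articulation codes, the codes are swapped across instances, and re-encoding the swapped reconstructions should recover them. `enc` and `dec` are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def cross_instance_consistency(enc, dec, img_a, img_b):
    """Swap disentangled codes between two same-category instances and
    require the encoder/decoder pair to stay self-consistent."""
    d_a, a_a = enc(img_a)              # shape-deformation + articulation codes
    d_b, a_b = enc(img_b)
    # Mixing A's deformation with B's articulation should still yield a
    # plausible instance of the shared category; re-encode and compare codes.
    d_ab, a_ab = enc(dec(d_a, a_b))
    d_ba, a_ba = enc(dec(d_b, a_a))
    return (F.mse_loss(d_ab, d_a) + F.mse_loss(a_ab, a_b)
            + F.mse_loss(d_ba, d_b) + F.mse_loss(a_ba, a_a))
```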
https://arxiv.org/abs/2303.13514
Most video restoration networks are slow, have high computational load, and cannot be used for real-time video enhancement. In this work, we design an efficient and fast framework to perform real-time video enhancement for practical use cases like live video calls and video streams. Our proposed method, called Recurrent Bottleneck Mixer Network (ReBotNet), employs a dual-branch framework. The first branch learns spatio-temporal features by tokenizing the input frames along the spatial and temporal dimensions using a ConvNext-based encoder and processing these abstract tokens using a bottleneck mixer. To further improve temporal consistency, the second branch employs a mixer directly on tokens extracted from individual frames. A common decoder then merges the features from the two branches to predict the enhanced frame. In addition, we propose a recurrent training approach where the last frame's prediction is leveraged to efficiently enhance the current frame while improving temporal consistency. To evaluate our method, we curate two new datasets that emulate real-world video call and streaming scenarios, and show extensive results on multiple datasets where ReBotNet outperforms existing approaches with lower computations, reduced memory requirements, and faster inference time.
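A toy sketch of the dual-branch idea under stated assumptions: both branches are MLP-Mixer-style blocks, the ConvNext encoder is reduced to a single patchifying convolution, and all module names and sizes are ours, not ReBotNet's.

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Token-mixing + channel-mixing MLPs, in the MLP-Mixer style."""
    def __init__(self, n_tokens, dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(
            nn.Linear(n_tokens, n_tokens * 2), nn.GELU(), nn.Linear(n_tokens * 2, n_tokens))
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, x):  # x: (B, N, C)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))

class TinyDualBranch(nn.Module):
    """Branch 1 mixes tokens from the stacked (previous prediction, current)
    frames; branch 2 mixes tokens from the current frame alone; a decoder
    merges both. Layer sizes are toy choices."""
    def __init__(self, dim=64, n_tokens=64):
        super().__init__()
        self.patchify = nn.Conv2d(6, dim, 4, stride=4)   # stand-in for a ConvNext encoder
        self.temporal_branch = MixerBlock(n_tokens, dim)
        self.frame_branch = MixerBlock(n_tokens, dim)
        self.decoder = nn.ConvTranspose2d(dim * 2, 3, 4, stride=4)

    def forward(self, prev_pred, cur):  # recurrent: reuse last frame's prediction
        t = self.patchify(torch.cat([prev_pred, cur], dim=1))     # (B, C, h, w)
        b, c, h, w = t.shape
        tok1 = self.temporal_branch(t.flatten(2).transpose(1, 2))
        f = self.patchify(torch.cat([cur, cur], dim=1))           # current frame only
        tok2 = self.frame_branch(f.flatten(2).transpose(1, 2))
        merged = torch.cat([tok1, tok2], -1).transpose(1, 2).reshape(b, 2 * c, h, w)
        return self.decoder(merged)

net = TinyDualBranch()
print(net(torch.zeros(1, 3, 32, 32), torch.zeros(1, 3, 32, 32)).shape)  # (1, 3, 32, 32)
```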
https://arxiv.org/abs/2303.13504
Category-level 6D pose estimation aims to predict the poses and sizes of unseen objects from a specific category. Thanks to prior deformation, which explicitly adapts a category-specific 3D prior (i.e., a 3D template) to a given object instance, prior-based methods have attained great success and have become a major research stream. However, obtaining category-specific priors requires collecting a large number of 3D models, which is labor-intensive and often not accessible in practice. This motivates us to investigate whether priors are necessary to make prior-based methods effective. Our empirical study shows that the 3D prior itself is not what accounts for the high performance; the key is actually the explicit deformation process, which aligns camera and world coordinates under the supervision of world-space 3D models (also called the canonical space). Inspired by these observations, we introduce a simple prior-free implicit space transformation network, namely IST-Net, to transform camera-space features into world-space counterparts and build correspondence between them in an implicit manner, without relying on 3D priors. Besides, we design camera- and world-space enhancers to enrich the features with pose-sensitive information and geometrical constraints, respectively. Albeit simple, IST-Net becomes the first prior-free method to achieve state-of-the-art performance, with top inference speed, on the REAL275 dataset. Our code and models will be publicly available.
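A hedged sketch of what a prior-free camera-to-world transformation could look like: an MLP maps per-point camera-space features to world-space coordinates, supervised only by canonical (world-space) models. The architecture and loss are illustrative stand-ins, not IST-Net's actual design.

```python
import torch
import torch.nn as nn

class ImplicitSpaceTransform(nn.Module):
    """Map per-point camera-space features to world-space coordinates
    without any 3D template prior."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 3),               # predicted world-space coordinate
        )

    def forward(self, cam_xyz, cam_feat):    # (B, N, 3), (B, N, F)
        return self.mlp(torch.cat([cam_xyz, cam_feat], dim=-1))

def world_space_loss(pred_xyz, canonical_xyz):
    # One-directional chamfer distance to the canonical model's points.
    d = torch.cdist(pred_xyz, canonical_xyz)  # (B, N, M)
    return d.min(dim=-1).values.mean()

net = ImplicitSpaceTransform()
loss = world_space_loss(net(torch.randn(2, 100, 3), torch.randn(2, 100, 128)),
                        torch.randn(2, 500, 3))
print(loss.item())
```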
https://arxiv.org/abs/2303.13479
Diffusion-based models for text-to-image generation have gained immense popularity due to recent advancements in efficiency, accessibility, and quality. Although it is becoming increasingly feasible to perform inference with these systems using consumer-grade GPUs, training them from scratch still requires access to large datasets and significant computational resources. In the case of medical image generation, the availability of large, publicly accessible datasets that include text reports is limited due to legal and ethical concerns. While training a diffusion model on a private dataset may address this issue, it is not always feasible for institutions lacking the necessary computational resources. This work demonstrates that pre-trained Stable Diffusion models, originally trained on natural images, can be adapted to various medical imaging modalities by training text embeddings with textual inversion. In this study, we conducted experiments using medical datasets comprising only 100 samples from three medical modalities. Embeddings were trained in a matter of hours, while still retaining diagnostic relevance in image generation. Experiments were designed to achieve several objectives. Firstly, we fine-tuned the training and inference processes of textual inversion, revealing that larger embeddings and more examples are required. Secondly, we validated our approach by demonstrating a 2% increase in the diagnostic accuracy (AUC) for detecting prostate cancer on MRI, which is a challenging multi-modal imaging modality, from 0.78 to 0.80. Thirdly, we performed simulations by interpolating between healthy and diseased states, combining multiple pathologies, and inpainting to show embedding flexibility and control of disease appearance. Finally, the embeddings trained in this study are small (less than 1 MB), which facilitates easy sharing of medical data with reduced privacy concerns.
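A minimal sketch of a textual-inversion training step under stated assumptions: only the new token's embedding is optimized against the standard denoising objective, with the UNet, text encoder, and scheduler frozen. All callables are hypothetical stand-ins (e.g., a scheduler exposing `add_noise` and `num_train_timesteps`), and the placeholder-token position is an assumption.

```python
import torch
import torch.nn.functional as F

def textual_inversion_step(unet, text_encoder, scheduler, latents,
                           prompt_embs, new_token_emb, optimizer):
    """One optimization step: everything is frozen except `new_token_emb`,
    the embedding of the placeholder token being learned."""
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (latents.shape[0],), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    # Splice the learnable embedding into the frozen prompt embeddings at the
    # placeholder-token position (position 1 here -- an assumption).
    embs = prompt_embs.clone()
    embs[:, 1] = new_token_emb
    eps_pred = unet(noisy, t, text_encoder(embs))
    loss = F.mse_loss(eps_pred, noise)   # standard epsilon-prediction objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```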
https://arxiv.org/abs/2303.13430
Human mesh recovery (HMR) provides rich human body information for various real-world applications such as gaming, human-computer interaction, and virtual reality. Compared to single image-based methods, video-based methods can utilize temporal information to further improve performance by incorporating human body motion priors. However, many-to-many approaches such as VIBE suffer from a lack of motion smoothness and temporal consistency, while many-to-one approaches such as TCMR and MPS-Net rely on future frames, which is non-causal and time-inefficient during inference. To address these challenges, a novel Diffusion-Driven Transformer-based framework (DDT) for video-based HMR is presented. DDT is designed to decode specific motion patterns from the input sequence, enhancing motion smoothness and temporal consistency. As a many-to-many approach, the decoder of our DDT outputs the human mesh of all the frames, making DDT more viable for real-world applications where time efficiency is crucial and a causal model is desired. Extensive experiments conducted on widely used datasets (Human3.6M, MPI-INF-3DHP, and 3DPW) demonstrate the effectiveness and efficiency of our DDT.
https://arxiv.org/abs/2303.13397
In this paper, we investigate an open research task of generating controllable 3D textured shapes from the given textual descriptions. Previous works either require ground truth caption labeling or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for generated 3D shapes. Further, in order to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to enable fake-rendered images to align with the real ones. During the inference phase, our proposed model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and show the efficacy of our framework in generating high-fidelity 3D textured and text-relevant shapes.
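A hedged sketch of the pseudo-captioning idea: retrieve the vocabulary words whose text embeddings are most similar to the rendered image's embedding and drop them into a fixed template. The template and the random stand-in embeddings are illustrative; the real pipeline would use CLIP's encoders and vocabulary.

```python
import torch
import torch.nn.functional as F

def build_pseudo_caption(image_feat, word_feats, vocab,
                         template="a 3D model of a {} {}"):
    """Retrieve the top-matching vocabulary words for a rendered image
    and fill a caption template with them."""
    sims = F.cosine_similarity(image_feat.unsqueeze(0), word_feats, dim=-1)
    top = sims.topk(2).indices.tolist()
    return template.format(vocab[top[0]], vocab[top[1]])

# Toy usage with random stand-in embeddings:
vocab = ["red", "wooden", "car", "chair"]
word_feats = torch.randn(len(vocab), 512)   # stand-in CLIP text embeddings
image_feat = torch.randn(512)               # stand-in CLIP image embedding
print(build_pseudo_caption(image_feat, word_feats, vocab))
```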
https://arxiv.org/abs/2303.13273
Universal anomaly detection still remains a challenging problem in machine learning and medical image analysis. It is possible to learn an expected distribution from a single class of normative samples, e.g., through epistemic uncertainty estimates, auto-encoding models, or from synthetic anomalies in a self-supervised way. The performance of self-supervised anomaly detection approaches is still inferior to that of methods that use examples from known unknown classes to shape the decision boundary. However, outlier exposure methods often do not identify unknown unknowns. Here we discuss an improved self-supervised single-class training strategy that supports the approximation of probabilistic inference with loosened feature locality constraints. We show that up-scaling of gradients with histogram-equalised images is beneficial for recently proposed self-supervision tasks. Our method is integrated into several out-of-distribution (OOD) detection models and we show evidence that our method outperforms the state-of-the-art on various benchmark datasets. Source code will be publicly available by the time of the conference.
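For reference, plain histogram equalization is simple to implement; the sketch below (ours, with a toy input) shows the preprocessing step, and the trailing comment indicates where a gradient up-scaling factor would enter the self-supervised loss.

```python
import numpy as np

def equalize_hist(img):
    """Plain histogram equalization for a uint8 grayscale image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min() + 1e-8)  # normalize to [0, 1]
    return (cdf[img] * 255).astype(np.uint8)

img = np.random.randint(0, 80, size=(64, 64), dtype=np.uint8)  # low-contrast toy input
eq = equalize_hist(img)
# In a training step, `eq` would feed the synthetic-anomaly task, and a
# scale > 1 on its loss would up-scale the corresponding gradients, e.g.:
# loss = grad_scale * self_supervised_loss(model(eq_batch), targets)
```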
https://arxiv.org/abs/2303.13227
Face recognition models embed a face image into a low-dimensional identity vector containing abstract encodings of identity-specific facial features that allow individuals to be distinguished from one another. We tackle the challenging task of inverting the latent space of pre-trained face recognition models without full model access (i.e., the black-box setting). A variety of methods have been proposed in the literature for this task, but they have serious shortcomings such as a lack of realistic outputs, long inference times, and strong requirements on the dataset and accessibility of the face recognition model. Through an analysis of the black-box inversion problem, we show that the conditional diffusion model loss naturally emerges and that we can effectively sample from the inverse distribution even without an identity-specific loss. Our method, named identity denoising diffusion probabilistic model (ID3PM), leverages the stochastic nature of the denoising diffusion process to produce high-quality, identity-preserving face images with various backgrounds, lighting, poses, and expressions. We demonstrate state-of-the-art performance in terms of identity preservation and diversity, both qualitatively and quantitatively. Our method is the first black-box face recognition model inversion method that offers intuitive control over the generation process and does not suffer from any of the common shortcomings of competing methods.
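A hedged sketch of identity-conditioned DDPM ancestral sampling: `denoiser(x, t, id_vec)` is a hypothetical epsilon-predictor conditioned on the face-recognition embedding, and `betas` is a standard noise schedule (e.g., `torch.linspace(1e-4, 0.02, 1000)`).

```python
import torch

@torch.no_grad()
def sample_id_conditioned(denoiser, id_vec, betas, shape):
    """Ancestral DDPM sampling conditioned on an identity embedding."""
    alphas = 1.0 - betas
    abar = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                       # start from pure noise
    for t in reversed(range(len(betas))):
        eps = denoiser(x, torch.full((shape[0],), t), id_vec)
        # Posterior mean of the reverse step (standard DDPM formula).
        mean = (x - betas[t] / torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    return x
```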
https://arxiv.org/abs/2303.13006
Efficiently digitizing high-fidelity animatable human avatars from videos is a challenging and active research topic. Recent volume rendering-based neural representations open a new way for human digitization with their friendly usability and photo-realistic reconstruction quality. However, they suffer from long optimization times and slow inference speeds, and their implicit nature results in entangled geometry, materials, and dynamics that are hard to edit afterward. Such drawbacks prevent their direct applicability to downstream applications, especially the prominent rasterization-based graphics pipelines. We present EMA, a method that Efficiently learns Meshy neural fields to reconstruct animatable human Avatars. It jointly optimizes an explicit triangular canonical mesh, spatially varying materials, and motion dynamics via inverse rendering in an end-to-end fashion. Each of the above components is derived from a separate neural field, relaxing the requirement for a template or rigging. The mesh representation is highly compatible with efficient rasterization-based renderers, so our method takes only about an hour of training and can render in real time. Moreover, only minutes of optimization are enough for plausible reconstruction results. The disentanglement of meshes enables direct downstream applications. Extensive experiments illustrate highly competitive performance and a significant speed boost over previous methods. We also showcase applications including novel pose synthesis, material editing, and relighting. The project page: this https URL.
https://arxiv.org/abs/2303.12965
This paper introduces a general model called CIPNN (Continuous Indeterminate Probability Neural Network), which is based on IPNN, a model used for discrete latent random variables. Currently, the posterior of continuous latent variables is regarded as intractable; with the new theory proposed by IPNN, this problem can be solved. Our contributions are four-fold. First, we derive the analytical solution of the posterior calculation of continuous latent random variables and propose a general classification model (CIPNN). Second, we propose a general auto-encoder called CIPAE (Continuous Indeterminate Probability Auto-Encoder); for the first time, its decoder part is not a neural network but a fully probabilistic inference model. Third, we propose a new method to visualize the latent random variables: we use one of the N-dimensional latent variables as a decoder to reconstruct the input image, which works even for classification tasks; in this way, we can see what each latent variable has learned. Fourth, IPNN has shown great classification capability, and CIPNN pushes this classification capability to infinity. Theoretical advantages are reflected in the experimental results.
https://arxiv.org/abs/2303.12964
Surgical scene understanding is a key prerequisite for context-aware decision support in the operating room. While deep learning-based approaches have already reached or even surpassed human performance in various fields, the task of surgical action recognition remains a major challenge. With this contribution, we are the first to investigate the concept of self-distillation as a means of addressing class imbalance and potential label ambiguity in surgical video analysis. Our proposed method is a heterogeneous ensemble of three models that use Swin Transformers as backbone and the concepts of self-distillation and multi-task learning as core design choices. According to ablation studies performed with the CholecT45 challenge data via cross-validation, the biggest performance boost is achieved by the usage of soft labels obtained by self-distillation. External validation of our method on an independent test set was achieved by providing a Docker container of our inference model to the challenge organizers. According to their analysis, our method outperforms all other solutions submitted to the latest challenge in the field. Our approach thus shows the potential of self-distillation for becoming an important tool in medical image analysis applications.
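A generic soft-label self-distillation loss of the kind described, as a sketch: cross-entropy on hard labels blended with a KL term toward temperature-softened predictions from an earlier snapshot of the same model. Hyperparameters are illustrative, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, hard_labels,
                           T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence toward soft labels.
    In self-distillation, `teacher_logits` come from an earlier snapshot
    of the same model, detached from the graph."""
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                  F.softmax(teacher_logits / T, dim=-1),
                  reduction="batchmean") * (T * T)   # rescale for temperature
    ce = F.cross_entropy(student_logits, hard_labels)
    return alpha * kd + (1 - alpha) * ce

loss = self_distillation_loss(torch.randn(8, 10), torch.randn(8, 10).detach(),
                              torch.randint(0, 10, (8,)))
print(loss.item())
```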
https://arxiv.org/abs/2303.12915
Emerging AI applications such as ChatGPT, graph convolutional networks, and other deep neural networks require massive computational resources for training and inference. Contemporary computing platforms such as CPUs, GPUs, and TPUs are struggling to keep up with the demands of these AI applications. Non-coherent optical computing represents a promising approach for light-speed acceleration of AI workloads. In this paper, we show how cross-layer design can overcome challenges in non-coherent optical computing platforms. We describe approaches for optical device engineering, tuning circuit enhancements, and architectural innovations to adapt optical computing to a variety of AI workloads. We also discuss techniques for hardware/software co-design that can intelligently map and adapt AI software to improve its performance on non-coherent optical computing platforms.
https://arxiv.org/abs/2303.12910
Pose-conditioned convolutional generative models struggle with high-quality 3D-consistent image generation from single-view datasets, due to their lack of sufficient 3D priors. Recently, the integration of Neural Radiance Fields (NeRFs) and generative models, such as Generative Adversarial Networks (GANs), has transformed 3D-aware generation from single-view images. NeRF-GANs exploit the strong inductive bias of 3D neural representations and volumetric rendering at the cost of higher computational complexity. This study aims at revisiting pose-conditioned 2D GANs for efficient 3D-aware generation at inference time by distilling 3D knowledge from pre-trained NeRF-GANs. We propose a simple and effective method, based on re-using the well-disentangled latent space of a pre-trained NeRF-GAN in a pose-conditioned convolutional network to directly generate 3D-consistent images corresponding to the underlying 3D representations. Experiments on several datasets demonstrate that the proposed method obtains results comparable with volumetric rendering in terms of quality and 3D consistency while benefiting from the superior computational advantage of convolutional networks. The code will be available at: this https URL
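A hedged sketch of the distillation setup: the frozen NeRF-GAN renders a 3D-consistent target for a sampled latent and pose, and a fast pose-conditioned convolutional generator is trained to reproduce it. `nerf_gan.render`, the pose parameterization, and the loss are stand-ins, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def distillation_step(nerf_gan, conv_gen, optimizer, batch=8, z_dim=512):
    """One step of distilling a NeRF-GAN teacher into a conv student
    that shares the teacher's well-disentangled latent space."""
    z = torch.randn(batch, z_dim)
    pose = torch.rand(batch, 2) * 2 - 1          # e.g. (azimuth, elevation) in [-1, 1]
    with torch.no_grad():
        target = nerf_gan.render(z, pose)        # slow volumetric rendering (teacher)
    pred = conv_gen(z, pose)                     # fast convolutional student
    loss = F.l1_loss(pred, target)               # + a perceptual term in practice
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```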
https://arxiv.org/abs/2303.12865
While the state-of-the-art for frame semantic parsing has progressed dramatically in recent years, it is still difficult for end-users to apply state-of-the-art models in practice. To address this, we present Frame Semantic Transformer, an open-source Python library which achieves near state-of-the-art performance on FrameNet 1.7, while focusing on ease-of-use. We use a T5 model fine-tuned on Propbank and FrameNet exemplars as a base, and improve performance by using FrameNet lexical units to provide hints to T5 at inference time. We enhance robustness to real-world data by using textual data augmentations during training.
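Usage is meant to be a few lines; the sketch below follows the project's README as we recall it, so attribute names may differ across versions.

```python
from frame_semantic_transformer import FrameSemanticTransformer

frame_transformer = FrameSemanticTransformer()
result = frame_transformer.detect_frames(
    "The hallway smelt of boiled cabbage and old rag mats.")
for frame in result.frames:
    print(frame.name)                          # evoked FrameNet frame
    for element in frame.frame_elements:
        print(f"  {element.name}: {element.text}")
```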
https://arxiv.org/abs/2303.12788
Neuro-symbolic predictors learn a mapping from sub-symbolic inputs to higher-level concepts and then carry out (probabilistic) logical inference on this intermediate representation. This setup offers clear advantages in terms of consistency with symbolic prior knowledge, and is often believed to provide interpretability benefits in that, by virtue of complying with the knowledge, the learned concepts can be better understood by human stakeholders. However, it was recently shown that this setup is affected by reasoning shortcuts, whereby predictions attain high accuracy by leveraging concepts with unintended semantics, yielding poor out-of-distribution performance and compromising interpretability. In this short paper, we establish a formal link between reasoning shortcuts and the optima of the loss function, and identify situations in which reasoning shortcuts can arise. Based on this, we discuss limitations of natural mitigation strategies such as reconstruction and concept supervision.
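A minimal toy illustration of a reasoning shortcut (ours, not the paper's): with the knowledge label = A XOR B, negating both concepts leaves every label prediction intact, so two extractors with different concept semantics reach identical label accuracy, i.e., the same loss optimum.

```python
from itertools import product

# Knowledge: label = A XOR B. Two concept extractors that both satisfy it:
intended = lambda a, b: (a, b)           # concepts with intended semantics
shortcut = lambda a, b: (1 - a, 1 - b)   # negated concepts -- wrong semantics

for a, b in product([0, 1], repeat=2):
    label = a ^ b
    c1, c2 = intended(a, b)
    s1, s2 = shortcut(a, b)
    assert (c1 ^ c2) == label and (s1 ^ s2) == label   # identical label accuracy
print("Both concept assignments reach zero label loss; only one is right.")
```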
https://arxiv.org/abs/2303.12578
With the development of deep learning processors and accelerators, deep learning models have been widely deployed on edge devices as part of the Internet of Things. Edge device models are generally considered valuable intellectual property that is worth protecting carefully. Unfortunately, these models are at great risk of being stolen or illegally copied. Existing model protections based on encryption algorithms suffer from high computation overhead, which is impractical given the limited computing capacity of edge devices. In this work, we propose a lightweight, practical, and general Edge device model Protection method at the neuron level, denoted as EdgePro. Specifically, we select several neurons as authorization neurons, set their activation values to locking values, and scale the neuron outputs as the "passwords" during training. EdgePro protects the model by ensuring it can only work correctly when the "passwords" are met, at the cost of encrypting and storing the information of the "passwords" instead of the whole model. Extensive experimental results indicate that EdgePro works well on protection tasks across datasets with different modes. The inference time increase of EdgePro is only 60% of that of state-of-the-art methods, and the accuracy loss is less than 1%. Additionally, EdgePro is robust against adaptive attacks including fine-tuning and pruning, which makes it more practical in real-world applications. EdgePro is also open sourced to facilitate future research: this https URL
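A hedged sketch of neuron-level locking in the spirit of EdgePro: a wrapper forces a few "authorization neurons" to secret locking values and scales the layer output, so inference is only correct when the right (value, scale) password is applied. Indices and values here are toy choices, not EdgePro's actual scheme.

```python
import torch
import torch.nn as nn

class AuthorizedLayer(nn.Module):
    """Wrap a layer so its output depends on a secret (value, scale) pair;
    with the wrong password, activations are corrupted and accuracy drops."""
    def __init__(self, layer, auth_idx, lock_value, scale):
        super().__init__()
        self.layer, self.auth_idx = layer, auth_idx
        self.lock_value, self.scale = lock_value, scale

    def forward(self, x):
        out = self.layer(x)
        out[:, self.auth_idx] = self.lock_value   # clamp authorization neurons
        return out * self.scale                   # password-dependent scaling

layer = AuthorizedLayer(nn.Linear(16, 32), auth_idx=[3, 17],
                        lock_value=1.5, scale=0.8)
print(layer(torch.randn(2, 16)).shape)            # torch.Size([2, 32])
```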
https://arxiv.org/abs/2303.12397
In this paper, we propose NUWA-XL, a novel Diffusion over Diffusion architecture for eXtremely Long video generation. Most current work generates long videos segment by segment sequentially, which normally leads to a gap between training on short videos and inferring long videos; sequential generation is also inefficient. Instead, our approach adopts a "coarse-to-fine" process, in which the video can be generated in parallel at the same granularity. A global diffusion model is applied to generate the keyframes across the entire time range, and then local diffusion models recursively fill in the content between nearby frames. This simple yet effective strategy allows us to directly train on long videos (3376 frames) to reduce the training-inference gap, and makes it possible to generate all segments in parallel. To evaluate our model, we build the FlintstonesHD dataset, a new benchmark for long video generation. Experiments show that our model not only generates high-quality long videos with both global and local coherence, but also decreases the average inference time from 7.55 min to 26 s (by 94.26%) at the same hardware setting when generating 1024 frames. The homepage link is: this https URL
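A hedged sketch of the coarse-to-fine recursion: a global model proposes keyframes over the full time range, then local models fill each gap, recursively; every gap is independent, which is what allows parallel generation. Both model callables are stand-ins, and integers stand in for frames in the toy run.

```python
def fill(frames, local_diffusion, depth):
    """Recursively in-fill between neighboring frames; every gap is
    independent, so in practice all gaps at one level run in parallel."""
    if depth == 0:
        return frames
    out = [frames[0]]
    for a, b in zip(frames[:-1], frames[1:]):
        mid = local_diffusion(a, b)      # new frames strictly between a and b
        out.extend(fill([a] + mid + [b], local_diffusion, depth - 1)[1:])
    return out

def generate_long_video(global_diffusion, local_diffusion, depth=2):
    keyframes = global_diffusion()       # coarse keyframes over the full range
    return fill(keyframes, local_diffusion, depth)

# Toy run with integers standing in for frames:
video = generate_long_video(lambda: [0, 100, 200],
                            lambda a, b: [(a + b) // 2], depth=2)
print(video)  # [0, 25, 50, 75, 100, 125, 150, 175, 200]
```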
https://arxiv.org/abs/2303.12346
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos, taking only video-level labels as supervision. Pseudo label generation is a promising strategy for this challenging problem, but most existing methods are limited to employing snippet-wise classification results to guide the generation, ignoring that the natural temporal structure of the video can also provide rich information to assist such a generation process. In this paper, we propose a novel weakly-supervised temporal action localization method that infers snippet-feature affinity. First, we design an affinity inference module that exploits the affinity relationship between temporally neighboring snippets to generate initial coarse pseudo labels. Then, we introduce an information interaction module that refines the coarse labels by enhancing the discriminative nature of snippet features through exploring intra- and inter-video relationships. Finally, the high-fidelity pseudo labels generated by the information interaction module are used to supervise the training of the action localization network. Extensive experiments on two publicly available datasets, i.e., THUMOS14 and ActivityNet v1.3, demonstrate that our proposed method achieves significant improvements compared to state-of-the-art methods.
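A hedged sketch of affinity-based pseudo-label generation: each snippet's class scores are smoothed with those of its temporal neighbors, weighted by feature cosine similarity. Window size and weighting are toy choices, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def affinity_pseudo_labels(feats, init_scores, k=1):
    """Smooth per-snippet class scores using temporal-neighbor affinity.
    feats: (T, D) snippet features; init_scores: (T, C) initial scores."""
    T = feats.shape[0]
    normed = F.normalize(feats, dim=-1)
    labels = init_scores.clone()
    for t in range(T):
        lo, hi = max(0, t - k), min(T, t + k + 1)
        w = (normed[lo:hi] @ normed[t]).softmax(dim=0)   # affinity to neighbors
        labels[t] = (w.unsqueeze(-1) * init_scores[lo:hi]).sum(dim=0)
    return labels

feats = torch.randn(20, 64)      # T snippet features
scores = torch.rand(20, 5)       # T x num_classes initial classification scores
print(affinity_pseudo_labels(feats, scores).shape)  # torch.Size([20, 5])
```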
https://arxiv.org/abs/2303.12332
Graph Neural Networks (GNNs) have recently been introduced to learn from knowledge graphs (KGs) and have achieved state-of-the-art performance in KG reasoning. However, a theoretical certification for their good empirical performance is still absent. Besides, while logic in KGs is important for inductive and interpretable inference, existing GNN-based methods are just designed to fit data distributions, with limited knowledge of their logical expressiveness. We propose to fill the above gap in this paper. Specifically, we theoretically analyze GNNs in terms of logical expressiveness and find out what kinds of logical rules can be captured from KGs. Our results first show that GNNs can capture logical rules from graded modal logic, providing a new theoretical tool for analyzing the expressiveness of GNNs for KG reasoning, and that a query labeling trick makes it easier for GNNs to capture logical rules, explaining why SOTA methods are mainly based on the labeling trick. Finally, insights from our theory motivate the development of an entity labeling method for capturing difficult logical rules. Experimental results are consistent with our theoretical results and verify the effectiveness of our proposed method.
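For concreteness, a labeling trick can be as small as one indicator feature; the sketch below (our illustration, not the paper's method) marks the query entity before the GNN runs so that node representations become query-dependent.

```python
import torch

def add_query_label(node_feats, query_entity):
    """Append an indicator feature that marks the query entity, so the GNN
    can reason relative to the query rather than from identical features."""
    flag = torch.zeros(node_feats.shape[0], 1)
    flag[query_entity] = 1.0
    return torch.cat([node_feats, flag], dim=-1)

x = torch.randn(5, 8)                             # 5 nodes, 8-dim features
print(add_query_label(x, query_entity=2).shape)   # torch.Size([5, 9])
```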
https://arxiv.org/abs/2303.12306
Federated learning (FL) is a distributed machine learning technique in which multiple clients cooperate to train a shared model without exchanging their raw data. However, heterogeneity of data distributions among clients usually leads to poor model inference. In this paper, a prototype-based federated learning framework is proposed, which can achieve better inference performance with only a few changes to the last global iteration of the typical federated learning process. In the last iteration, the server aggregates the prototypes transmitted from the distributed clients and then sends them back to the local clients for their respective model inferences. Experiments on two baseline datasets show that our proposal achieves higher accuracy (by at least 1%) and relatively efficient communication compared to two popular baselines under different heterogeneous settings.
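A minimal sketch of the prototype exchange under stated assumptions: clients send per-class feature prototypes, the server averages them, and each client then classifies by nearest global prototype. The data layout is ours, not the paper's exact protocol.

```python
import torch

def aggregate_prototypes(client_protos):
    """Server side: average per-class prototypes across clients.
    `client_protos` is a list of {class_id: feature_vector} dicts."""
    merged = {}
    for protos in client_protos:
        for c, p in protos.items():
            merged.setdefault(c, []).append(p)
    return {c: torch.stack(ps).mean(dim=0) for c, ps in merged.items()}

def nearest_prototype_predict(feat, global_protos):
    """Client side: classify an embedding by its closest global prototype."""
    classes = sorted(global_protos)
    d = torch.stack([torch.norm(feat - global_protos[c]) for c in classes])
    return classes[d.argmin().item()]

protos_a = {0: torch.randn(16), 1: torch.randn(16)}
protos_b = {0: torch.randn(16), 2: torch.randn(16)}
g = aggregate_prototypes([protos_a, protos_b])
print(nearest_prototype_predict(torch.randn(16), g))
```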
https://arxiv.org/abs/2303.12296