Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.
https://arxiv.org/abs/2404.12347
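The Bézier-curve motion regularization described in the AniClipart abstract above can be sketched in a few lines: each keypoint's trajectory over the animation is a cubic Bézier curve, and its control points are what a VSDS-style loss would optimize. A minimal sketch; the control-point values and frame count are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameters t in [0, 1]."""
    t = np.asarray(t)[..., None]  # broadcast t over the coordinate dimension
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

# Hypothetical trajectory for one keypoint: endpoint anchors p0/p3 and two
# learnable control points p1/p2 (the quantities a VSDS-style loss would tune).
p0 = np.array([0.0, 0.0])
p3 = np.array([1.0, 0.5])
p1 = np.array([0.2, 0.8])
p2 = np.array([0.8, 0.8])

ts = np.linspace(0.0, 1.0, 8)             # 8 animation frames
trajectory = cubic_bezier(p0, p1, p2, p3, ts)
print(trajectory.shape)                   # (8, 2): one 2D position per frame
```

At t = 0 and t = 1 the curve passes exactly through its endpoint anchors, which is what makes Bézier control points a convenient low-dimensional parameterization for smooth, cartoon-style keypoint motion.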
X-ray images play a vital role in intraoperative procedures owing to their high resolution and fast imaging speed, and they greatly facilitate subsequent segmentation, registration, and reconstruction. However, excessive X-ray doses pose potential risks to human health. Data-driven algorithms for synthesizing X-ray images from volume scans are restricted by the scarcity of paired X-ray and volume data, and existing methods are mainly realized by modelling the whole X-ray imaging procedure. In this study, we propose a learning-based approach termed CT2X-GAN to synthesize X-ray images in an end-to-end manner using content and style disentanglement from three different image domains. Our method decouples anatomical structure information from CT scans and style information from unpaired real X-ray images / digitally reconstructed radiography (DRR) images via a series of decoupling encoders. Additionally, we introduce a novel consistency regularization term to improve the stylistic resemblance between synthesized and real X-ray images. Meanwhile, we also impose a supervised process by computing the similarity between real DRR and synthesized DRR images. We further develop a pose attention module to fully strengthen the comprehensive information in the content code decoupled from CT scans, facilitating high-quality multi-view image synthesis in the lower 2D space. Extensive experiments conducted on the publicly available CTSpine1K dataset achieved 97.8350, 0.0842, and 3.0938 in terms of FID, KID, and a user-scored X-ray similarity, respectively. In comparison with 3D-aware methods ($\pi$-GAN, EG3D), CT2X-GAN is superior in synthesis quality and realism with respect to real X-ray images.
https://arxiv.org/abs/2404.11889
Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. However, despite this promising potential, conventional SSL methods encounter limitations, including challenges in achieving semantic alignment and capturing subtle details. This leads to suboptimal representations that fail to accurately capture the underlying anatomical structures and pathological details. In response to these constraints, we introduce a novel SSL framework, OPTiML, employing optimal transport (OT) to capture dense semantic invariance and fine-grained details, thereby enhancing the overall effectiveness of SSL in medical image representation learning. The core idea is to integrate OT with a cross-viewpoint semantics infusion module (CV-SIM), which effectively captures the complex, fine-grained details inherent in medical images across different viewpoints. In addition to the CV-SIM module, OPTiML imposes variance and covariance regularizations within the OT framework to force the model to focus on clinically relevant information while discarding less informative features. Through these, the proposed framework demonstrates its capacity to learn semantically rich representations that can be applied to various medical imaging tasks. To validate its effectiveness, we conduct experimental studies on three publicly available datasets from the chest X-ray modality. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.
https://arxiv.org/abs/2404.11868
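The variance and covariance regularizations that OPTiML imposes are described only at a high level in the abstract; a common concrete instantiation is the VICReg-style pair of terms sketched below. This is an assumption about the form of the regularizers, not the paper's exact loss:

```python
import numpy as np

def var_cov_regularizers(z, gamma=1.0, eps=1e-4):
    """VICReg-style variance and covariance regularization terms.

    Variance term: hinge that keeps each feature's std above gamma,
    preventing representational collapse. Covariance term: penalizes
    off-diagonal covariance, decorrelating the feature dimensions."""
    n, d = z.shape
    z = z - z.mean(axis=0)                       # center the batch
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss, cov_loss

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 32))        # a batch of 64 embeddings of dim 32
v, c = var_cov_regularizers(z)
print(v >= 0.0 and c >= 0.0)         # True: both penalties are non-negative
```

Minimizing these two terms alongside the main SSL objective pushes the model toward features that are individually informative (non-collapsed) and mutually decorrelated.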
Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications on model performance. In particular, first, we develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound for the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 models trained on CIFAR-10, CIFAR-100, and ImageNet datasets on convolutional and transformer-based models.
https://arxiv.org/abs/2404.11769
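The framing of quantization as a form of regularization rests on the rounding error it injects into the weights. A generic symmetric uniform quantizer (a standard construction, not necessarily the paper's exact theoretical model) makes the "quantization noise" explicit:

```python
import numpy as np

def uniform_quantize(w, num_bits=8):
    """Symmetric uniform quantizer: snap weights to a 2^b-level grid."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax             # one scale per tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q * scale                             # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
w_q = uniform_quantize(w, num_bits=4)
noise = w_q - w                                  # the quantization "noise"
scale = np.max(np.abs(w)) / (2 ** 3 - 1)
print(np.abs(noise).max() <= scale / 2 + 1e-12)  # rounding error bounded by scale/2
```

The per-weight noise is bounded by half the grid spacing, so fewer bits mean a coarser grid and larger injected noise; the paper's generalization bound is conditioned on exactly this quantity.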
Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which combines two challenging tasks. It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. Our technique harnesses vector graphics representations and an end-to-end optimization-based framework. This framework employs neural displacement fields to convert letters into base shapes and applies per-frame motion, encouraging coherence with the intended textual concept. Shape preservation techniques and perceptual loss regularization are employed to maintain legibility and structural integrity throughout the animation process. We demonstrate the generalizability of our approach across various text-to-video models and highlight the superiority of our end-to-end methodology over baseline methods, which might comprise separate tasks. Through quantitative and qualitative evaluations, we demonstrate the effectiveness of our framework in generating coherent text animations that faithfully interpret user prompts while maintaining readability. Our code is available at: this https URL.
https://arxiv.org/abs/2404.11614
Estimating the sound absorption in situ relies on accurately describing the measured sound field. Evidence suggests that modeling the reflection of impinging spherical waves is important, especially for compact measurement systems. This article proposes a method for estimating the sound absorption coefficient of a material sample by mapping the sound pressure, measured by a microphone array, to a distribution of monopoles along a line in the complex plane. The proposed method is compared to modeling the sound field as a superposition of two sources (a monopole and an image source). The obtained inverse problems are solved with Tikhonov regularization, with automatic choice of the regularization parameter by the L-curve criterion. The sound absorption measurement is tested with simulations of the sound field above infinite and finite porous absorbers. The approaches are compared to the plane-wave absorption coefficient and the one obtained by spherical wave incidence. Experimental analysis of two porous samples and one resonant absorber is also carried out in situ. Four arrays were tested with an increasing aperture and number of sensors. It was demonstrated that measurements are feasible even with an array with only a few microphones. The discretization of the integral equation led to a more accurate reconstruction of the sound pressure and particle velocity at the sample's surface. The resulting absorption coefficient agrees with the one obtained for spherical wave incidence, indicating that including more monopoles along the complex line is an essential feature of the sound field.
https://arxiv.org/abs/2404.11399
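The inverse problem described above, mapping measured sound pressure to monopole amplitudes, reduces to a regularized least-squares solve. The sketch below uses a toy ill-conditioned system in place of the actual acoustic transfer matrix and traces the L-curve over a grid of regularization parameters; automatic corner detection is omitted:

```python
import numpy as np

def tikhonov_solve(A, b, lam):
    """Solve min ||A x - b||^2 + lam^2 ||x||^2 via the normal equations."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam ** 2 * np.eye(n), A.T @ b)

# Toy ill-conditioned system standing in for the monopole-amplitude inversion.
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 30)) @ np.diag(np.logspace(0, -6, 30))  # decaying spectrum
x_true = rng.normal(size=30)
b = A @ x_true + 1e-4 * rng.normal(size=50)                      # noisy "measurements"

lams = np.logspace(-6, 0, 25)
residual_norms = [np.linalg.norm(A @ tikhonov_solve(A, b, lam) - b) for lam in lams]
solution_norms = [np.linalg.norm(tikhonov_solve(A, b, lam)) for lam in lams]
# Plotting log(residual) against log(solution norm) traces the L-curve; the
# corner balances data fit against the size of the regularized solution.
```

More regularization trades a larger residual for a smaller solution norm, which is exactly the trade-off the L-curve criterion exploits when choosing the parameter automatically.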
The application of artificial intelligence (AI) models in fields such as engineering is limited by the known difficulty of quantifying the reliability of an AI's decision. A well-calibrated AI model must correctly report its accuracy on in-distribution (ID) inputs, while also enabling the detection of out-of-distribution (OOD) inputs. A conventional approach to improve calibration is the application of Bayesian ensembling. However, owing to computational limitations and model misspecification, practical ensembling strategies do not necessarily enhance calibration. This paper proposes an extension of variational inference (VI)-based Bayesian learning that integrates calibration regularization for improved ID performance, confidence minimization for OOD detection, and selective calibration to ensure a synergistic use of calibration regularization and confidence minimization. The scheme is constructed successively by first introducing calibration-regularized Bayesian learning (CBNN), then incorporating out-of-distribution confidence minimization (OCM) to yield CBNN-OCM, and finally integrating also selective calibration to produce selective CBNN-OCM (SCBNN-OCM). Selective calibration rejects inputs for which the calibration performance is expected to be insufficient. Numerical results illustrate the trade-offs between ID accuracy, ID calibration, and OOD calibration attained by both frequentist and Bayesian learning methods. Among the main conclusions, SCBNN-OCM is seen to achieve best ID and OOD performance as compared to existing state-of-the-art approaches at the cost of rejecting a sufficiently large number of inputs.
https://arxiv.org/abs/2404.11350
Content moderation faces a challenging task, as social media's ability to spread hate speech contrasts with its role in promoting global connectivity. With rapidly evolving slang and hate speech, the adaptability of conventional deep learning to the fluid landscape of online dialogue remains limited. In response, causality-inspired disentanglement has shown promise by segregating platform-specific peculiarities from universal hate indicators. However, its dependency on available ground-truth target labels for discerning these nuances faces practical hurdles given the incessant evolution of platforms and the mutable nature of hate speech. Using confidence-based reweighting and contrastive regularization, this study presents HATE WATCH, a novel framework for weakly supervised causal disentanglement that circumvents the need for explicit target labeling and effectively disentangles input features into invariant representations of hate. Empirical validation across four platforms (two with target labels and two without) positions HATE WATCH as a novel method for cross-platform hate speech detection with superior performance. HATE WATCH advances scalable content moderation techniques towards developing safer online communities.
https://arxiv.org/abs/2404.11036
Semi-supervised image classification, leveraging pseudo supervision and consistency regularization, has demonstrated remarkable success. However, the ongoing challenge lies in fully exploiting the potential of unlabeled data. To address this, we employ information entropy neural estimation to harness the potential of unlabeled samples. Inspired by contrastive learning, the entropy is estimated by maximizing a lower bound on mutual information across different augmented views. Moreover, we theoretically analyze that the information entropy of the posterior of an image classifier is approximated by maximizing the likelihood function of the softmax predictions. Guided by these insights, we optimize our model from both perspectives to ensure that the predicted probability distribution closely aligns with the ground-truth distribution. Given the theoretical connection to information entropy, we name our method \textit{InfoMatch}. Through extensive experiments, we show its superior performance.
https://arxiv.org/abs/2404.11003
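The contrastive entropy estimate mentioned above maximizes a lower bound on mutual information across augmented views; the standard InfoNCE bound is one common way to write it. A sketch under the assumption that an InfoNCE-style estimator is used (the abstract does not pin down the exact form):

```python
import numpy as np

def infonce_lower_bound(z1, z2, tau=0.1):
    """InfoNCE-style lower bound on mutual information between two views.

    For an N-pair batch, I(z1; z2) >= log(N) - L_InfoNCE, where the
    positives sit on the diagonal of the pairwise similarity matrix."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # cosine similarities / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_softmax))         # cross-entropy on the diagonal
    return np.log(z1.shape[0]) - loss

rng = np.random.default_rng(0)
z = rng.normal(size=(128, 16))
z1 = z + 0.1 * rng.normal(size=z.shape)   # two augmented "views" of the same data
z2 = z + 0.1 * rng.normal(size=z.shape)
bound = infonce_lower_bound(z1, z2)
print(bound > 0.0)                        # correlated views give a positive bound
```

Since the cross-entropy term is non-negative, the estimate is capped at log(N); larger batches therefore allow tighter estimates of the mutual information.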
With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification, based on the Bridge, or $L_p$ regularization during training. We introduce a novel weight decay scheme, which generalizes the standard $L_2$ weight decay to any $p$ norm. We show that this scheme is compatible with adaptive optimizers, and avoids the gradient divergence associated with $0<p<1$ norms. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
https://arxiv.org/abs/2404.10824
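The paper's generalized weight decay is not fully specified in the abstract, so the sketch below illustrates the idea with a proximal-style $L_p$ decay step: the decay magnitude $\eta \lambda p |w|^{p-1}$ is applied directly to the weights (decoupled, as in AdamW-style decay), while an $\epsilon$ floor plus a clamp at zero avoids the divergence of the $|w|^{p-1}$ term for $0<p<1$ and drives small weights exactly to zero, producing sparsity:

```python
import numpy as np

def lp_weight_decay(w, lr=1e-2, lam=1e-2, p=0.8, eps=1e-8):
    """Proximal-style decoupled L_p weight decay (illustrative sketch).

    The decay step lr * lam * p * |w|^(p-1) is applied directly to the
    weights; the eps floor and the clamp at zero avoid the blow-up of
    |w|^(p-1) as w -> 0 when 0 < p < 1."""
    step = lr * lam * p * (np.abs(w) + eps) ** (p - 1)
    return np.sign(w) * np.maximum(np.abs(w) - step, 0.0)  # never overshoots zero

w = np.array([1.0, -0.5, 1e-6, 0.0])
w_new = lp_weight_decay(w)
print(w_new)  # large weights shrink slightly; tiny weights snap to exactly zero
```

For p < 1 the decay magnitude grows as weights shrink, so small weights are eliminated while large ones are barely penalized; this is the sparsifying behavior the abstract attributes to Bridge regularization.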
Recently, 3D Gaussian Splatting (3DGS) has demonstrated impressive novel view synthesis results, while allowing the rendering of high-resolution images in real-time. However, leveraging 3D Gaussians for surface reconstruction poses significant challenges due to the explicit and disconnected nature of 3D Gaussians. In this work, we present Gaussian Opacity Fields (GOF), a novel approach for efficient, high-quality, and compact surface reconstruction in unbounded scenes. Our GOF is derived from ray-tracing-based volume rendering of 3D Gaussians, enabling direct geometry extraction from 3D Gaussians by identifying its levelset, without resorting to Poisson reconstruction or TSDF fusion as in previous work. We approximate the surface normal of Gaussians as the normal of the ray-Gaussian intersection plane, enabling the application of regularization that significantly enhances geometry. Furthermore, we develop an efficient geometry extraction method utilizing marching tetrahedra, where the tetrahedral grids are induced from 3D Gaussians and thus adapt to the scene's complexity. Our evaluations reveal that GOF surpasses existing 3DGS-based methods in surface reconstruction and novel view synthesis. Further, it compares favorably to, or even outperforms, neural implicit methods in both quality and speed.
https://arxiv.org/abs/2404.10772
Diffusion models have exhibited remarkable capabilities in text-to-image generation. However, their performance in image-to-text generation, specifically image captioning, has lagged behind Auto-Regressive (AR) models, casting doubt on their applicability for such tasks. In this work, we revisit diffusion models, highlighting their capacity for holistic context modeling and parallel decoding. With these benefits, diffusion models can alleviate the inherent limitations of AR methods, including their slow inference speed, error propagation, and unidirectional constraints. Furthermore, we identify the prior underperformance of diffusion models stemming from the absence of an effective latent space for image-text alignment, and the discrepancy between continuous diffusion processes and discrete textual data. In response, we introduce a novel architecture, LaDiC, which utilizes a split BERT to create a dedicated latent space for captions and integrates a regularization module to manage varying text lengths. Our framework also includes a diffuser for semantic image-to-text conversion and a Back&Refine technique to enhance token interactivity during inference. LaDiC achieves state-of-the-art performance for diffusion-based methods on the MS COCO dataset with 38.2 BLEU@4 and 126.2 CIDEr, demonstrating exceptional performance without pre-training or ancillary modules. This indicates strong competitiveness with AR models, revealing the previously untapped potential of diffusion models in image-to-text generation.
https://arxiv.org/abs/2404.10763
Inductive biases are crucial in disentangled representation learning for narrowing down an underspecified solution set. In this work, we consider endowing a neural network autoencoder with three select inductive biases from the literature: data compression into a grid-like latent space via quantization, collective independence amongst latents, and minimal functional influence of any latent on how other latents determine data generation. In principle, these inductive biases are deeply complementary: they most directly specify properties of the latent space, encoder, and decoder, respectively. In practice, however, naively combining existing techniques instantiating these inductive biases fails to yield significant benefits. To address this, we propose adaptations to the three techniques that simplify the learning problem, equip key regularization terms with stabilizing invariances, and quash degenerate incentives. The resulting model, Tripod, achieves state-of-the-art results on a suite of four image disentanglement benchmarks. We also verify that Tripod significantly improves upon its naive incarnation and that all three of its "legs" are necessary for best performance.
https://arxiv.org/abs/2404.10282
Early identification of drought stress in crops is vital for implementing effective mitigation measures and reducing yield loss. Non-invasive imaging techniques hold immense potential by capturing subtle physiological changes in plants under water deficit. Sensor-based imaging data serves as a rich source of information for machine learning and deep learning algorithms, facilitating further analysis aimed at identifying drought stress. While these approaches yield favorable results, real-time field applications require algorithms specifically designed for the complexities of natural agricultural conditions. Our work proposes a novel deep learning framework for classifying drought stress in potato crops captured by UAVs in natural settings. The novelty lies in the synergistic combination of a pretrained network with carefully designed custom layers. This architecture leverages the feature extraction capabilities of the pretrained network while the custom layers enable targeted dimensionality reduction and enhanced regularization, ultimately leading to improved performance. A key innovation of our work is the integration of Gradient-weighted Class Activation Mapping (Grad-CAM), an explainability technique. Grad-CAM sheds light on the internal workings of the deep learning model, typically referred to as a black box. By visualizing the focus areas of the model within the images, Grad-CAM fosters interpretability and builds trust in the decision-making process of the model. Our proposed framework achieves superior performance, particularly with the DenseNet121 pretrained network, reaching a precision of 98% in identifying the stressed class with an overall accuracy of 90%. Comparative analysis with existing state-of-the-art object detection algorithms reveals the superiority of our approach, with significantly higher precision and accuracy.
https://arxiv.org/abs/2404.10073
Studies continually find that message-passing graph convolutional networks suffer from the over-smoothing issue. Basically, over-smoothing refers to the phenomenon that the learned embeddings for all nodes become very similar to one another, and therefore uninformative, after repeatedly applying message-passing iterations. Intuitively, we can expect the generated embeddings to become smoother layer by layer; that is, each layer of graph convolution generates a smoothed version of the embeddings compared to those generated by the previous layer. Based on this intuition, we propose RandAlign, a stochastic regularization method for graph convolutional networks. The idea of RandAlign is to randomly align the learned embedding for each node with that of the previous layer using random interpolation in each graph convolution layer. Through alignment, the smoothness of the generated embeddings is explicitly reduced. To better maintain the benefit yielded by the graph convolution, in the alignment step we first scale the embedding of the previous layer to the same norm as the generated embedding and then perform random interpolation to align the generated embedding. RandAlign is a parameter-free method and can be directly applied without introducing additional trainable weights or hyper-parameters. We experimentally evaluate RandAlign on different graph-domain tasks on seven benchmark datasets. The experimental results show that RandAlign is a general method that improves the generalization performance of various graph convolutional network models and also improves the numerical stability of optimization, advancing the state of the art in graph representation learning.
https://arxiv.org/abs/2404.09774
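The alignment step described above is concrete enough to sketch: scale the previous layer's embeddings to the norm of the current ones, then randomly interpolate between the two. Details such as whether the interpolation weight is drawn per node or per layer are assumptions here:

```python
import numpy as np

def rand_align(h, h_prev, rng):
    """RandAlign-style stochastic alignment (sketch).

    Scales the previous layer's embeddings to match the per-node norm of
    the current embeddings, then randomly interpolates node by node."""
    norm_prev = np.linalg.norm(h_prev, axis=1, keepdims=True) + 1e-12
    norm_cur = np.linalg.norm(h, axis=1, keepdims=True)
    h_prev_scaled = h_prev / norm_prev * norm_cur      # match per-node norms
    alpha = rng.uniform(size=(h.shape[0], 1))          # random mixing weights
    return alpha * h + (1 - alpha) * h_prev_scaled

rng = np.random.default_rng(0)
h_prev = rng.normal(size=(5, 8))   # embeddings from the previous GNN layer
h = rng.normal(size=(5, 8))        # embeddings after this layer's convolution
h_aligned = rand_align(h, h_prev, rng)
print(h_aligned.shape)             # (5, 8): same shape, smoothing reduced
```

Because both mixed terms share each node's norm, the interpolation perturbs direction rather than scale, which is what preserves the benefit of the graph convolution while counteracting layer-wise smoothing.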
The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image, which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame, which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder, which is conditioned with keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner that factorize their appearance, head pose, and facial expressions. Thus, they are perfectly suited for cross-reenactment. In contrast to most related work, our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and support generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state-of-the-art in terms of motion transfer quality and temporal consistency.
https://arxiv.org/abs/2404.09736
Parameter-efficient fine-tuning methods, represented by LoRA, play an essential role in adapting large-scale pre-trained models to downstream tasks. However, fine-tuning LoRA-series models also faces the risk of overfitting on the training dataset, and yet there's still a lack of theoretical guidance and practical mechanism to control overfitting on LoRA-based PEFT methods. In this paper, we propose a LoRA Dropout mechanism for the LoRA-based methods by introducing random noises to the learnable low-rank matrices and increasing parameter sparsity. We then demonstrate the theoretical mechanism of our LoRA Dropout mechanism from the perspective of sparsity regularization by providing a generalization error bound under this framework. Theoretical results show that appropriate sparsity would help tighten the gap between empirical and generalization risks and thereby control overfitting. Furthermore, based on the LoRA Dropout framework, we introduce a test-time ensemble strategy and provide theoretical evidence demonstrating that the ensemble method can further compress the error bound, and lead to better performance during inference time. Extensive experiments on various NLP tasks provide practical validations of the effectiveness of our LoRA Dropout framework in improving model accuracy and calibration.
https://arxiv.org/abs/2404.09610
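A minimal sketch of the LoRA Dropout idea: apply (inverted) dropout masks to the learnable low-rank factors during the forward pass. The shapes and the element-wise masking granularity are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def lora_forward_with_dropout(x, W0, A, B, drop_p=0.1, rng=None, training=True):
    """LoRA forward pass with dropout on the low-rank factors (sketch).

    y = x W0 + x (A_drop B_drop): entries of the learnable factors A and B
    are randomly zeroed during training, sparsifying the low-rank update."""
    if training and rng is not None:
        keep = 1.0 - drop_p
        A = A * rng.binomial(1, keep, size=A.shape) / keep  # inverted dropout
        B = B * rng.binomial(1, keep, size=B.shape) / keep
    return x @ W0 + x @ A @ B

rng = np.random.default_rng(0)
d, r, k = 16, 4, 16
x = rng.normal(size=(2, d))
W0 = rng.normal(size=(d, k))        # frozen pretrained weight
A = rng.normal(size=(d, r)) * 0.01  # learnable low-rank factor
B = np.zeros((r, k))                # LoRA initializes B to zero
y = lora_forward_with_dropout(x, W0, A, B, rng=rng)
print(y.shape)  # (2, 16); with B = 0 the low-rank update vanishes
```

At test time the dropout is disabled; the paper's test-time ensemble strategy instead averages several sampled masks, which its analysis shows further tightens the error bound.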
Neural implicit fields have established a new paradigm for scene representation, with subsequent work achieving high-quality real-time rendering. However, reconstructing 3D scenes from oblique aerial photography presents unique challenges, such as varying spatial scale distributions and a constrained range of tilt angles, often resulting in high memory consumption and reduced rendering quality at extrapolated viewpoints. In this paper, we enhance MERF to accommodate these data characteristics by introducing an innovative adaptive occupancy plane optimized during the volume rendering process and a smoothness regularization term for view-dependent color to address these issues. Our approach, termed Oblique-MERF, surpasses state-of-the-art real-time methods by approximately 0.7 dB, reduces VRAM usage by about 40%, and achieves higher rendering frame rates with more realistic rendering outcomes across most viewpoints.
https://arxiv.org/abs/2404.09531
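The abstract's smoothness regularization for view-dependent color can be sketched as a penalty on how much predicted color changes under a small perturbation of the viewing direction. This is a simplified stand-in under stated assumptions — the paper's exact term is not specified here, and `color_fn` is a hypothetical stand-in for the model's view-dependent color branch.

```python
import numpy as np

def view_smoothness_loss(color_fn, dirs, eps=1e-2, rng=None):
    """Penalize rapid variation of view-dependent color: colors predicted
    for a direction and a slightly perturbed, renormalized direction
    should agree. Returns the mean squared difference."""
    if rng is None:
        rng = np.random.default_rng(0)
    perturbed = dirs + rng.normal(scale=eps, size=dirs.shape)
    perturbed /= np.linalg.norm(perturbed, axis=-1, keepdims=True)
    diff = color_fn(dirs) - color_fn(perturbed)
    return float(np.mean(diff ** 2))
```

A view-independent color function incurs zero loss, while strongly view-dependent colors are penalized, discouraging the view-dependent branch from overfitting to the narrow range of observed tilt angles.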
Federated learning (FL) facilitates a privacy-preserving neural network training paradigm through collaboration between edge clients and a central server. One significant challenge is that the distributed data is not independently and identically distributed (non-IID), typically involving both intra-domain and inter-domain heterogeneity. However, recent research is limited to simply using averaged signals as a form of regularization and only focuses on one aspect of these non-IID challenges. Given these limitations, this paper clarifies these two non-IID challenges and introduces cluster representation to address them from both local and global perspectives. Specifically, we propose a dual-clustered feature contrast-based FL framework with two focuses. First, we employ clustering on the local representations of each client, aiming to capture intra-class information from these local clusters at a high level of granularity. Then, we facilitate cross-client knowledge sharing by pulling each local representation closer to clusters shared by clients with similar semantics while pushing it away from clusters with dissimilar semantics. Second, since the sizes of local clusters belonging to the same class may differ across clients, we further apply clustering on the global side and average the results to create a consistent global signal that guides each local training in a contrastive manner. Experimental results on multiple datasets demonstrate that our proposal achieves comparable or superior performance gains under both intra-domain and inter-domain heterogeneity.
https://arxiv.org/abs/2404.09259
Cinemagraph is a unique form of visual media that combines elements of still photography and subtle motion to create a captivating experience. However, the majority of videos generated by recent works lack depth information and are confined to 2D image space. In this paper, inspired by the significant progress in novel view synthesis (NVS) achieved by 3D Gaussian Splatting (3D-GS), we propose LoopGaussian, which elevates cinemagraphs from 2D image space to 3D space using 3D Gaussian modeling. To achieve this, we first employ the 3D-GS method to reconstruct 3D Gaussian point clouds from multi-view images of static scenes, incorporating shape regularization terms to prevent blurring or artifacts caused by object deformation. We then adopt an autoencoder tailored for 3D Gaussians to project them into feature space. To maintain the local continuity of the scene, we devise SuperGaussian, which clusters based on the acquired features. By calculating the similarity between clusters and employing a two-stage estimation method, we derive an Eulerian motion field that describes velocities across the entire scene. The 3D Gaussian points then move within the estimated Eulerian motion field. Through bidirectional animation techniques, we ultimately generate a 3D cinemagraph that exhibits natural and seamlessly loopable dynamics. Experimental results validate the effectiveness of our approach, demonstrating high-quality and visually appealing scene generation.
https://arxiv.org/abs/2404.08966
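The bidirectional animation idea above — advecting points through a time-invariant Eulerian velocity field both forward and backward, then cross-fading the two trajectories so the first and last frames coincide — can be sketched in a few lines of numpy. The Euler integration scheme and linear blend weights are assumptions of this sketch, not the paper's exact procedure.

```python
import numpy as np

def advect(points, velocity_fn, n_steps, dt, direction=1.0):
    """Euler-integrate points through a time-invariant Eulerian velocity
    field; direction=+1 moves forward in time, -1 backward."""
    traj = [points.copy()]
    p = points.copy()
    for _ in range(n_steps):
        p = p + direction * dt * velocity_fn(p)
        traj.append(p.copy())
    return np.stack(traj)          # (n_steps + 1, n_points, dim)

def bidirectional_loop(points, velocity_fn, n_frames, dt):
    """Cross-fade forward and (time-reversed) backward trajectories so
    the first and last frames coincide, giving a seamless loop."""
    fwd = advect(points, velocity_fn, n_frames - 1, dt, +1.0)
    bwd = advect(points, velocity_fn, n_frames - 1, dt, -1.0)[::-1]
    w = np.linspace(0.0, 1.0, n_frames)[:, None, None]
    return (1.0 - w) * fwd + w * bwd
```

At frame 0 the blend weight is entirely on the forward trajectory (which starts at the input points) and at the last frame entirely on the reversed backward one (which ends at the input points), so the sequence closes on itself exactly.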