Pre-trained large text-to-image (T2I) models, paired with an appropriate text prompt, have attracted growing interest in the field of customized image generation. However, the catastrophic forgetting issue makes it hard to continually synthesize new user-provided styles while retaining satisfying results on previously learned styles. In this paper, we propose MuseumMaker, a method that enables the synthesis of images following a set of customized styles in a never-ending manner, gradually accumulating these creative artistic works as a Museum. When facing a new customization style, we develop a style distillation loss module to transfer the style of the whole dataset into the generation of images. It minimizes the learning bias caused by image content and addresses the catastrophic overfitting issue induced by few-shot images. To deal with catastrophic forgetting among previously learned styles, we devise a dual regularization for the shared-LoRA module to optimize the direction of model updates, which regularizes the diffusion model from both the weight and the feature perspectives. Meanwhile, a task-wise token learning module learns a unique token embedding corresponding to the new style, which preserves historical knowledge from past styles under the limited quantity of LoRA parameters. As each new user-provided style comes in, our MuseumMaker can capture the nuances of the new style while maintaining the details of learned styles. Experimental results on diverse style datasets validate the effectiveness of the proposed MuseumMaker, showcasing its robustness and versatility across various scenarios.
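The shared-LoRA module above builds on the standard LoRA parameterization, in which a frozen weight matrix is adapted through a trainable low-rank delta. A minimal numpy sketch of that building block (function and variable names are our own; the paper's dual regularization itself is not shown):

```python
import numpy as np

def lora_update(W, A, B, alpha=1.0):
    """Apply a low-rank LoRA delta to a frozen weight matrix.

    W: (d_out, d_in) frozen base weight; B: (d_out, r) and A: (r, d_in)
    are the trainable low-rank factors, so the delta B @ A has rank <= r.
    alpha / r is the usual LoRA scaling of the delta.
    """
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 2))   # rank-2 adapter factors
A = rng.standard_normal((2, 8))
W_new = lora_update(W, A, B, alpha=2.0)
delta = W_new - W                  # the learned update is low-rank
```

Only `B @ A` is trained per style, which is why the parameter budget stays small compared to updating `W` directly.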
https://arxiv.org/abs/2404.16612
Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of errors. In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE). This framework assesses the interactive motion of agents by employing neural interaction energy, which captures the dynamics of interactions and illustrates their influence on the future trajectories of agents. To bolster temporal stability, we introduce two constraints: inter-agent interaction constraint and intra-agent motion constraint. These constraints work together to ensure temporal stability at both the system and agent levels, effectively mitigating prediction fluctuations inherent in multi-agent systems. Comparative evaluations against previous methods on four diverse datasets highlight the superior prediction accuracy and generalization capabilities of our model.
https://arxiv.org/abs/2404.16579
Model-free reinforcement learning methods lack an inherent mechanism to impose behavioural constraints on the trained policies. While certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work, we try to unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy, as an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal is revealed. Furthermore, using this framework, we are able to introduce some novel types of constraints, allowing one to impose bounds on the policy's action density or on costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, a practical algorithm is derived that supports various combinations of policy constraints that are automatically handled throughout training using trainable reward modifications. The resulting $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.
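The primal-dual mechanics can be illustrated on a toy one-state problem: a cost constraint enters the primal as a trainable reward modification $r - \lambda c$, while dual ascent adjusts $\lambda$ on the observed constraint violation. This is a generic Lagrangian sketch under our own toy setup, not the $\texttt{DualCRL}$ algorithm itself:

```python
import numpy as np

# Two actions, each a (reward, cost) pair; constraint: average cost <= budget.
actions = [(1.0, 2.0), (0.5, 0.5)]
budget = 1.0
lam, eta = 0.0, 0.01   # dual variable and its step size
costs = []
for _ in range(4000):
    # Primal step: act greedily on the modified reward r - lam * c.
    r, c = max(actions, key=lambda rc: rc[0] - lam * rc[1])
    costs.append(c)
    # Dual step: raise lam when the constraint is violated, lower it otherwise,
    # projecting back onto lam >= 0.
    lam = max(0.0, lam + eta * (c - budget))
avg_cost = np.mean(costs[-2000:])   # settles near the budget
```

The dual variable oscillates around the value where the two actions' modified rewards tie, so the time-averaged cost converges to the budget, mirroring how dual constraints translate into reward modifications in the primal.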
https://arxiv.org/abs/2404.16468
This paper handles the problem of converting real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though this problem could be realized by a wide range of image-to-image translation models, a notable issue with all these methods is that the original image content details can easily be erased or corrupted by the transfer of ink-wash style elements. To solve or ameliorate this issue, we propose to incorporate saliency detection into the unpaired image-to-image translation framework to regularize content information of the generated paintings. The saliency map is utilized for content regularization from two aspects, both explicitly and implicitly: (\romannumeral1) we propose a saliency IOU (SIOU) loss to explicitly regularize saliency consistency before and after stylization; (\romannumeral2) we propose saliency adaptive normalization (SANorm), which implicitly enhances content integrity of the generated paintings by injecting saliency information into the generator network to guide painting generation. Besides, we also propose a saliency attended discriminator network which harnesses the saliency mask to focus generative adversarial attention onto salient image regions, contributing to a finer ink-wash stylization effect for the salient objects of images. Qualitative and quantitative experiments consistently demonstrate the superiority of our model over related advanced methods for Chinese ink-wash painting style transfer.
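The explicit SIOU term can be written as a differentiable soft-IoU over continuous saliency maps, with element-wise min/max standing in for set intersection/union; a hedged numpy sketch (the paper's exact formulation may differ):

```python
import numpy as np

def siou_loss(s_before, s_after, eps=1e-8):
    """Soft-IoU loss between two saliency maps with values in [0, 1].

    Using element-wise min/max keeps the loss differentiable with
    respect to continuous saliency values; 0 means identical maps.
    """
    inter = np.minimum(s_before, s_after).sum()
    union = np.maximum(s_before, s_after).sum() + eps
    return 1.0 - inter / union

s1 = np.array([1.0, 0.0, 0.5])   # saliency before stylization
s2 = np.array([0.0, 1.0, 0.5])   # saliency after stylization
same = siou_loss(s1, s1)          # identical maps -> loss near 0
diff = siou_loss(s1, s2)          # shifted saliency -> large loss
```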
https://arxiv.org/abs/2404.15743
This paper studies interpretability of convolutional networks by means of saliency maps. Most approaches based on Class Activation Maps (CAM) combine information from fully connected layers and gradients through variants of backpropagation. However, it is well understood that gradients are noisy, and alternatives like guided backpropagation have been proposed to obtain better visualization at inference. In this work, we present a novel training approach to improve the quality of gradients for interpretability. In particular, we introduce a regularization loss such that the gradient with respect to the input image obtained by standard backpropagation is similar to the gradient obtained by guided backpropagation. We find that the resulting gradient is qualitatively less noisy and quantitatively improves the interpretability properties of different networks, as measured by several interpretability methods.
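The proposed regularizer reduces to a similarity penalty between the two gradient fields; a minimal sketch using cosine similarity as the distance (the paper may use a different similarity measure, and the gradients here are placeholder arrays rather than actual backprop outputs):

```python
import numpy as np

def grad_alignment_loss(g_std, g_guided, eps=1e-8):
    """Penalize dissimilarity between the standard input gradient and
    the guided-backpropagation gradient (both image-shaped arrays).
    Returns 1 - cosine similarity: 0 when aligned, 2 when opposed."""
    a, b = g_std.ravel(), g_guided.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - cos

g = np.array([0.2, -0.5, 1.0])        # stand-in for an input gradient
aligned = grad_alignment_loss(g, g)    # identical gradients -> ~0
opposed = grad_alignment_loss(g, -g)   # opposite gradients -> ~2
```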
https://arxiv.org/abs/2404.15024
In the field of Artificial Intelligence for Information Technology Operations, causal discovery is pivotal for operation and maintenance of graph construction, facilitating downstream industrial tasks such as root cause analysis. Temporal causal discovery, as an emerging method, aims to identify temporal causal relationships between variables directly from observations by utilizing interventional data. However, existing methods mainly focus on synthetic datasets with heavy reliance on intervention targets and ignore the textual information hidden in real-world systems, failing to conduct causal discovery for real industrial scenarios. To tackle this problem, in this paper we propose to investigate temporal causal discovery in industrial scenarios, which faces two critical challenges: 1) how to discover causal relationships without the interventional targets that are costly to obtain in practice, and 2) how to discover causal relations via leveraging the textual information in systems which can be complex yet abundant in industrial contexts. To address these challenges, we propose the RealTCD framework, which is able to leverage domain knowledge to discover temporal causal relationships without interventional targets. Specifically, we first develop a score-based temporal causal discovery method capable of discovering causal relations for root cause analysis without relying on interventional targets through strategic masking and regularization. Furthermore, by employing Large Language Models (LLMs) to handle texts and integrate domain knowledge, we introduce LLM-guided meta-initialization to extract the meta-knowledge from textual information hidden in systems to boost the quality of discovery. We conduct extensive experiments on simulation and real-world datasets to show the superiority of our proposed RealTCD framework over existing baselines in discovering temporal causal structures.
https://arxiv.org/abs/2404.14786
This paper introduces \textbf{Q-tuning}, a novel approach for continual prompt tuning that enables the lifelong learning of a pre-trained language model. When learning a new task, Q-tuning trains a task-specific prompt by adding it to a prompt queue consisting of the prompts from older tasks. To better transfer the knowledge of old tasks, we design an adaptive knowledge aggregation technique that reweighs previous prompts in the queue with a learnable low-rank matrix. Once the prompt queue reaches its maximum capacity, we leverage a PCA-based eviction rule to reduce the queue's size, allowing the newly trained prompt to be added while preserving the primary knowledge of old tasks. In order to mitigate the accumulation of information loss caused by the eviction, we additionally propose a globally shared prefix prompt and a memory retention regularization based on information theory. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods substantially on continual prompt tuning benchmarks. Moreover, our approach enables lifelong learning on linearly growing task sequences while requiring constant complexity for training and inference.
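The PCA-based eviction step can be sketched as compressing the full queue of prompt vectors into a few pseudo-prompts spanning its top principal directions. A numpy sketch under that assumption (function name and the exact reconstruction are our own; Q-tuning's rule may differ in detail):

```python
import numpy as np

def evict_with_pca(queue, k):
    """Compress a full prompt queue of shape (n, d) into k pseudo-prompts
    that preserve the top-k principal directions of the old prompts."""
    mean = queue.mean(axis=0)
    U, S, Vt = np.linalg.svd(queue - mean, full_matrices=False)
    # One pseudo-prompt per leading principal component, scaled by its
    # singular value so the dominant variance of old tasks is retained.
    return mean + S[:k, None] * Vt[:k]

rng = np.random.default_rng(1)
queue = rng.standard_normal((10, 6))   # 10 old prompts of dimension 6
compact = evict_with_pca(queue, k=3)    # room freed for a new prompt
```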
https://arxiv.org/abs/2404.14607
Continual learning, the ability to acquire knowledge from new data while retaining previously learned information, is a fundamental challenge in machine learning. Various approaches, including memory replay, knowledge distillation, model regularization, and dynamic network expansion, have been proposed to address this issue. Thus far, dynamic network expansion methods have achieved state-of-the-art performance at the cost of incurring significant computational overhead. This is due to the need for additional model buffers, which makes it less feasible in resource-constrained settings, particularly in the medical domain. To overcome this challenge, we propose Dynamic Model Merging, DynaMMo, a method that merges multiple networks at different stages of model training to achieve better computational efficiency. Specifically, we employ lightweight learnable modules for each task and combine them into a unified model to minimize computational overhead. DynaMMo achieves this without compromising performance, offering a cost-effective solution for continual learning in medical applications. We evaluate DynaMMo on three publicly available datasets, demonstrating its effectiveness compared to existing approaches. DynaMMo offers around 10-fold reduction in GFLOPS with a small drop of 2.76 in average accuracy when compared to state-of-the-art dynamic-based approaches. The code implementation of this work will be available upon the acceptance of this work at this https URL.
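Model merging in its simplest form averages corresponding weights of the task-specific lightweight modules into one set; a toy sketch of that idea (DynaMMo's actual merging procedure may be more involved):

```python
import numpy as np

def merge_modules(modules):
    """Merge per-task modules (each a list of layer weight arrays)
    into a single module by uniform layer-wise averaging."""
    return [np.mean(np.stack(layer), axis=0) for layer in zip(*modules)]

# Two task-specific modules with the same two-layer structure.
task_a = [np.array([[2.0, 0.0]]), np.array([1.0])]
task_b = [np.array([[4.0, 2.0]]), np.array([3.0])]
merged = merge_modules([task_a, task_b])   # one unified module
```

Because the merged module replaces several per-task buffers, inference cost stays constant as tasks accumulate, which is the efficiency argument made above.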
https://arxiv.org/abs/2404.14099
Building fair deep neural networks (DNNs) is a crucial step towards achieving trustworthy artificial intelligence. Delving into the deeper factors that affect the fairness of DNNs is paramount and serves as the foundation for mitigating model biases. However, current methods are limited in accurately predicting DNN biases, relying solely on the number of training samples and lacking more precise measurement tools. Here, we establish a geometric perspective for analyzing the fairness of DNNs, comprehensively exploring how DNNs internally shape the intrinsic geometric characteristics of datasets, namely the intrinsic dimensions (IDs) of perceptual manifolds, and the impact of IDs on the fairness of DNNs. Based on multiple findings, we propose Intrinsic Dimension Regularization (IDR), which enhances the fairness and performance of models by promoting the learning of concise and ID-balanced class perceptual manifolds. In various image recognition benchmark tests, IDR significantly mitigates model bias while improving its performance.
https://arxiv.org/abs/2404.13859
A high-precision feature extraction model is crucial for change detection (CD). In the past, many deep learning-based supervised CD methods learned to recognize change feature patterns from a large number of labelled bi-temporal images, whereas labelling bi-temporal remote sensing images is very expensive and often time-consuming. Therefore, we propose a coarse-to-fine semi-supervised CD method based on consistency regularization (C2F-SemiCD), which includes a coarse-to-fine CD network with a multiscale attention mechanism (C2FNet) and a semi-supervised update method. The C2FNet network gradually completes the extraction of change features from coarse-grained to fine-grained through multiscale feature fusion, a channel attention mechanism, a spatial attention mechanism, a global context module, a feature refine module, an initial aggregation module, and a final aggregation module. The semi-supervised update method uses the mean teacher method: the parameters of the teacher model are updated from the parameters of the student model using the exponential moving average (EMA) method. Through extensive experiments on three datasets and meticulous ablation studies, including crossover experiments across datasets, we verify the significant effectiveness and efficiency of the proposed C2F-SemiCD method. The code will be open at: this https URL.
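The mean-teacher update used in the semi-supervised scheme is a one-liner: the teacher's weights track an exponential moving average of the student's weights after each step. A minimal sketch:

```python
def ema_update(teacher, student, decay=0.99):
    """Mean-teacher EMA update, applied element-wise per parameter:
    teacher <- decay * teacher + (1 - decay) * student."""
    return [decay * t + (1 - decay) * s for t, s in zip(teacher, student)]

# Toy single-parameter models: the teacher drifts slowly toward the student.
teacher = [0.0]
student = [1.0]
teacher = ema_update(teacher, student, decay=0.9)   # teacher[0] -> 0.1
```

A high decay makes the teacher a smoothed, temporally averaged copy of the student, which is what makes its pseudo-labels stable enough to supervise unlabelled images.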
https://arxiv.org/abs/2404.13838
Domain generalized semantic segmentation is an essential computer vision task, in which models leverage only source data to learn the capability of generalized semantic segmentation towards unseen target domains. Previous works typically address this challenge by global style randomization or feature regularization. In this paper, we argue that, given the observation that different local semantic regions exhibit different visual characteristics from the source domain to the target domain, methods focusing on global operations are hard-pressed to capture such regional discrepancies, thus failing to construct domain-invariant representations that are consistent from the local to the global level. Therefore, we propose the Semantic-Rearrangement-based Multi-Level Alignment (SRMA) to overcome this problem. SRMA first incorporates a Semantic Rearrangement Module (SRM), which conducts semantic region randomization to sufficiently enhance the diversity of the source domain. A Multi-Level Alignment module (MLA) is subsequently proposed with the help of such diversity to establish global-regional-local consistent domain-invariant representations. By aligning features across randomized samples with domain-neutral knowledge at multiple levels, SRMA provides a more robust way to handle the source-target domain gap. Extensive experiments demonstrate the superiority of SRMA over the current state-of-the-art works on various benchmarks.
https://arxiv.org/abs/2404.13701
The robustness of Transformer-based Natural Language Inference encoders is frequently compromised as they tend to rely more on dataset biases than on the intended task-relevant features. Recent studies have attempted to mitigate this by reducing the weight of biased samples during the training process. However, these debiasing methods primarily focus on identifying which samples are biased without explicitly determining the biased components within each case. This limitation restricts those methods' capability in out-of-distribution inference. To address this issue, we aim to train models to adopt the logic humans use in explaining causality. We propose a simple, comprehensive, and interpretable method: Explanation based Bias Decoupling Regularization (EBD-Reg). EBD-Reg employs human explanations as criteria, guiding the encoder to establish a tripartite parallel supervision of Distinguishing, Decoupling and Aligning. This method enables encoders to identify and focus on keywords that represent the task-relevant features during inference, while discarding the residual elements acting as biases. Empirical evidence underscores that EBD-Reg effectively guides various Transformer-based encoders to decouple biases through a human-centric lens, significantly surpassing other methods in terms of out-of-distribution inference capabilities.
https://arxiv.org/abs/2404.13390
We introduce Contrastive Gaussian Clustering, a novel approach capable of providing segmentation masks from any viewpoint and of enabling 3D segmentation of the scene. Recent works in novel-view synthesis have shown how to model the appearance of a scene via a cloud of 3D Gaussians, and how to generate accurate images from a given viewpoint by projecting the Gaussians onto it and $\alpha$-blending their colors. Following this example, we train a model that also includes a segmentation feature vector for each Gaussian. These can then be used for 3D scene segmentation, by clustering Gaussians according to their feature vectors, and to generate 2D segmentation masks, by projecting the Gaussians onto a plane and $\alpha$-blending over their segmentation features. Using a combination of contrastive learning and spatial regularization, our method can be trained on inconsistent 2D segmentation masks and still learn to generate segmentation masks consistent across all views. Moreover, the resulting model is extremely accurate, improving the IoU accuracy of the predicted masks by $+8\%$ over the state of the art. Code and trained models will be released soon.
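The 2D masks come from compositing per-Gaussian feature vectors with the same front-to-back $\alpha$-blending used for color; a hedged sketch of that accumulation along one ray (sorted Gaussians and their alphas are taken as given):

```python
import numpy as np

def alpha_blend_features(features, alphas):
    """Front-to-back alpha compositing of per-Gaussian feature vectors:
    each Gaussian contributes its feature weighted by its alpha times
    the transmittance remaining after the Gaussians in front of it."""
    out = np.zeros(features.shape[1])
    transmittance = 1.0
    for f, a in zip(features, alphas):
        out += transmittance * a * f
        transmittance *= 1.0 - a
    return out

# Two Gaussians along a ray, with 2-dim segmentation features.
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
blended = alpha_blend_features(feats, alphas=[0.6, 1.0])  # -> [0.6, 0.4]
```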
https://arxiv.org/abs/2404.12784
Representation rank is an important concept for understanding the role of Neural Networks (NNs) in Deep Reinforcement learning (DRL), which measures the expressive capacity of value networks. Existing studies focus on unboundedly maximizing this rank; nevertheless, that approach would introduce overly complex models in the learning, thus undermining performance. Hence, fine-tuning representation rank presents a challenging and crucial optimization problem. To address this issue, we find a guiding principle for adaptive control of the representation rank. We employ the Bellman equation as a theoretical foundation and derive an upper bound on the cosine similarity of consecutive state-action pairs representations of value networks. We then leverage this upper bound to propose a novel regularizer, namely BEllman Equation-based automatic rank Regularizer (BEER). This regularizer adaptively regularizes the representation rank, thus improving the DRL agent's performance. We first validate the effectiveness of automatic control of rank on illustrative experiments. Then, we scale up BEER to complex continuous control tasks by combining it with the deterministic policy gradient method. Among 12 challenging DeepMind control tasks, BEER outperforms the baselines by a large margin. Besides, BEER demonstrates significant advantages in Q-value approximation. Our code is available at this https URL.
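The regularizer hinges on the cosine similarity of consecutive state-action representations exceeding a derived upper bound; a schematic numpy version where the bound is taken as given (a hedged sketch, not BEER's exact loss):

```python
import numpy as np

def cosine_rank_penalty(phi_t, phi_tp1, bound):
    """Hinge penalty on the cosine similarity of consecutive
    representations: zero while the similarity stays under the
    Bellman-derived bound, positive once it exceeds it."""
    cos = phi_t @ phi_tp1 / (
        np.linalg.norm(phi_t) * np.linalg.norm(phi_tp1) + 1e-8)
    return max(0.0, cos - bound)

phi_t = np.array([1.0, 0.0])
high = cosine_rank_penalty(phi_t, np.array([1.0, 0.0]), bound=0.5)  # penalized
low = cosine_rank_penalty(phi_t, np.array([0.0, 1.0]), bound=0.5)   # no penalty
```

Penalizing only the excess similarity is what makes the rank control adaptive rather than unboundedly maximizing it.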
https://arxiv.org/abs/2404.12754
Clipart, a pre-made graphic art form, offers a convenient and efficient way of illustrating visual content. Traditional workflows to convert static clipart images into motion sequences are laborious and time-consuming, involving numerous intricate steps like rigging, key animation and in-betweening. Recent advancements in text-to-video generation hold great potential in resolving this problem. Nevertheless, direct application of text-to-video generation models often struggles to retain the visual identity of clipart images or generate cartoon-style motions, resulting in unsatisfactory animation outcomes. In this paper, we introduce AniClipart, a system that transforms static clipart images into high-quality motion sequences guided by text-to-video priors. To generate cartoon-style and smooth motion, we first define Bézier curves over keypoints of the clipart image as a form of motion regularization. We then align the motion trajectories of the keypoints with the provided text prompt by optimizing the Video Score Distillation Sampling (VSDS) loss, which encodes adequate knowledge of natural motion within a pretrained text-to-video diffusion model. With a differentiable As-Rigid-As-Possible shape deformation algorithm, our method can be end-to-end optimized while maintaining deformation rigidity. Experimental results show that the proposed AniClipart consistently outperforms existing image-to-video generation models, in terms of text-video alignment, visual identity preservation, and motion consistency. Furthermore, we showcase the versatility of AniClipart by adapting it to generate a broader array of animation formats, such as layered animation, which allows topological changes.
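Regularizing keypoint motion with Bézier curves means each keypoint trajectory is a smooth low-order polynomial in time, determined by a handful of control points. A cubic Bézier evaluator in numpy (control points here are illustrative):

```python
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve with control points p0..p3 at
    parameter(s) t in [0, 1] via the Bernstein basis."""
    t = np.asarray(t, dtype=float)[..., None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

# An example arc for one keypoint: rises and falls over the animation.
p0, p1, p2, p3 = (np.array([0.0, 0.0]), np.array([0.3, 1.0]),
                  np.array([0.7, 1.0]), np.array([1.0, 0.0]))
start = cubic_bezier(p0, p1, p2, p3, 0.0)   # equals p0
end = cubic_bezier(p0, p1, p2, p3, 1.0)     # equals p3
mid = cubic_bezier(p0, p1, p2, p3, 0.5)
```

Optimizing only the control points, rather than every frame's displacement, is what keeps the resulting motion smooth and cartoon-like.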
https://arxiv.org/abs/2404.12347
X-ray images play a vital role in intraoperative processes due to their high resolution and fast imaging speed, and they greatly facilitate subsequent segmentation, registration, and reconstruction. However, excessive X-ray doses pose potential risks to human health. Data-driven algorithms from volume scans to X-ray images are restricted by the scarcity of paired X-ray and volume data. Existing methods are mainly realized by modelling the whole X-ray imaging procedure. In this study, we propose a learning-based approach termed CT2X-GAN to synthesize X-ray images in an end-to-end manner using content and style disentanglement from three different image domains. Our method decouples anatomical structure information from CT scans and style information from unpaired real X-ray images / digital reconstructed radiography (DRR) images via a series of decoupling encoders. Additionally, we introduce a novel consistency regularization term to improve the stylistic resemblance between synthesized X-ray images and real X-ray images. Meanwhile, we also impose a supervised process by computing the similarity between computed real DRR and synthesized DRR images. We further develop a pose attention module to fully strengthen the comprehensive information in the decoupled content code from CT scans, facilitating high-quality multi-view image synthesis in the lower 2D space. Extensive experiments were conducted on the publicly available CTSpine1K dataset, achieving 97.8350, 0.0842, and 3.0938 in terms of FID, KID, and a user-scored X-ray similarity, respectively. In comparison with 3D-aware methods ($\pi$-GAN, EG3D), CT2X-GAN is superior in synthesis quality and realism with respect to real X-ray images.
https://arxiv.org/abs/2404.11889
Self-supervised learning (SSL) has emerged as a promising technique for medical image analysis due to its ability to learn without annotations. However, despite the promising potential, conventional SSL methods encounter limitations, including challenges in achieving semantic alignment and capturing subtle details. This leads to suboptimal representations, which fail to accurately capture the underlying anatomical structures and pathological details. In response to these constraints, we introduce a novel SSL framework OPTiML, employing optimal transport (OT), to capture the dense semantic invariance and fine-grained details, thereby enhancing the overall effectiveness of SSL in medical image representation learning. The core idea is to integrate OT with a cross-viewpoint semantics infusion module (CV-SIM), which effectively captures complex, fine-grained details inherent in medical images across different viewpoints. In addition to the CV-SIM module, OPTiML imposes the variance and covariance regularizations within OT framework to force the model focus on clinically relevant information while discarding less informative features. Through these, the proposed framework demonstrates its capacity to learn semantically rich representations that can be applied to various medical imaging tasks. To validate its effectiveness, we conduct experimental studies on three publicly available datasets from chest X-ray modality. Our empirical results reveal OPTiML's superiority over state-of-the-art methods across all evaluated tasks.
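The variance and covariance regularizations follow the familiar VICReg-style recipe: keep each embedding dimension's standard deviation above a margin (preventing collapse) while decorrelating dimensions. A numpy sketch under that assumption (OPTiML's exact terms inside the OT framework may differ):

```python
import numpy as np

def var_cov_regularizer(z, gamma=1.0, eps=1e-4):
    """Variance term: hinge pushing each feature's std above gamma.
    Covariance term: squared off-diagonal covariance entries, pushing
    the embedding dimensions toward decorrelation."""
    n, d = z.shape
    z = z - z.mean(axis=0)
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss + cov_loss

# Fully collapsed embeddings: zero variance, so the hinge term fires.
collapsed = np.ones((4, 3))
loss_collapsed = var_cov_regularizer(collapsed)   # close to gamma
```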
https://arxiv.org/abs/2404.11868
Quantization lowers memory usage, computational requirements, and latency by utilizing fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications on model performance. In particular, first, we develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound for the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 models trained on CIFAR-10, CIFAR-100, and ImageNet datasets on convolutional and transformer-based models.
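Viewing quantization as regularization starts from the quantizer itself: rounding weights to a uniform grid injects bounded noise, and the paper relates the amount of that noise to generalization. A minimal symmetric uniform quantizer (a common textbook form, not necessarily the paper's exact scheme):

```python
import numpy as np

def quantize(w, num_bits=8):
    """Symmetric uniform quantization: snap weights to a grid with
    2**(num_bits - 1) - 1 positive levels. The rounding error acts as
    bounded noise of at most half a grid step per weight."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(100)
w_q = quantize(w, num_bits=8)
noise = np.abs(w - w_q).max()       # quantization noise per weight
step = np.abs(w).max() / 127        # grid step for 8-bit symmetric
```

Fewer bits mean a coarser grid and larger injected noise, which is the knob the generalization bound above is conditioned on.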
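The "quantization as bounded noise" view can be illustrated with a uniform symmetric quantizer: the perturbation it adds to each weight is bounded by half a quantization step, so running with quantized weights resembles injecting bounded noise into the parameters. This is a generic sketch of such a quantizer, not the paper's specific theoretical model:

```python
import numpy as np

def quantize(w, bits=8):
    """Uniform symmetric per-tensor quantization of a weight array to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for 8-bit signed
    scale = np.abs(w).max() / qmax        # step size (per-tensor scale)
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale, scale               # dequantized weights and step size

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
w_q, scale = quantize(w, bits=8)
noise = w_q - w                           # quantization "noise" seen by the network
# The perturbation magnitude never exceeds half a quantization step.
assert np.abs(noise).max() <= scale / 2 + 1e-12
```

Fewer bits mean a larger step size and hence stronger parameter noise, which is consistent with the regularization interpretation: the amount of quantization noise is the knob that the derived generalization bound conditions on.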
https://arxiv.org/abs/2404.11769
Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting animations that are semantically aware poses significant challenges, demanding expertise in both graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which combines two challenging tasks: it deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. Our technique harnesses vector graphics representations and an end-to-end optimization-based framework. This framework employs neural displacement fields to convert letters into base shapes and applies per-frame motion, encouraging coherence with the intended textual concept. Shape-preservation techniques and perceptual-loss regularization are employed to maintain legibility and structural integrity throughout the animation process. We demonstrate the generalizability of our approach across various text-to-video models and highlight the superiority of our end-to-end methodology over baseline methods that treat the sub-tasks separately. Through quantitative and qualitative evaluations, we show the effectiveness of our framework in generating coherent text animations that faithfully interpret user prompts while maintaining readability. Our code is available at: this https URL.
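The central mechanism, a neural displacement field applied per frame to a letter's vector control points, can be sketched in a few lines. Here a tiny MLP with random weights stands in for the field; in the actual framework its weights would be optimized end-to-end against a text-to-video prior, with shape-preservation and perceptual losses keeping the glyph legible:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny MLP standing in for the neural displacement field: it maps a 2-D
# control point plus a normalized frame time t to a 2-D displacement.
W1, b1 = rng.normal(scale=0.1, size=(3, 32)), np.zeros(32)
W2, b2 = rng.normal(scale=0.1, size=(32, 2)), np.zeros(2)

def displacement(points, t):
    """Displacement for each control point at time t in [0, 1)."""
    x = np.concatenate([points, np.full((len(points), 1), t)], axis=1)
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

# Base shape: control points of a placeholder letter outline.
base = rng.uniform(-1, 1, size=(12, 2))
# Per-frame motion: displace the same base shape at each of 24 frame times.
frames = [base + displacement(base, t / 24) for t in range(24)]
```

Because the field is a smooth function of time, nearby frames receive similar displacements, which is one reason a displacement-field parameterization encourages temporally coherent motion.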
https://arxiv.org/abs/2404.11614
Estimating the sound absorption in situ relies on accurately describing the measured sound field. Evidence suggests that modeling the reflection of impinging spherical waves is important, especially for compact measurement systems. This article proposes a method for estimating the sound absorption coefficient of a material sample by mapping the sound pressure, measured by a microphone array, to a distribution of monopoles along a line in the complex plane. The proposed method is compared to modeling the sound field as a superposition of two sources (a monopole and an image source). The resulting inverse problems are solved with Tikhonov regularization, with the regularization parameter chosen automatically by the L-curve criterion. The sound absorption measurement is tested with simulations of the sound field above infinite and finite porous absorbers. The approaches are compared to the plane-wave absorption coefficient and to the one obtained under spherical wave incidence. Experimental analysis of two porous samples and one resonant absorber is also carried out in situ. Four arrays were tested with increasing aperture and number of sensors. It was demonstrated that measurements are feasible even with an array with only a few microphones. The discretization of the integral equation led to a more accurate reconstruction of the sound pressure and particle velocity at the sample's surface. The resulting absorption coefficient agrees with the one obtained for spherical wave incidence, indicating that including more monopoles along the complex line is essential for modeling the sound field.
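The Tikhonov-plus-L-curve machinery is standard and can be sketched on a synthetic problem. Below, `A`, `x_true`, and `b` are toy stand-ins for the monopole-propagation matrix, source amplitudes, and array pressures (not the article's actual sound-field model), and the L-curve corner is picked with a simple discrete maximum-curvature heuristic:

```python
import numpy as np

rng = np.random.default_rng(0)

# Ill-posed toy inverse problem: recover source amplitudes x from noisy
# "array pressures" b = A x + noise.
A = rng.normal(size=(30, 50))
x_true = np.zeros(50)
x_true[[5, 20, 33]] = [1.0, -0.5, 0.8]
b = A @ x_true + 0.01 * rng.normal(size=30)

def tikhonov(A, b, lam):
    """Regularized least squares: argmin ||Ax - b||^2 + lam^2 ||x||^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lam**2 * np.eye(n), A.T @ b)

# L-curve: log residual norm vs. log solution norm over a lambda sweep.
lams = np.logspace(-4, 1, 40)
rho, eta = [], []
for lam in lams:
    x = tikhonov(A, b, lam)
    rho.append(np.log(np.linalg.norm(A @ x - b)))   # data misfit
    eta.append(np.log(np.linalg.norm(x)))           # solution size
rho, eta = np.array(rho), np.array(eta)

# Corner pick via discrete curvature of the parametric curve (rho, eta).
d1r, d1e = np.gradient(rho), np.gradient(eta)
d2r, d2e = np.gradient(d1r), np.gradient(d1e)
kappa = (d1r * d2e - d2r * d1e) / ((d1r**2 + d1e**2) ** 1.5 + 1e-12)
lam_star = lams[np.argmax(np.abs(kappa))]
x_reg = tikhonov(A, b, lam_star)
```

The L-curve trades off data misfit against solution norm; its corner marks the lambda where further smoothing stops buying much residual reduction, which is what makes the parameter choice automatic.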
https://arxiv.org/abs/2404.11399