The scarcity of labeled data in real-world scenarios is a critical bottleneck for deep learning's effectiveness. Semi-supervised semantic segmentation has been a typical solution for achieving a desirable tradeoff between annotation cost and segmentation performance. However, previous approaches, whether based on consistency regularization or self-training, tend to neglect the contextual knowledge embedded in inter-pixel relations. This neglect leads to suboptimal performance and limited generalization. In this paper, we propose IPixMatch, a novel approach designed to mine the neglected but valuable Inter-Pixel information for semi-supervised learning. Specifically, IPixMatch is constructed as an extension of the standard teacher-student network, incorporating additional loss terms to capture inter-pixel relations. It shines in low-data regimes by efficiently leveraging the limited labeled data and extracting maximum utility from the available unlabeled data. Furthermore, IPixMatch can be integrated seamlessly into most teacher-student frameworks without the need for model modification or additional components. Our straightforward IPixMatch method demonstrates consistent performance improvements across various benchmark datasets under different partitioning protocols.
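The abstract does not spell out the loss, so the following is a hypothetical sketch of an inter-pixel term: the student is penalized when its pairwise pixel-feature similarities drift from the teacher's. The cosine-similarity formulation and mean-squared penalty are assumptions for illustration only.

```python
import numpy as np

def pairwise_similarity(feats):
    """Cosine similarity between every pair of pixel feature vectors.

    feats: (num_pixels, dim) array of per-pixel features.
    Returns a (num_pixels, num_pixels) similarity matrix.
    """
    norms = np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8
    unit = feats / norms
    return unit @ unit.T

def inter_pixel_loss(student_feats, teacher_feats):
    """Mean squared difference between the student and teacher
    pairwise similarity matrices (one hypothetical choice of
    inter-pixel consistency loss)."""
    s = pairwise_similarity(student_feats)
    t = pairwise_similarity(teacher_feats)
    return float(np.mean((s - t) ** 2))
```

The loss is zero when the two relation matrices agree, so it only constrains how pixels relate to one another, not the absolute features, which is what distinguishes it from a plain per-pixel consistency term.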
https://arxiv.org/abs/2404.18891
In this paper, we present a different way to use two modalities, in which a single model sees either one modality or the other. This can be useful when adapting a unimodal model to leverage more information while respecting a limited computational budget, since it yields a single model able to deal with any modality. To describe this, we coined the term anymodal learning. As an example, surveillance in a room when the lights are off is much more valuable using an infrared modality, while the visible modality provides more discriminative information when the lights are on. This work investigates how to efficiently leverage visible and infrared/thermal modalities in a transformer-based object detection backbone to create an anymodal architecture. Our work adds no inference overhead at test time while exploring an effective way to exploit the two modalities during training. To accomplish this, we introduce a novel anymodal training technique, Mixed Patches (MiPa), in conjunction with a patch-wise domain-agnostic module, which is responsible for learning the best way to find a common representation of both modalities. This approach proves able to balance the modalities, reaching results competitive with unimodal architectures on individual-modality benchmarks across three different visible-infrared object detection datasets. Finally, our proposed method, when used as a regularization for the strongest modality, can beat the performance of multimodal fusion methods while requiring only a single modality during inference. Notably, MiPa sets the state of the art on the LLVIP visible/infrared benchmark. Code: this https URL
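The Mixed Patches idea can be sketched on raw image grids. Note this is a simplification: the real method operates on transformer patch tokens, and the 50/50 per-patch sampling rate below is an assumption.

```python
import numpy as np

def mix_patches(visible, infrared, patch, rng):
    """Assemble a training input by choosing, for each non-overlapping
    patch location, the patch from either the visible or the infrared
    image (a sketch of the MiPa idea).

    visible, infrared: (H, W) arrays with H and W divisible by `patch`.
    """
    assert visible.shape == infrared.shape
    h, w = visible.shape
    out = np.empty_like(visible)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            # Flip a fair coin per patch location (assumed rate).
            src = visible if rng.random() < 0.5 else infrared
            out[i:i + patch, j:j + patch] = src[i:i + patch, j:j + patch]
    return out
```

Because every patch location carries exactly one modality, the detector is forced to handle both appearance statistics within a single forward pass, which is what lets it serve either modality alone at inference.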
https://arxiv.org/abs/2404.18849
Fine-tuning large language models (LLMs) through one or more phases has become a necessary step to unlock various capabilities, enabling LLMs to follow natural language instructions or align with human preferences. However, it carries the risk of catastrophic forgetting during sequential training: the parametric knowledge or abilities learned in previous stages may be overwhelmed by incoming training data. In this paper, we find that by regularly resetting partial parameters, LLMs can restore some of their original knowledge. Inspired by this, we introduce Half Fine-Tuning (HFT) for LLMs as a substitute for full fine-tuning (FFT) to mitigate the forgetting issue: half of the parameters are selected to learn new tasks while the other half are frozen to retain previous knowledge. We provide a feasibility analysis from the perspective of optimization and interpret the parameter selection operation as a regularization term. Without changing the model architecture, HFT can be seamlessly integrated into existing fine-tuning frameworks. Extensive experiments and analysis on supervised fine-tuning, direct preference optimization, and continual learning consistently demonstrate the effectiveness, robustness, and efficiency of HFT. Compared with FFT, HFT not only significantly alleviates the forgetting problem but also achieves the best performance on a series of downstream benchmarks, with an approximately 30% reduction in training time.
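The freeze-half idea can be sketched as a single update step. The random block selection below is a simplifying assumption; the paper analyzes its own selection scheme, which may differ.

```python
import numpy as np

def half_fine_tune_step(params, grads, lr, rng):
    """One sketch of an HFT-style update: freeze a randomly chosen
    half of the parameter blocks and apply a gradient step to the
    rest.

    params, grads: lists of equally many numpy arrays.
    """
    n = len(params)
    update_idx = set(rng.choice(n, size=n // 2, replace=False))
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        if i in update_idx:
            new_params.append(p - lr * g)   # half learns the new task
        else:
            new_params.append(p.copy())     # half frozen: retains old knowledge
    return new_params
```

The frozen half acts like an implicit regularizer anchoring the model to its previous state, which matches the paper's interpretation of parameter selection as a regularization term.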
https://arxiv.org/abs/2404.18466
Accurately simulating diverse behaviors of heterogeneous agents in various scenarios is fundamental to autonomous driving simulation. This task is challenging due to the multi-modality of the behavior distribution, the high dimensionality of driving scenarios, distribution shift, and incomplete information. Our first insight is to leverage state matching through differentiable simulation to provide meaningful learning signals and achieve efficient credit assignment for the policy. This is demonstrated by revealing the existence of gradient highways and inter-agent gradient pathways. However, the issues of gradient explosion and weak supervision in low-density regions are discovered. Our second insight is that these issues can be addressed by applying dual policy regularizations to narrow the function space. Further considering diversity, our third insight is that the behaviors of heterogeneous agents in the dataset can be effectively compressed as a series of prototype vectors for retrieval. These lead to our model-based reinforcement-imitation learning framework with temporally abstracted mixture-of-codebooks (MRIC). MRIC introduces an open-loop model-based imitation learning regularization to stabilize training, and a model-based reinforcement learning (RL) regularization to inject domain knowledge. The RL regularization involves differentiable Minkowski-difference-based collision avoidance and projection-based on-road and traffic-rule compliance rewards. A dynamic multiplier mechanism is further proposed to eliminate interference from the regularizations while ensuring their effectiveness. Experimental results on the large-scale Waymo open motion dataset show that MRIC outperforms state-of-the-art baselines on diversity, behavioral realism, and distributional realism, with large margins on some key metrics (e.g., collision rate, minSADE, and time-to-collision JSD).
https://arxiv.org/abs/2404.18464
Multi-task learning (MTL) is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to single-task learning (STL), MTL offers a suite of benefits that enhance both the training process and inference efficiency. MTL's key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including computer vision (CV), natural language processing (NLP), recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pre-trained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into the specialized strategies within each category. Furthermore, the survey reveals how MTL evolves from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and task-agnostic training, along with the capacity for zero-shot learning (ZSL), which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research in a broad manner. This project is publicly available at this https URL.
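For readers new to the paradigm, the classic hard-parameter-sharing setup that much of the surveyed work builds on can be sketched in a few lines. The layer sizes and task names below are arbitrary illustrations, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hard parameter sharing: one shared trunk, one small head per task.
W_shared = rng.normal(size=(16, 8))            # trunk weights (shared by all tasks)
heads = {"task_a": rng.normal(size=(8, 3)),    # per-task head weights
         "task_b": rng.normal(size=(8, 5))}

def forward(x, task):
    """Shared representation followed by the task-specific head."""
    h = np.maximum(x @ W_shared, 0.0)          # shared ReLU trunk
    return h @ heads[task]

x = rng.normal(size=(2, 16))
out_a = forward(x, "task_a")                   # shape (2, 3)
out_b = forward(x, "task_b")                   # shape (2, 5)
```

Gradients from every task flow into `W_shared`, which is the mechanism behind the streamlined-architecture and cross-task transfer benefits the survey describes.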
https://arxiv.org/abs/2404.18961
Multi-robot simultaneous localization and mapping (SLAM) enables a robot team to achieve coordinated tasks relying on a common map. However, centralized processing of robot observations is undesirable because it creates a single point of failure and requires pre-existing infrastructure and significant multi-hop communication throughput. This paper formulates multi-robot object SLAM as a variational inference problem over a communication graph. We impose a consensus constraint on the objects maintained by different nodes to ensure agreement on a common map. To solve the problem, we develop a distributed mirror descent algorithm with a regularization term enforcing consensus. Using Gaussian distributions in the algorithm, we derive a distributed multi-state constraint Kalman filter (MSCKF) for multi-robot object SLAM. Experiments on real and simulated data show that our method improves the trajectory and object estimates, compared to individual-robot SLAM, while achieving better scaling to large robot teams, compared to centralized multi-robot SLAM. Code is available at this https URL.
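A toy sketch of the consensus regularization idea follows. This is not the paper's MSCKF: the measurement update is omitted and the plain neighbor-averaging rule with step size `rho` is an illustrative assumption, showing only how a consensus term pulls per-robot object estimates toward agreement over the communication graph.

```python
import numpy as np

def consensus_step(estimates, adjacency, rho):
    """One round of consensus regularization: each robot moves its
    object-position estimate toward the average of its neighbors'
    estimates over the communication graph.

    estimates: (num_robots, dim) per-robot estimates of one object.
    adjacency: (num_robots, num_robots) 0/1 communication graph.
    rho:       consensus step size in (0, 1).
    """
    new = estimates.copy()
    for i in range(len(estimates)):
        nbrs = np.nonzero(adjacency[i])[0]
        if len(nbrs) == 0:
            continue  # isolated robot keeps its own estimate
        avg = estimates[nbrs].mean(axis=0)
        new[i] = (1 - rho) * estimates[i] + rho * avg
    return new
```

Iterating this step on a connected graph shrinks the disagreement between robots while preserving the team average, which is the behavior the consensus constraint enforces alongside each robot's local measurement updates.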
https://arxiv.org/abs/2404.18331
This paper introduces the Align, Minimize and Diversify (AMD) method, a source-free unsupervised domain adaptation approach for Handwritten Text Recognition (HTR). The framework decouples the adaptation process from the source data, thus not only sidestepping the resource-intensive retraining process but also making it possible to leverage the wealth of pre-trained knowledge encoded in modern deep learning architectures. Our method explicitly eliminates the need to revisit the source data during adaptation by incorporating three distinct regularization terms: the Align term, which reduces the feature distribution discrepancy between source and target data, ensuring the transferability of the pre-trained representation; the Minimize term, which encourages the model to make assertive predictions, pushing the outputs towards one-hot-like distributions in order to minimize prediction uncertainty; and finally, the Diversify term, which safeguards against degeneracy in predictions by promoting varied and distinctive sequences throughout the target data, preventing informational collapse. Experimental results on several benchmarks demonstrate the effectiveness and robustness of AMD, showing it to be competitive with, and often outperforming, domain adaptation (DA) methods in HTR.
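The Minimize and Diversify terms correspond to standard entropy-based losses, sketched below under that assumption. The paper applies them to HTR sequence outputs, which adds detail omitted here; this shows only the per-class-distribution form.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy of probability vectors (small eps for stability)."""
    return -np.sum(p * np.log(p + 1e-12), axis=axis)

def minimize_term(probs):
    """Mean per-sample prediction entropy: low when outputs are
    one-hot-like, i.e. assertive."""
    return float(np.mean(entropy(probs)))

def diversify_term(probs):
    """Negative entropy of the batch-average prediction: low when
    predictions are spread across classes, guarding against collapse
    onto a single output. (A common formulation; the paper's exact
    sequence-level term may differ.)"""
    return float(-entropy(probs.mean(axis=0)))
```

Minimizing both terms together pushes each prediction toward a confident class while forcing different samples toward different classes, which is exactly the assertive-but-not-degenerate balance the abstract describes.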
https://arxiv.org/abs/2404.18260
Continual learning (CL) remains one of the long-standing challenges for deep neural networks due to catastrophic forgetting of previously acquired knowledge. Although rehearsal-based approaches have been fairly successful in mitigating catastrophic forgetting, they suffer from overfitting on buffered samples and prior information loss, hindering generalization under low-buffer regimes. Inspired by how humans learn using strong inductive biases, we propose IMEX-Reg to improve the generalization performance of experience rehearsal in CL under low-buffer regimes. Specifically, we employ a two-pronged implicit-explicit regularization approach using contrastive representation learning (CRL) and consistency regularization. To further leverage the global relationship between representations learned using CRL, we propose a regularization strategy to guide the classifier toward the activation correlations in the unit hypersphere of the CRL. Our results show that IMEX-Reg significantly improves generalization performance and outperforms rehearsal-based approaches in several CL scenarios. It is also robust to natural and adversarial corruptions and exhibits less task-recency bias. Additionally, we provide theoretical insights to further support our design decisions.
https://arxiv.org/abs/2404.18161
In today's world, image processing plays a crucial role across various fields, from scientific research to industrial applications. One particularly exciting application is image captioning. The potential impact of effective image captioning is vast. It can significantly boost the accuracy of search engines, making it easier to find relevant information. Moreover, it can greatly enhance accessibility for visually impaired individuals, providing them with a more immersive experience of digital content. However, despite its promise, image captioning presents several challenges. One major hurdle is extracting meaningful visual information from images and transforming it into coherent language. This requires bridging the gap between the visual and linguistic domains, a task that demands sophisticated algorithms and models. Our project addresses these challenges by developing an automatic image captioning architecture that combines the strengths of convolutional neural networks (CNNs) and encoder-decoder models. The CNN is used to extract visual features from images, and captions are then generated with the help of the encoder-decoder framework. We also carried out a performance comparison of pre-trained CNN models, experimenting with multiple architectures to understand their performance variations. In our quest for optimization, we further explored the integration of frequency regularization techniques to compress the AlexNet and EfficientNetB0 models, aiming to see whether the compressed models could remain effective at generating image captions while being more resource-efficient.
https://arxiv.org/abs/2404.18062
Unpaired image dehazing (UID) holds significant research importance due to the challenge of acquiring haze/clear image pairs with identical backgrounds. This paper proposes a novel method for UID named Orthogonal Decoupling Contrastive Regularization (ODCR). Our method is grounded in the assumption that an image consists of both haze-related features, which influence the degree of haze, and haze-unrelated features, such as texture and semantic information. ODCR aims to ensure that the haze-related features of the dehazing result closely resemble those of the clear image, while the haze-unrelated features align with the input hazy image. To accomplish this, we propose Orthogonal MLPs optimized geometrically on the Stiefel manifold, which project image features into an orthogonal space, thereby reducing the relevance between different features. Furthermore, a task-driven Depth-wise Feature Classifier (DWFC) is proposed, which assigns weights to the orthogonal features based on each channel's contribution to predicting, in a self-supervised fashion, whether the feature source is hazy or clear. Finally, a Weighted PatchNCE (WPNCE) loss is introduced to pull the haze-related features of the output image toward those of clear images, while bringing the haze-unrelated features close to those of the hazy input. Extensive experiments demonstrate the superior performance of our ODCR method on UID.
https://arxiv.org/abs/2404.17825
This paper introduces Least Volume, a simple yet effective regularization inspired by geometric intuition that can reduce the number of latent dimensions needed by an autoencoder without requiring any prior knowledge of the intrinsic dimensionality of the dataset. We show that the Lipschitz continuity of the decoder is the key to making it work, prove that PCA is just a linear special case of it, and reveal that it has a similar PCA-like importance-ordering effect when applied to nonlinear models. We demonstrate the intuition behind the regularization on some pedagogical toy problems, and its effectiveness on several benchmark problems, including MNIST, CIFAR-10 and CelebA.
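A minimal sketch of a Least-Volume-style penalty, assuming the geometric reading of the abstract: shrink the "volume" of the latent set, measured here as the geometric mean of per-dimension latent standard deviations. The small offset `eta` is an assumed stabilizer; the paper's exact loss and its pairing with a Lipschitz-constrained decoder are more involved.

```python
import numpy as np

def least_volume_penalty(latents, eta=0.01):
    """Geometric mean of the per-dimension standard deviations of a
    batch of latent codes. Driving this down flattens the latent set
    along dimensions the decoder does not need, pruning them.

    latents: (batch, latent_dim) array of encoder outputs.
    """
    sigma = latents.std(axis=0)
    # Geometric mean computed in log space for numerical stability.
    return float(np.exp(np.mean(np.log(sigma + eta))))
```

Because the geometric mean rewards driving any single dimension's spread toward zero, the penalty naturally produces the PCA-like importance ordering the abstract mentions: dimensions the data does not need collapse first.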
https://arxiv.org/abs/2404.17773
In class-incremental learning, neural networks typically suffer from catastrophic forgetting. We show that an MLP featuring a sparse activation function and an adaptive learning rate optimizer can compete with established regularization techniques on the Split-MNIST task. We highlight the effectiveness of the Adaptive SwisH (ASH) activation function in this context and introduce a novel variant, Hard Adaptive SwisH (Hard ASH), to further enhance learning retention.
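For reference, generic SwisH and a hardened piecewise-linear variant look as follows. These are the standard swish/hard-swish forms, not the paper's exact ASH and Hard ASH parameterizations, which adapt the activation during training; the sketch only shows how a "hard" variant swaps the sigmoid gate for a cheap clipped-linear gate.

```python
import numpy as np

def swish(x, beta=1.0):
    """Generic SwisH: x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

def hard_swish(x):
    """'Hard' variant: the sigmoid gate is replaced by a clipped
    piecewise-linear gate, trading smoothness for cheaper compute
    and exact zeros on strongly negative inputs."""
    gate = np.clip((x + 3.0) / 6.0, 0.0, 1.0)
    return x * gate
```

The exact zeros produced by the hard gate give sparser activations than the smooth form, and sparse activations are precisely what the abstract credits with reducing interference between tasks.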
https://arxiv.org/abs/2404.17651
Multi-view learning has become a popular research topic in recent years, but research on the cross-application of classic multi-label classification and multi-view learning is still in its early stages. In this paper, we focus on the complex yet highly realistic task of incomplete multi-view weak multi-label learning and propose a masked two-channel decoupling framework based on deep neural networks to solve this problem. The core innovation of our method lies in decoupling the single-channel view-level representation, which is common in deep multi-view learning methods, into a shared representation and a view-proprietary representation. We also design a cross-channel contrastive loss to enhance the semantic property of the two channels. Additionally, we exploit supervised information to design a label-guided graph regularization loss, helping the extracted embedding features preserve the geometric structure among samples. Inspired by the success of masking mechanisms in image and text analysis, we develop a random fragment masking strategy for vector features to improve the learning ability of encoders. Finally, it is important to emphasize that our model is fully adaptable to arbitrary view and label absences while also performing well on the ideal full data. We have conducted sufficient and convincing experiments to confirm the effectiveness and advancement of our model.
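The random fragment masking strategy for vector features might be sketched like this. Fragment length and placement conventions are assumptions here, not the paper's exact recipe; the point is only that a contiguous run of a feature vector is zeroed during training to strengthen the encoders.

```python
import numpy as np

def mask_fragment(vec, frac, rng):
    """Zero out one contiguous fragment covering roughly `frac` of a
    feature vector. Returns the masked copy and the (start, end)
    indices of the fragment; the input vector is left untouched.
    """
    n = len(vec)
    length = max(1, int(round(frac * n)))
    start = int(rng.integers(0, n - length + 1))
    out = vec.copy()
    out[start:start + length] = 0.0
    return out, (start, start + length)
```

Like token masking in text, forcing the encoder to work from a vector with a missing fragment pushes it to exploit redundancy across the remaining dimensions, which is the learning-ability boost the abstract claims.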
https://arxiv.org/abs/2404.17340
Pre-trained large text-to-image (T2I) models guided by an appropriate text prompt have attracted growing interest in the field of customized image generation. However, the catastrophic forgetting issue makes it hard to continually synthesize new user-provided styles while retaining satisfying results on previously learned styles. In this paper, we propose MuseumMaker, a method that enables the synthesis of images following a set of customized styles in a never-ending manner, gradually accumulating these creative artistic works as a museum. When faced with a new customization style, we develop a style distillation loss module to transfer the style of the whole dataset into the generated images. It can minimize the learning biases caused by image content and address the catastrophic overfitting issue induced by few-shot images. To deal with catastrophic forgetting among past learned styles, we devise a dual regularization for the shared-LoRA module to optimize the direction of the model update, regularizing the diffusion model from both the weight and feature aspects. Meanwhile, a unique token embedding corresponding to the new style is learned by a task-wise token learning module, which preserves historical knowledge from past styles within the limits of the LoRA parameter budget. As new user-provided styles arrive, our MuseumMaker can capture the nuances of the new styles while maintaining the details of the learned styles. Experimental results on diverse style datasets validate the effectiveness of our proposed MuseumMaker method, showcasing its robustness and versatility across various scenarios.
https://arxiv.org/abs/2404.16612
Maintaining temporal stability is crucial in multi-agent trajectory prediction. Insufficient regularization to uphold this stability often results in fluctuations in kinematic states, leading to inconsistent predictions and the amplification of errors. In this study, we introduce a framework called Multi-Agent Trajectory prediction via neural interaction Energy (MATE). This framework assesses the interactive motion of agents by employing neural interaction energy, which captures the dynamics of interactions and illustrates their influence on the future trajectories of agents. To bolster temporal stability, we introduce two constraints: inter-agent interaction constraint and intra-agent motion constraint. These constraints work together to ensure temporal stability at both the system and agent levels, effectively mitigating prediction fluctuations inherent in multi-agent systems. Comparative evaluations against previous methods on four diverse datasets highlight the superior prediction accuracy and generalization capabilities of our model.
https://arxiv.org/abs/2404.16579
Model-free reinforcement learning methods lack an inherent mechanism to impose behavioural constraints on the trained policies. While certain extensions exist, they remain limited to specific types of constraints, such as value constraints with additional reward signals or visitation density constraints. In this work we try to unify these existing techniques and bridge the gap with classical optimization and control theory, using a generic primal-dual framework for value-based and actor-critic reinforcement learning methods. The obtained dual formulations turn out to be especially useful for imposing additional constraints on the learned policy, as an intrinsic relationship between such dual constraints (or regularization terms) and reward modifications in the primal is revealed. Furthermore, using this framework, we are able to introduce some novel types of constraints, allowing us to impose bounds on the policy's action density or on costs associated with transitions between consecutive states and actions. From the adjusted primal-dual optimization problems, a practical algorithm is derived that supports various combinations of policy constraints that are automatically handled throughout training using trainable reward modifications. The resulting $\texttt{DualCRL}$ method is examined in more detail and evaluated under different (combinations of) constraints on two interpretable environments. The results highlight the efficacy of the method, which ultimately provides the designer of such systems with a versatile toolbox of possible policy constraints.
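The core primal-dual mechanics, a constraint entering the primal as a trainable reward modification while its multiplier is raised by dual ascent, can be sketched on a toy example. The function below is a classic Lagrangian-relaxation illustration with a fixed behavior, not the DualCRL algorithm itself, which learns the policy jointly.

```python
import numpy as np

def dual_ascent_reward(rewards, costs, budget, lr, steps):
    """Sketch of the primal-dual idea for a constraint
    'average cost <= budget': the primal sees the modified reward
    r - lam * c, while the multiplier lam is raised by dual ascent
    whenever the constraint is violated.

    rewards, costs: per-transition arrays from a fixed behavior.
    """
    lam = 0.0
    for _ in range(steps):
        violation = np.mean(costs) - budget
        lam = max(0.0, lam + lr * violation)  # projected dual ascent, lam >= 0
    modified = rewards - lam * costs          # constraint-shaped primal reward
    return modified, lam
```

When the behavior already satisfies the budget, the multiplier stays at zero and the reward is untouched; when it violates the budget, the growing multiplier increasingly penalizes costly transitions, which is the reward-modification/dual-constraint relationship the abstract refers to.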
https://arxiv.org/abs/2404.16468
This paper addresses the problem of converting real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though this problem could be tackled by a wide range of image-to-image translation models, a notable issue with all these methods is that the original image content details can easily be erased or corrupted by the transfer of ink-wash style elements. To solve or ameliorate this issue, we propose to incorporate saliency detection into the unpaired image-to-image translation framework to regularize the content information of the generated paintings. The saliency map is utilized for content regularization from two aspects, both explicitly and implicitly: (i) we propose a saliency IOU (SIOU) loss to explicitly regularize saliency consistency before and after stylization; (ii) we propose saliency adaptive normalization (SANorm), which implicitly enhances the content integrity of the generated paintings by injecting saliency information into the generator network to guide painting generation. Besides, we also propose a saliency-attended discriminator network that harnesses the saliency mask to focus generative adversarial attention onto salient image regions, which contributes to producing a finer ink-wash stylization effect for the salient objects of images. Qualitative and quantitative experiments consistently demonstrate the superiority of our model over related advanced methods for Chinese ink-wash painting style transfer.
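A soft-IoU reading of the SIOU term can be sketched as follows. The exact SIOU definition is in the paper; the soft-IoU form here (elementwise min/max over continuous saliency maps) is an assumption chosen because it is differentiable almost everywhere.

```python
import numpy as np

def siou_loss(sal_before, sal_after, eps=1e-8):
    """Saliency-IOU-style loss: one minus the soft IoU between the
    saliency map of the input photo and that of the stylized
    painting, so the loss is 0 when the two maps agree and 1 when
    they are disjoint.

    sal_before, sal_after: arrays with values in [0, 1].
    """
    inter = np.minimum(sal_before, sal_after).sum()
    union = np.maximum(sal_before, sal_after).sum() + eps
    return float(1.0 - inter / union)
```

Penalizing this term during training discourages the generator from letting ink-wash strokes wash out the salient subject, since doing so would shift the painting's saliency map away from the photo's.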
https://arxiv.org/abs/2404.15743
This paper studies the interpretability of convolutional networks by means of saliency maps. Most approaches based on Class Activation Maps (CAM) combine information from fully connected layers and gradients through variants of backpropagation. However, it is well understood that gradients are noisy, and alternatives like guided backpropagation have been proposed to obtain better visualizations at inference. In this work, we present a novel training approach to improve the quality of gradients for interpretability. In particular, we introduce a regularization loss such that the gradient with respect to the input image obtained by standard backpropagation is similar to the gradient obtained by guided backpropagation. We find that the resulting gradient is qualitatively less noisy and quantitatively improves the interpretability properties of different networks, as measured by several interpretability methods.
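The two gradients that the proposed loss aligns differ only in how they pass through a ReLU; a minimal sketch of that difference at a single layer (the variable names are illustrative):

```python
import numpy as np

def relu_backprop(grad_out, pre_activation):
    """Standard ReLU gradient: pass the incoming gradient wherever
    the forward input was positive."""
    return grad_out * (pre_activation > 0)

def guided_backprop(grad_out, pre_activation):
    """Guided backpropagation additionally zeroes locations where the
    incoming gradient itself is negative, suppressing the negative
    contributions that make raw saliency maps noisy."""
    return grad_out * (pre_activation > 0) * (grad_out > 0)
```

The paper's regularizer trains the network so that the standard gradient already resembles the guided one, so the cleaner visualization no longer requires a modified backward pass at inference.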
https://arxiv.org/abs/2404.15024
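The regularization loss described above aligns two input-gradient maps. As a minimal sketch, assuming the similarity measure is cosine distance (the abstract does not specify the exact measure, and `grad_alignment_loss` is a hypothetical name), the penalty given a standard-backprop gradient and a guided-backprop gradient could look like:

```python
import numpy as np

def grad_alignment_loss(g_std, g_guided, eps=1e-8):
    """Penalize misalignment between the input gradient from standard
    backpropagation and the one from guided backpropagation.

    Uses cosine distance: 0 when the gradients point the same way,
    up to 2 when they are exactly opposed.
    """
    g1 = g_std.ravel()
    g2 = g_guided.ravel()
    cos = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2) + eps)
    return 1.0 - cos
```

In a real training loop the guided-backprop gradient would be obtained by masking negative gradients at each ReLU (e.g., via backward hooks in an autodiff framework), and this term would be weighted against the task loss.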
In the field of Artificial Intelligence for Information Technology Operations, causal discovery is pivotal for constructing operation and maintenance graphs, facilitating downstream industrial tasks such as root cause analysis. Temporal causal discovery, as an emerging method, aims to identify temporal causal relationships between variables directly from observations by utilizing interventional data. However, existing methods mainly focus on synthetic datasets, rely heavily on intervention targets, and ignore the textual information hidden in real-world systems, failing to conduct causal discovery in real industrial scenarios. To tackle this problem, in this paper we investigate temporal causal discovery in industrial scenarios, which faces two critical challenges: 1) how to discover causal relationships without the interventional targets that are costly to obtain in practice, and 2) how to discover causal relations by leveraging the textual information in systems, which can be complex yet abundant in industrial contexts. To address these challenges, we propose the RealTCD framework, which is able to leverage domain knowledge to discover temporal causal relationships without interventional targets. Specifically, we first develop a score-based temporal causal discovery method capable of discovering causal relations for root cause analysis without relying on interventional targets, through strategic masking and regularization. Furthermore, by employing Large Language Models (LLMs) to handle texts and integrate domain knowledge, we introduce LLM-guided meta-initialization to extract meta-knowledge from the textual information hidden in systems to boost the quality of discovery. We conduct extensive experiments on simulation and real-world datasets to show the superiority of our proposed RealTCD framework over existing baselines in discovering temporal causal structures.
https://arxiv.org/abs/2404.14786
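To give a concrete flavor of what "score-based temporal causal discovery with regularization" means, here is a deliberately simplified sketch: scoring a candidate lag-1 causal adjacency matrix on a multivariate time series by one-step-ahead prediction error plus an L1 sparsity penalty. This is a generic stand-in, not RealTCD's actual score, masking scheme, or regularizers, and the function name is hypothetical.

```python
import numpy as np

def score_lagged_graph(X, A, lam=0.1):
    """Score a candidate lag-1 causal adjacency A for a multivariate
    time series X of shape (T, d): lower is better.

    Edge A[i, j] means variable i at time t-1 influences variable j
    at time t. The score is the mean squared one-step-ahead residual
    plus an L1 penalty encouraging a sparse causal graph.
    """
    pred = X[:-1] @ A          # one-step-ahead prediction via lagged edges
    resid = X[1:] - pred       # what the candidate graph fails to explain
    return np.mean(resid ** 2) + lam * np.abs(A).sum()
```

A score-based discovery method searches over (continuous or discrete) candidate graphs to minimize such an objective; RealTCD additionally uses strategic masking to cope with the absence of interventional targets.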
This paper introduces Q-tuning, a novel approach for continual prompt tuning that enables the lifelong learning of a pre-trained language model. When learning a new task, Q-tuning trains a task-specific prompt by adding it to a prompt queue consisting of the prompts from older tasks. To better transfer the knowledge of old tasks, we design an adaptive knowledge aggregation technique that reweighs previous prompts in the queue with a learnable low-rank matrix. Once the prompt queue reaches its maximum capacity, we leverage a PCA-based eviction rule to reduce the queue's size, allowing the newly trained prompt to be added while preserving the primary knowledge of old tasks. In order to mitigate the accumulation of information loss caused by the eviction, we additionally propose a globally shared prefix prompt and a memory retention regularization based on information theory. Extensive experiments demonstrate that our approach outperforms the state-of-the-art methods substantially on continual prompt tuning benchmarks. Moreover, our approach enables lifelong learning on linearly growing task sequences while requiring constant complexity for training and inference.
https://arxiv.org/abs/2404.14607
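The PCA-based eviction rule can be sketched as follows: when the queue is full, compress it into a smaller set of surrogate prompts spanning the queue's principal directions, freeing a slot for the new prompt. This is one plausible reading of the abstract, with a hypothetical function name; the paper's exact eviction procedure may differ.

```python
import numpy as np

def evict_prompts_pca(queue, keep):
    """Compress a full prompt queue via PCA.

    queue: (n_prompts, dim) array of flattened prompt vectors.
    keep:  number of surrogate prompts to retain (keep < n_prompts).

    Returns (keep, dim) surrogate prompts built from the top `keep`
    principal components, preserving most of the queue's variance.
    """
    mean = queue.mean(axis=0, keepdims=True)
    # SVD of the centered queue; rows of vt are principal directions
    _, s, vt = np.linalg.svd(queue - mean, full_matrices=False)
    # Scale each retained direction by its singular value and re-add the mean
    return mean + s[:keep, None] * vt[:keep]
```

After eviction, the newly trained task prompt would be appended to the reduced queue, keeping the queue size, and hence training/inference cost, bounded as the task sequence grows.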