Image fusion plays a key role in a variety of multi-sensor-based vision systems, especially for enhancing visual quality and/or extracting aggregated features for perception. However, most existing methods treat image fusion as an isolated task, ignoring its underlying relationship with downstream vision problems. Furthermore, designing proper fusion architectures often requires huge engineering labor, and current approaches lack mechanisms to improve their flexibility and generalization ability. To mitigate these issues, we establish a Task-guided, Implicit-searched and Meta-initialized (TIM) deep model to address the image fusion problem in challenging real-world scenarios. Specifically, we first propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion. Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency. In addition, a pretext meta initialization technique is introduced to leverage divergent fusion data to support fast adaptation to different kinds of image fusion tasks. Qualitative and quantitative experimental results on different categories of image fusion problems and related downstream tasks (e.g., visual enhancement and semantic understanding) substantiate the flexibility and effectiveness of our TIM. The source code will be available at this https URL.
https://arxiv.org/abs/2305.15862
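To make the task-guidance idea above concrete, here is a minimal sketch (ours, not the authors' code) of an objective that couples an unsupervised fusion term with a downstream-task term; the specific loss forms, the `alpha` weight, and all function names are assumptions for illustration only.

```python
# Hypothetical sketch: an unsupervised fusion loss regularized by a downstream task loss.
import torch
import torch.nn.functional as F

def task_guided_fusion_loss(fused, ir, vis, task_logits, task_labels, alpha=0.1):
    """Combine an unsupervised fusion term with a downstream-task term.

    fused, ir, vis : (B, 1, H, W) fused / infrared / visible images
    task_logits    : predictions of a downstream network fed with `fused`
    alpha          : weight of the task-guidance term (assumed hyper-parameter)
    """
    # Unsupervised fusion term: keep the fused image close to the stronger source signal.
    intensity_loss = F.l1_loss(fused, torch.maximum(ir, vis))
    # Task-guidance term: gradients from the downstream task flow back into the fusion network.
    task_loss = F.cross_entropy(task_logits, task_labels)
    return intensity_loss + alpha * task_loss

if __name__ == "__main__":
    fused = torch.rand(2, 1, 32, 32, requires_grad=True)
    ir, vis = torch.rand(2, 1, 32, 32), torch.rand(2, 1, 32, 32)
    logits = torch.randn(2, 5, requires_grad=True)
    labels = torch.tensor([1, 3])
    loss = task_guided_fusion_loss(fused, ir, vis, logits, labels)
    loss.backward()
    print(loss.item())
```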
Many existing multi-modality studies are based on the assumption of modality integrity. However, the problem of missing arbitrary modalities is very common in real life, and although it is less studied, it is actually important for multi-modality person re-identification (Re-ID). To this end, we design a novel dynamic enhancement network (DENet), which allows missing arbitrary modalities while maintaining the representation ability of multiple modalities, for partial multi-modality person Re-ID. To be specific, the multi-modal representation of the RGB, near-infrared (NIR) and thermal-infrared (TIR) images is learned by three branches, in which the information of missing modalities is recovered by a feature transformation module. Since the missing state may change, we design a dynamic enhancement module, which adaptively enhances modality features according to the missing state, to improve the multi-modality representation. Extensive experiments on the multi-modality person Re-ID dataset RGBNT201 and the vehicle Re-ID dataset RGBNT100, compared against state-of-the-art methods, verify the effectiveness of our method in complex and changeable environments.
https://arxiv.org/abs/2305.15762
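Below is a rough illustration of how a partial multi-modality model can recover a missing branch from an available one and re-weight branches by the missing state; the module names, the linear recovery heads, and the fixed weights are placeholders, not DENet's actual design.

```python
# Illustrative sketch (not the authors' code): recovering a missing modality feature
# from an available one and re-weighting branches according to the missing state.
import torch
import torch.nn as nn

class PartialMultiModalFeatures(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # One hypothetical feature-transformation head per (source -> target) recovery path.
        self.rgb_to_nir = nn.Linear(dim, dim)
        self.rgb_to_tir = nn.Linear(dim, dim)

    def forward(self, rgb, nir, tir, missing):
        """missing: dict of bools, e.g. {"nir": True, "tir": False}."""
        if missing.get("nir"):
            nir = self.rgb_to_nir(rgb)          # recover NIR features from RGB
        if missing.get("tir"):
            tir = self.rgb_to_tir(rgb)          # recover TIR features from RGB
        feats = torch.stack([rgb, nir, tir], dim=1)
        # Dynamic enhancement (simplified): down-weight recovered branches, emphasize observed ones.
        weights = torch.tensor([1.0,
                                0.5 if missing.get("nir") else 1.0,
                                0.5 if missing.get("tir") else 1.0])
        return (feats * weights.view(1, 3, 1)).sum(dim=1)

model = PartialMultiModalFeatures()
rgb = torch.rand(4, 256)
out = model(rgb, nir=torch.zeros(4, 256), tir=torch.rand(4, 256), missing={"nir": True})
print(out.shape)  # torch.Size([4, 256])
```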
Acquiring high-quality data for training discriminative models is a crucial yet challenging aspect of building effective predictive systems. In this paper, we present Diffusion Inversion, a simple yet effective method that leverages the pre-trained generative model, Stable Diffusion, to generate diverse, high-quality training data for image classification. Our approach captures the original data distribution and ensures data coverage by inverting images to the latent space of Stable Diffusion, and generates diverse novel training images by conditioning the generative model on noisy versions of these vectors. We identify three key components that allow our generated images to successfully supplant the original dataset, leading to a 2-3x enhancement in sample complexity and a 6.5x decrease in sampling time. Moreover, our approach consistently outperforms generic prompt-based steering methods and KNN retrieval baseline across a wide range of datasets. Additionally, we demonstrate the compatibility of our approach with widely-used data augmentation techniques, as well as the reliability of the generated data in supporting various neural architectures and enhancing few-shot learning.
https://arxiv.org/abs/2305.15316
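The invert-then-perturb idea can be sketched with a generic autoencoder stand-in: embed real images into a latent space, perturb the latents with noise, and decode new training samples. The actual method conditions Stable Diffusion's denoising process on noisy inverted latents; the toy encoder and decoder here are purely illustrative.

```python
# Conceptual sketch of the invert-then-perturb idea with a generic autoencoder stand-in;
# the paper uses Stable Diffusion's latent space and its denoising process instead.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))   # stand-in for inversion
decoder = nn.Sequential(nn.Linear(64, 3 * 32 * 32), nn.Unflatten(1, (3, 32, 32)))

def generate_variants(images, n_variants=4, noise_scale=0.1):
    """Invert images to latents, then sample perturbed latents to create new training images."""
    with torch.no_grad():
        latents = encoder(images)                       # "inversion" step
        variants = []
        for _ in range(n_variants):
            noisy = latents + noise_scale * torch.randn_like(latents)
            variants.append(decoder(noisy))             # conditioned regeneration step
    return torch.stack(variants, dim=1)                 # (B, n_variants, C, H, W)

imgs = torch.rand(8, 3, 32, 32)
print(generate_variants(imgs).shape)  # torch.Size([8, 4, 3, 32, 32])
```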
Large-scale pretrained language models (LLMs), such as ChatGPT and GPT-4, have shown strong abilities in multilingual translation without being explicitly trained on parallel corpora. It is interesting to ask how LLMs obtain their ability to carry out translation instructions for different languages. In this paper, we present a detailed analysis by finetuning a multilingual pretrained language model, XGLM-7B, to perform multilingual translation following given instructions. Firstly, we show that multilingual LLMs have stronger translation abilities than previously demonstrated. For a given language pair, the performance depends on both the language families and the amount of data used in the pretraining phase. Secondly, we find that LLMs' ability to carry out translation instructions relies on their understanding of the translation instruction and on the alignment among different languages. With proper enhancement, LLMs can perform the translation task well even for language pairs unseen during the instruction tuning phase.
https://arxiv.org/abs/2305.15083
Learning-based image compression methods have made great progress. Most of them are designed for generic natural images. In fact, low-light images frequently occur due to unavoidable environmental influences or technical limitations, such as insufficient lighting or limited exposure time. Once low-light images are compressed by existing general image compression approaches, useful information (e.g., texture details) is lost, resulting in a dramatic performance decrease in low-light image enhancement. To simultaneously achieve a higher compression rate and better enhancement performance for low-light images, we propose a novel image compression framework with joint optimization of low-light image enhancement. We design an end-to-end trainable two-branch architecture with lower computational cost, which includes a main enhancement branch and a signal-to-noise ratio (SNR) aware branch. Experimental results show that our proposed joint optimization framework achieves a significant improvement over existing "Compress before Enhance" or "Enhance before Compress" sequential solutions for low-light images. Source codes are included in the supplementary material.
https://arxiv.org/abs/2305.15030
Audio-visual speech enhancement (AV-SE) aims to enhance degraded speech with the help of extra visual information such as lip videos, and has been shown to be more effective than audio-only speech enhancement. This paper proposes further incorporating ultrasound tongue images to improve the performance of lip-based AV-SE systems. Knowledge distillation is employed at the training stage to address the challenge of acquiring ultrasound tongue images during inference, enabling an audio-lip speech enhancement student model to learn from a pre-trained audio-lip-tongue speech enhancement teacher model. Experimental results demonstrate significant improvements in the quality and intelligibility of the speech enhanced by the proposed method compared to traditional audio-lip speech enhancement baselines. Further analysis using the phone error rate (PER) of automatic speech recognition (ASR) shows that palatal and velar consonants benefit most from the introduction of ultrasound tongue images.
https://arxiv.org/abs/2305.14933
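A minimal sketch of the distillation setup, assuming spectrogram-domain outputs: the audio-lip student regresses the clean target while imitating a frozen audio-lip-tongue teacher, so tongue images are needed only at training time. Loss forms and the `beta` weight are illustrative guesses.

```python
# Rough sketch of the distillation idea described above; shapes and names are placeholders.
import torch
import torch.nn.functional as F

def distillation_se_loss(student_out, teacher_out, clean_target, beta=0.5):
    """student_out / teacher_out / clean_target: enhanced-speech spectrograms (B, T, F).

    The student (audio + lip) regresses the clean target while also imitating the
    teacher (audio + lip + ultrasound tongue), which is frozen during student training.
    """
    enhancement_loss = F.mse_loss(student_out, clean_target)
    distill_loss = F.mse_loss(student_out, teacher_out.detach())  # teacher provides soft targets
    return enhancement_loss + beta * distill_loss

student = torch.rand(2, 100, 257, requires_grad=True)
teacher = torch.rand(2, 100, 257)
clean = torch.rand(2, 100, 257)
print(distillation_se_loss(student, teacher, clean).item())
```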
Large language models (LLMs) have demonstrated remarkable performance in a range of natural language understanding and generation tasks. Yet, their ability to generate counterfactuals, which can be used for areas like data augmentation, remains under-explored. This study aims to investigate the counterfactual generation capabilities of LLMs and to analyze the factors that influence this ability. First, we evaluate how effective LLMs are at counterfactual generation through data augmentation experiments for small language models (SLMs) across four tasks: sentiment analysis, natural language inference, named entity recognition, and relation extraction. While LLMs show promising enhancements in various settings, they struggle in complex tasks due to their self-limitations and the lack of logical guidance to produce counterfactuals that align with commonsense. Second, our analysis reveals the pivotal role of providing accurate task definitions and detailed step-by-step instructions to LLMs in generating counterfactuals. Interestingly, we also find that LLMs can generate reasonable counterfactuals even with unreasonable demonstrations, which illustrates that demonstrations primarily serve to regulate the output format. This study provides the first comprehensive insight into the counterfactual generation abilities of LLMs, and offers a novel perspective on utilizing LLMs for data augmentation to enhance SLMs.
https://arxiv.org/abs/2305.14791
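As a concrete, hypothetical illustration of the two ingredients found to matter, a counterfactual-generation prompt might combine a task definition with step-by-step instructions as below; the exact wording is ours, not the paper's.

```python
# A hypothetical prompt template reflecting the two ingredients the study finds important:
# an accurate task definition and step-by-step instructions (wording is illustrative only).
TASK_DEFINITION = (
    "Sentiment analysis assigns a 'positive' or 'negative' label to a sentence."
)

PROMPT_TEMPLATE = """{task_definition}

Generate a counterfactual for the example below by following these steps:
1. Identify the words that determine the current label.
2. Minimally edit only those words so the label flips.
3. Keep the rest of the sentence unchanged and fluent.

Example: "{text}" (label: {label})
Counterfactual (label: {target_label}):"""

def build_prompt(text, label, target_label):
    return PROMPT_TEMPLATE.format(
        task_definition=TASK_DEFINITION, text=text, label=label, target_label=target_label
    )

print(build_prompt("The movie was a delight.", "positive", "negative"))
```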
Self-supervised learning (SSL) is the latest breakthrough in speech processing, especially for label-scarce downstream tasks, by leveraging massive unlabeled audio data. The noise robustness of SSL models is one of the important challenges to expanding their application. We can use speech enhancement (SE) to tackle this issue. However, the mismatch between the SE model and SSL models potentially limits its effectiveness. In this work, we propose a new SE training criterion that minimizes the distance between clean and enhanced signals in the feature representation of the SSL model to alleviate the mismatch. We expect that the loss in the SSL domain can guide SE training to preserve or enhance various levels of characteristics of the speech signals that may be required for high-level downstream tasks. Experiments show that our proposal improves the performance of an SE and SSL pipeline on five downstream tasks with noisy input while maintaining the SE performance.
https://arxiv.org/abs/2305.14723
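A minimal sketch of such a criterion, assuming a frozen SSL encoder and framed waveforms: the usual signal-domain loss is augmented with a distance between SSL features of the clean and enhanced signals. The stand-in `ssl_model`, the shapes, and the `lam` weight are illustrative.

```python
# Minimal sketch of the proposed idea as we read it: add a term that matches enhanced and
# clean signals in the feature space of a frozen SSL model. `ssl_model` is a placeholder encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

ssl_model = nn.Sequential(nn.Linear(160, 256), nn.ReLU(), nn.Linear(256, 256))  # stand-in encoder
for p in ssl_model.parameters():
    p.requires_grad_(False)                                     # SSL model stays frozen

def se_loss_with_ssl_term(enhanced, clean, lam=1.0):
    """enhanced / clean: framed waveforms of shape (B, n_frames, 160)."""
    signal_loss = F.l1_loss(enhanced, clean)                    # conventional SE loss
    with torch.no_grad():
        clean_feat = ssl_model(clean)                           # SSL features of the clean signal
    enh_feat = ssl_model(enhanced)                              # gradients reach only the SE output
    ssl_loss = F.mse_loss(enh_feat, clean_feat)
    return signal_loss + lam * ssl_loss

enh = torch.rand(2, 50, 160, requires_grad=True)
cln = torch.rand(2, 50, 160)
se_loss_with_ssl_term(enh, cln).backward()
print(enh.grad.shape)  # torch.Size([2, 50, 160])
```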
Vision-and-language navigation (VLN) agents are trained to navigate in real-world environments by following natural language instructions. A major challenge in VLN is the limited availability of training data, which hinders the models' ability to generalize effectively. Previous approaches have attempted to address this issue by introducing additional supervision during training, often requiring costly human-annotated data that restricts scalability. In this paper, we introduce a masked path modeling (MPM) objective, which pretrains an agent using self-collected data for downstream navigation tasks. Our proposed method involves allowing the agent to actively explore navigation environments without a specific goal and collect the paths it traverses. Subsequently, we train the agent on this collected data to reconstruct the original path given a randomly masked subpath. This way, the agent can actively accumulate a diverse and substantial amount of data while learning conditional action generation. To evaluate the effectiveness of our technique, we conduct experiments on various VLN datasets and demonstrate the versatility of MPM across different levels of instruction complexity. Our results exhibit significant improvements in success rates, with enhancements of 1.32%, 1.05%, and 1.19% on the val-unseen split of the Room-to-Room, Room-for-Room, and Room-across-Room datasets, respectively. Furthermore, we conduct an analysis that highlights the potential for additional improvements when the agent is allowed to explore unseen environments prior to testing.
https://arxiv.org/abs/2305.14268
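A toy version of the masking step might look like the following: hide a contiguous sub-path of a self-collected trajectory and keep the hidden segment as the reconstruction target. The token format and mask ratio are assumptions.

```python
# Toy illustration of the masked path modeling objective: hide a contiguous sub-path from a
# self-collected trajectory and ask the agent to predict it. Names and format are illustrative.
import random

MASK = "<MASK>"

def mask_subpath(path, mask_ratio=0.4, seed=None):
    """path: list of viewpoint ids traversed during free exploration."""
    rng = random.Random(seed)
    span = max(1, int(len(path) * mask_ratio))
    start = rng.randint(0, len(path) - span)
    target = path[start:start + span]                     # what the model must reconstruct
    masked = path[:start] + [MASK] * span + path[start + span:]
    return masked, target

path = ["v12", "v07", "v31", "v02", "v19", "v44"]
masked, target = mask_subpath(path, seed=0)
print(masked)   # e.g. ['v12', '<MASK>', '<MASK>', 'v02', 'v19', 'v44']
print(target)   # the hidden sub-path, e.g. ['v07', 'v31']
```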
Biogenic Volatile Organic Compounds (BVOCs) emitted from the terrestrial ecosystem into the Earth's atmosphere are an important component of atmospheric chemistry. Due to the scarcity of measurements, reliable enhancement of BVOC emission maps can aid in providing denser data for atmospheric chemical, climate, and air quality models. In this work, we propose a strategy to super-resolve coarse BVOC emission maps by simultaneously exploiting the contributions of different compounds. To this purpose, we first accurately investigate the spatial inter-connections between several BVOC species. Then, we exploit the found similarities to build a Multi-Image Super-Resolution (MISR) system, in which a number of emission maps associated with diverse compounds are aggregated to boost Super-Resolution (SR) performance. We compare different configurations regarding the species and the number of joined BVOCs. Our experimental results show that incorporating BVOCs' relationships into the process can substantially improve the accuracy of the super-resolved maps. Interestingly, the best results are achieved when we aggregate the emission maps of strongly uncorrelated compounds. This peculiarity seems to confirm what has already been observed in other data domains, i.e., aggregating uncorrelated information is more helpful than correlated information for boosting MISR performance. Nonetheless, the proposed work represents the first attempt at SR of BVOC emissions through the fusion of multiple different compounds.
https://arxiv.org/abs/2305.14180
The performance of current supervised AI systems is tightly connected to the availability of annotated datasets. Annotations are usually collected through annotation tools, which are often designed for specific tasks and are difficult to customize. Moreover, existing annotation tools with an active learning mechanism often only support limited use cases. To address these limitations, we present EASE, an Easily-Customized Annotation System Powered by Efficiency Enhancement Mechanisms. EASE provides modular annotation units for building customized annotation interfaces and also provides multiple back-end options that suggest annotations using (1) multi-task active learning; (2) demographic-feature-based active learning; (3) a prompt system that can query the API of large language models. We conduct multiple experiments and user studies to evaluate our system's flexibility and effectiveness. Our results show that our system can meet the diverse needs of NLP researchers and significantly accelerate the annotation process.
https://arxiv.org/abs/2305.14169
Low-light image enhancement (LLIE) aims to improve the illuminance of images captured under insufficient light exposure. Recently, various lightweight learning-based LLIE methods have been proposed to handle challenges such as low contrast and low brightness. In this paper, we streamline the architecture of the network to the utmost degree. By utilizing an effective structural re-parameterization technique, a single convolutional layer model (SCLM) is proposed that provides global low-light enhancement as the coarsely enhanced result. In addition, we introduce a local adaptation module that learns a set of shared parameters to accomplish local illumination correction, addressing the issue of varied exposure levels in different image regions. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art LLIE methods in both objective metrics and subjective visual effects. Additionally, our method has fewer parameters and lower inference complexity compared to other learning-based schemes.
https://arxiv.org/abs/2305.14039
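For readers unfamiliar with structural re-parameterization, the sketch below shows the generic RepVGG-style trick (without batch normalization) of folding parallel 3x3 and 1x1 convolutions into a single 3x3 convolution at inference time; it illustrates the technique in general, not the paper's exact SCLM.

```python
# Hedged sketch of structural re-parameterization: two parallel training-time branches
# (a 3x3 and a 1x1 conv) are folded into one 3x3 conv that gives identical outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

c_in, c_out = 8, 8
conv3 = nn.Conv2d(c_in, c_out, 3, padding=1)
conv1 = nn.Conv2d(c_in, c_out, 1, padding=0)

def merged_conv(conv3, conv1):
    merged = nn.Conv2d(c_in, c_out, 3, padding=1)
    # Embed the 1x1 kernel at the centre of a 3x3 kernel, then add the weights and biases.
    w1_padded = F.pad(conv1.weight.detach(), [1, 1, 1, 1])   # (C_out, C_in, 1, 1) -> (C_out, C_in, 3, 3)
    merged.weight.data = conv3.weight.detach() + w1_padded
    merged.bias.data = conv3.bias.detach() + conv1.bias.detach()
    return merged

x = torch.rand(2, c_in, 16, 16)
two_branch = conv3(x) + conv1(x)
single = merged_conv(conv3, conv1)(x)
print(torch.allclose(two_branch, single, atol=1e-6))          # True: one conv suffices at inference
```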
In this paper, we delve deeper into the Kullback-Leibler (KL) Divergence loss and observe that it is equivalent to the Decoupled Kullback-Leibler (DKL) Divergence loss, which consists of 1) a weighted Mean Square Error (wMSE) loss and 2) a Cross-Entropy loss incorporating soft labels. From our analysis of the DKL loss, we identify two areas for improvement. Firstly, we address the limitation of DKL in scenarios like knowledge distillation by breaking its asymmetry property in training optimization. This modification ensures that the wMSE component is always effective during training, providing extra constructive cues. Secondly, we introduce global information into DKL for intra-class consistency regularization. With these two enhancements, we derive the Improved Kullback-Leibler (IKL) Divergence loss and evaluate its effectiveness by conducting experiments on the CIFAR-10/100 and ImageNet datasets, focusing on adversarial training and knowledge distillation tasks. The proposed approach achieves new state-of-the-art performance on both tasks, demonstrating its substantial practical merits. Code and models will be available soon at this https URL.
https://arxiv.org/abs/2305.13948
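For reference, the standard soft-label KL distillation loss that the paper re-analyses can be written as below; the wMSE-plus-cross-entropy decomposition and the IKL modifications are derived in the paper and are not reproduced here.

```python
# Standard KL-divergence distillation loss of the kind the paper decomposes and improves.
import torch
import torch.nn.functional as F

def kl_distillation_loss(student_logits, teacher_logits, T=4.0):
    """Soft-label KL loss between teacher and student distributions at temperature T."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # batchmean reduction plus the T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)

student = torch.randn(16, 100, requires_grad=True)
teacher = torch.randn(16, 100)
kl_distillation_loss(student, teacher).backward()
print(student.grad.shape)  # torch.Size([16, 100])
```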
We propose SE-Bridge, a novel method for speech enhancement (SE). Following the recent application of diffusion models to speech enhancement, SE can be achieved by solving a stochastic differential equation (SDE). Each SDE corresponds to a probability flow ordinary differential equation (PF-ODE), and the trajectory of the PF-ODE solution consists of the speech states at different moments. Our approach is based on a consistency model that ensures any speech states on the same PF-ODE trajectory correspond to the same initial state. By integrating the Brownian bridge process, the model is able to generate high-intelligibility speech samples without adversarial training. This is the first attempt to apply consistency models to the SE task, achieving state-of-the-art results on several metrics while reducing the sampling time 15x compared to the diffusion-based baseline. Our experiments on multiple datasets demonstrate the effectiveness of SE-Bridge for SE. Furthermore, we show through extensive experiments on downstream tasks, including automatic speech recognition (ASR) and speaker verification (SV), that SE-Bridge can effectively support multiple downstream tasks.
https://arxiv.org/abs/2305.13796
Multi-spectral vehicle re-identification aims to address the challenge of identifying vehicles in complex lighting conditions by incorporating complementary visible and infrared information. However, in harsh environments, the discriminative cues in RGB and NIR modalities are often lost due to strong flares from vehicle lamps or sunlight, and existing multi-modal fusion methods are limited in their ability to recover these important cues. To address this problem, we propose a Flare-Aware Cross-modal Enhancement Network that adaptively restores flare-corrupted RGB and NIR features with guidance from the flare-immunized thermal infrared spectrum. First, to reduce the influence of locally degraded appearance due to intense flare, we propose a Mutual Flare Mask Prediction module to jointly obtain flare-corrupted masks in RGB and NIR modalities in a self-supervised manner. Second, to use the flare-immunized TI information to enhance the masked RGB and NIR, we propose a Flare-Aware Cross-modal Enhancement module that adaptively guides feature extraction of masked RGB and NIR spectra with prior flare-immunized knowledge from the TI spectrum. Third, to extract common informative semantic information from RGB and NIR, we propose an Inter-modality Consistency loss that enforces semantic consistency between the two modalities. Finally, to evaluate the proposed FACENet in handling intense flare, we introduce a new multi-spectral vehicle re-ID dataset, called WMVEID863, with additional challenges such as motion blur, significant background changes, and particularly intense flare degradation. Comprehensive experiments on both the newly collected dataset and public benchmark multi-spectral vehicle re-ID datasets demonstrate the superior performance of the proposed FACENet compared to state-of-the-art methods, especially in handling strong flares. The code and dataset will be released soon.
https://arxiv.org/abs/2305.13659
Despite their impressive capabilities, diffusion-based text-to-image (T2I) models can lack faithfulness to the text prompt, where generated images may not contain all the mentioned objects, attributes or relations. To alleviate these issues, recent works proposed post-hoc methods to improve model faithfulness without costly retraining, by modifying how the model utilizes the input prompt. In this work, we take a step back and show that large T2I diffusion models are more faithful than usually assumed, and can generate images faithful to even complex prompts without the need to manipulate the generative process. Based on that, we show how faithfulness can be simply treated as a candidate selection problem instead, and introduce a straightforward pipeline that generates candidate images for a text prompt and picks the best one according to an automatic scoring system that can leverage already existing T2I evaluation metrics. Quantitative comparisons alongside user studies on diverse benchmarks show consistently improved faithfulness over post-hoc enhancement methods, with comparable or lower computational cost. Code is available at \url{this https URL}.
https://arxiv.org/abs/2305.13308
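The candidate-selection pipeline reduces to a best-of-N loop: sample several images per prompt and keep the one with the highest automatic faithfulness score. In this sketch, `generate_image` and `faithfulness_score` are placeholders for a T2I model and an existing image-text metric, not the paper's exact scoring system.

```python
# Sketch of faithfulness as candidate selection: generate N candidates and keep the best-scored one.
import random

def generate_image(prompt, seed):
    # Placeholder: in practice, call a diffusion model with this prompt and seed.
    return {"prompt": prompt, "seed": seed}

def faithfulness_score(image, prompt):
    # Placeholder: in practice, score image-prompt agreement with an existing T2I evaluation metric.
    return random.random()

def best_of_n(prompt, n=8):
    candidates = [generate_image(prompt, seed) for seed in range(n)]
    return max(candidates, key=lambda img: faithfulness_score(img, prompt))

print(best_of_n("a red cube on top of a blue sphere"))
```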
Multispectral methods have gained considerable attention due to their promising performance across various fields. However, most existing methods cannot effectively utilize information from two modalities while optimizing time efficiency. These methods often prioritize accuracy or time efficiency, leaving room for improvement in their performance. To this end, we propose a new method, bright channel prior (BCP) attention, for enhancing pedestrian detection in low-light conditions by integrating image enhancement and detection within a unified framework. The method uses the V-channel of the HSV representation of the thermal image as an attention map to trigger an unsupervised auto-encoder for visible-light images, which gradually emphasizes pedestrian features across layers. Moreover, we utilize an unsupervised bright channel prior algorithm to address light compensation in low-light images. The proposed method includes a self-attention enhancement module and a detection module, which work together to improve object detection. An initial illumination map is estimated using the BCP, guiding the learning of the self-attention map from the enhancement network to obtain a more informative representation focused on pedestrians. Extensive experiments demonstrate the effectiveness of the proposed method.
https://arxiv.org/abs/2305.12845
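By analogy with the dark channel prior, a bright channel prior can be computed as a per-pixel maximum over colour channels followed by a local maximum filter, as sketched below; how the paper turns this estimate into an illumination map and attention signal may differ in detail.

```python
# Illustrative bright channel prior (BCP) computation, by analogy with the dark channel prior:
# per-pixel maximum over colour channels followed by a local maximum filter.
import numpy as np
from scipy.ndimage import maximum_filter

def bright_channel(image, patch_size=15):
    """image: float array in [0, 1] with shape (H, W, 3). Returns an (H, W) bright channel map."""
    per_pixel_max = image.max(axis=2)                         # brightest colour channel per pixel
    return maximum_filter(per_pixel_max, size=patch_size)     # brightest value within a local patch

img = np.random.rand(64, 64, 3)
illumination_estimate = bright_channel(img)
print(illumination_estimate.shape, illumination_estimate.min(), illumination_estimate.max())
```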
Camouflaged objects are typically assimilated into their backgrounds and exhibit fuzzy boundaries. The complex environmental conditions and the high intrinsic similarity between camouflaged targets and their surroundings pose significant challenges in accurately locating and segmenting these objects in their entirety. While existing methods have demonstrated remarkable performance in various real-world scenarios, they still face limitations when confronted with difficult cases, such as small targets, thin structures, and indistinct boundaries. Drawing inspiration from human visual perception when observing images containing camouflaged objects, we propose a three-stage model that enables coarse-to-fine segmentation in a single iteration. Specifically, our model employs three decoders to sequentially process subsampled features, cropped features, and high-resolution original features. This proposed approach not only reduces computational overhead but also mitigates interference caused by background noise. Furthermore, considering the significance of multi-scale information, we have designed a multi-scale feature enhancement module that enlarges the receptive field while preserving detailed structural cues. Additionally, a boundary enhancement module has been developed to enhance performance by leveraging boundary information. Subsequently, a mask-guided fusion module is proposed to generate fine-grained results by integrating coarse prediction maps with high-resolution feature maps. Our network surpasses state-of-the-art CNN-based counterparts without unnecessary complexities. Upon acceptance of the paper, the source code will be made publicly available at this https URL.
https://arxiv.org/abs/2305.12635
Real-world complex acoustic environments, especially those with a low signal-to-noise ratio (SNR), bring tremendous challenges to a keyword spotting (KWS) system. Inspired by recent advances in neural speech enhancement and context bias in speech recognition, we propose a robust audio-context-bias-based DCCRN-KWS model to address this challenge. We form the whole architecture as a multi-task learning framework for both denoising and keyword spotting, where the DCCRN encoder is connected with the KWS model. Aided by the denoising task, we further introduce an audio context bias module to leverage real keyword samples and bias the network to better discriminate keywords in noisy conditions. Feature merge and complex context linear modules are also introduced to strengthen such discrimination and to effectively leverage contextual information, respectively. Experiments on an internal challenging dataset and the HIMIYA public dataset show that our DCCRN-KWS system is superior in performance, while an ablation study demonstrates the good design of the whole model.
https://arxiv.org/abs/2305.12331
Entity Alignment (EA) aims to find the equivalent entities between two Knowledge Graphs (KGs). Existing methods usually encode the triples of entities as embeddings and learn to align the embeddings, which prevents the direct interaction between the original information of the cross-KG entities. Moreover, they encode the relational triples and attribute triples of an entity in heterogeneous embedding spaces, which prevents them from helping each other. In this paper, we transform both triples into unified textual sequences, and model the EA task as a bi-directional textual entailment task between the sequences of cross-KG entities. Specifically, we feed the sequences of two entities simultaneously into a pre-trained language model (PLM) and propose two kinds of PLM-based entity aligners that model the entailment probability between sequences as the similarity between entities. Our approach captures the unified correlation pattern of two kinds of information between entities, and explicitly models the fine-grained interaction between original entity information. The experiments on five cross-lingual EA datasets show that our approach outperforms the state-of-the-art EA methods and enables the mutual enhancement of the heterogeneous information. Codes are available at this https URL.
https://arxiv.org/abs/2305.11501
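To illustrate the entailment-as-similarity idea, the sketch below serializes each entity's triples into a sequence and scores a pair in both directions with an off-the-shelf NLI cross-encoder; the paper instead fine-tunes its own PLM-based aligners, so the model choice and label indexing here are assumptions for demonstration only.

```python
# Sketch of scoring a cross-KG entity pair as bi-directional textual entailment with a generic
# NLI cross-encoder; the paper trains dedicated PLM-based aligners rather than using this model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "roberta-large-mnli"      # assumed label order: 0 contradiction, 1 neutral, 2 entailment
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def serialize(entity):
    """Flatten an entity's relational and attribute triples into one textual sequence."""
    return " ; ".join(f"{r} {t}" for r, t in entity["triples"])

def entailment_similarity(ent_a, ent_b):
    """Average of the two directional entailment probabilities between the entity sequences."""
    seq_a, seq_b = serialize(ent_a), serialize(ent_b)
    scores = []
    for premise, hypothesis in [(seq_a, seq_b), (seq_b, seq_a)]:
        inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)
        scores.append(probs[0, 2].item())          # probability of the "entailment" class
    return sum(scores) / 2

e1 = {"triples": [("capital of", "Germany"), ("population", "3.6 million")]}
e2 = {"triples": [("is capital city of", "Deutschland"), ("inhabitants", "about 3,600,000")]}
print(entailment_similarity(e1, e2))
```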