Existing video captioning approaches typically require first sampling video frames from a decoded video and then conducting subsequent processing (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may ignore key information in videos and thus degrade performance. Additionally, redundant information in the sampled frames makes video captioning inference inefficient. Addressing this, we study video captioning from a different perspective, in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient at inference because it processes a smaller, less redundant input. We propose a simple yet effective end-to-end transformer that learns directly from the compressed video for captioning. We show that even with a simple design, our method achieves state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at this https URL.
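To make the compressed-domain design concrete, below is a minimal PyTorch sketch of how the three compressed-video streams could be projected and fused into one token sequence for a captioning transformer. All names, dimensions, and token counts (`CompressedVideoEncoder`, `iframe_dim`, etc.) are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (PyTorch) of fusing I-frame features, motion vectors, and residuals
# into one token sequence; shapes and the module name are assumptions for illustration.
import torch
import torch.nn as nn

class CompressedVideoEncoder(nn.Module):
    def __init__(self, d_model=512, iframe_dim=768, mv_dim=2, res_dim=192):
        super().__init__()
        # one projection per compressed-domain stream
        self.proj_iframe = nn.Linear(iframe_dim, d_model)
        self.proj_mv = nn.Linear(mv_dim, d_model)
        self.proj_res = nn.Linear(res_dim, d_model)
        # stream-type embeddings let the transformer tell the modalities apart
        self.type_emb = nn.Embedding(3, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, iframe_tok, mv_tok, res_tok):
        # iframe_tok: (B, Ni, iframe_dim), mv_tok: (B, Nm, mv_dim), res_tok: (B, Nr, res_dim)
        streams = [self.proj_iframe(iframe_tok), self.proj_mv(mv_tok), self.proj_res(res_tok)]
        tokens = torch.cat(
            [s + self.type_emb(torch.full(s.shape[:2], i, dtype=torch.long))
             for i, s in enumerate(streams)],
            dim=1,
        )
        return self.encoder(tokens)  # fused memory for a standard caption decoder

enc = CompressedVideoEncoder()
memory = enc(torch.randn(2, 16, 768), torch.randn(2, 64, 2), torch.randn(2, 64, 192))
print(memory.shape)  # torch.Size([2, 144, 512])
```

Because the motion vectors and residuals are already small and sparse, the whole video can be tokenized this way without the manual frame sampling step of the decoded-video pipeline.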
https://arxiv.org/abs/2309.12867
Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models.
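For concreteness, here is a minimal PyTorch sketch of the two regularizers described above; `student_img` are the student's image embeddings and `teacher_txt` the CLIP teacher's text embeddings for the matching descriptions, both of shape (B, D). The exact formulation in the paper may differ.

```python
# A minimal sketch of absolute- and relative-distance regularization, assuming
# (B, D) student image embeddings and teacher text embeddings.
import torch
import torch.nn.functional as F

def absolute_distance_loss(student_img, teacher_txt):
    # pull each student image embedding toward the teacher text embedding of its own caption
    s = F.normalize(student_img, dim=-1)
    t = F.normalize(teacher_txt, dim=-1)
    return (1.0 - (s * t).sum(dim=-1)).mean()  # mean cosine distance

def relative_distance_loss(student_img, teacher_txt):
    # match the pairwise similarity structure instead of the embeddings themselves,
    # so student and teacher need not share an embedding space
    s = F.normalize(student_img, dim=-1)
    t = F.normalize(teacher_txt, dim=-1)
    sim_s = s @ s.t()  # (B, B) student image-image similarities
    sim_t = t @ t.t()  # (B, B) teacher text-text similarities
    return F.mse_loss(sim_s, sim_t)

# usage (illustrative): total = task_loss + lam * absolute_distance_loss(f_img, g_txt)
```

The absolute variant anchors the student directly to the teacher's semantic space, while the relative variant only constrains the geometry among samples, which is the looser of the two forms of guidance.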
https://arxiv.org/abs/2309.12530
In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio clips and captions. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP. During training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embeddings. To mitigate the modality gap between the audio and text embeddings, we employ strategies to bridge it during both training and inference. We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating that it achieves up to ~$83\%$ of the relative performance of fully supervised approaches trained with paired target data.
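The text-only training / audio-only inference idea can be sketched as follows. The `clap` and `decoder` objects are placeholders for a CLAP model exposing `encode_text`/`encode_audio` in a shared space and a prefix-conditioned caption decoder; Gaussian noise injection is shown as one common way to bridge the modality gap, and may differ from the paper's exact strategies.

```python
# Hedged sketch: train on CLAP *text* embeddings only, decode from CLAP *audio* embeddings.
import torch
import torch.nn.functional as F

def train_step(clap, decoder, captions, noise_std=0.015):
    with torch.no_grad():
        z_text = clap.encode_text(captions)                       # (B, D) text-only supervision
    z_text = F.normalize(z_text, dim=-1)
    z_noisy = z_text + noise_std * torch.randn_like(z_text)       # robustness to the modality gap
    return decoder.loss(prefix=z_noisy, target=captions)          # reconstruct the caption

@torch.no_grad()
def caption_audio(clap, decoder, audio_waveform):
    z_audio = F.normalize(clap.encode_audio(audio_waveform), dim=-1)  # swapped in at inference
    return decoder.generate(prefix=z_audio)
```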
https://arxiv.org/abs/2309.12242
Self-supervised representation learning has seen remarkable progress in the last few years, with some of the recent methods being able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It utilizes two forward passes and a separate loss function for each layer to train the network without backpropagation. In this study, for the first time, we study the performance of forward-forward vs. backpropagation for self-supervised representation learning and provide insights into the learned representation spaces. Our benchmark employs four standard datasets, namely MNIST, F-MNIST, SVHN and CIFAR-10, and three commonly used self-supervised representation learning techniques, namely rotation, flip and jigsaw. Our main finding is that while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, the transfer performance lags significantly behind in all the studied settings. This may be caused by a combination of factors, including having a loss function for each layer and the way the supervised training is realized in the forward-forward paradigm. In comparison to backpropagation, the forward-forward algorithm focuses more on the boundaries and drops part of the information that is unnecessary for making decisions, which harms the representation learning goal. Further investigation and research are necessary to stabilize the forward-forward strategy for self-supervised learning and to make it work beyond the datasets and configurations demonstrated by Geoffrey Hinton.
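Below is a runnable toy sketch of the forward-forward idea referenced above: each layer is trained locally with its own loss on a "positive" and a "negative" forward pass, using the sum of squared activations as the goodness measure (Hinton, 2022). The threshold, learning rate, and how negatives are built vary in practice; this is illustrative only.

```python
# Toy forward-forward layer: local per-layer loss, no gradients flow between layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    def __init__(self, d_in, d_out, threshold=2.0, lr=0.03):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.parameters(), lr=lr)

    def forward(self, x):
        # pass only the direction of the input to the next layer
        x = x / (x.norm(dim=1, keepdim=True) + 1e-6)
        return F.relu(self.fc(x))

    def local_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # goodness of positive data
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # goodness of negative data
        # push positive goodness above the threshold and negative goodness below it
        loss = F.softplus(torch.cat([self.threshold - g_pos, g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # detach: no backpropagation across layers
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()

layers = [FFLayer(784, 256), FFLayer(256, 256)]
x_pos, x_neg = torch.rand(32, 784), torch.rand(32, 784)   # stand-ins for real pos/neg samples
for layer in layers:
    x_pos, x_neg = layer.local_step(x_pos, x_neg)
```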
https://arxiv.org/abs/2309.11955
Text-guided image generation aims to generate desired images conditioned on given texts, while text-guided image manipulation refers to semantically editing parts of a given image based on specified texts. For these two similar tasks, the key point is to ensure image fidelity as well as semantic consistency. Many previous approaches require complex multi-stage generation and adversarial training, while struggling to provide a unified framework for both tasks. In this work, we propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. The proposed method accepts input from images or random noise, corresponding to these two different tasks, and, conditioned on the specific texts, a carefully designed mapping network that exploits the powerful generative capabilities of StyleGAN and the text-image representation capabilities of Contrastive Language-Image Pre-training (CLIP) generates images at resolutions of up to $1024\times1024$. Extensive experiments on the Multi-modal CelebA-HQ dataset demonstrate that our proposed method outperforms existing state-of-the-art methods on both text-guided generation and manipulation tasks.
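An illustrative sketch of the kind of mapping network the framework describes: a CLIP-derived conditioning vector (and, for manipulation, the inverted latent of the input image) is mapped to StyleGAN latents. The 512-d CLIP embedding, the 18x512 W+ space, and the MLP depth are assumptions for illustration, not the paper's exact design.

```python
# Hypothetical mapper from a CLIP text embedding to a StyleGAN W+ latent (or latent offset).
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    def __init__(self, clip_dim=512, w_dim=512, num_ws=18):
        super().__init__()
        self.num_ws, self.w_dim = num_ws, w_dim
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, 1024), nn.LeakyReLU(0.2),
            nn.Linear(1024, num_ws * w_dim),
        )

    def forward(self, clip_text_emb, w_init=None):
        # predict an offset in W+ conditioned on the text; w_init is the inverted latent for
        # manipulation, or a random/mean latent for generation
        delta = self.mlp(clip_text_emb).view(-1, self.num_ws, self.w_dim)
        return delta if w_init is None else w_init + delta

mapper = LatentMapper()
w_plus = mapper(torch.randn(1, 512), w_init=torch.randn(1, 18, 512))
print(w_plus.shape)  # torch.Size([1, 18, 512]) -> fed to a pretrained StyleGAN synthesis network
```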
https://arxiv.org/abs/2309.11923
Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, because the adversarial robustness problem of vision models remains unsolved, introducing vision inputs can expose MLLMs to more severe safety and security risks. In this work, we study the adversarial robustness of Google's Bard, a competitive chatbot to ChatGPT that recently released its multimodal capability, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard into outputting wrong image descriptions with a 22% success rate based solely on transferability. We show that the adversarial examples can also attack other MLLMs, e.g., with a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE Bot. Moreover, we identify two defense mechanisms of Bard, including face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding of the robustness of MLLMs and facilitate future research on defenses. Our code is available at this https URL.
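A sketch of the transfer-attack recipe described above: run PGD on a white-box surrogate image encoder so that the adversarial image's features move away from the clean image's features, then submit the result to the black-box MLLM. The surrogate is any differentiable `encoder(images) -> features`; the epsilon, step size, and step count below are illustrative, not the paper's settings.

```python
# Feature-deviation PGD against a surrogate vision encoder (L_inf budget, pixel range [0, 1]).
import torch
import torch.nn.functional as F

def feature_deviation_attack(encoder, images, eps=16 / 255, alpha=1 / 255, steps=100):
    images = images.clone().detach()
    with torch.no_grad():
        clean_feat = encoder(images)
    delta = torch.zeros_like(images).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        adv_feat = encoder((images + delta).clamp(0, 1))
        # negative cosine similarity; ascending it pushes features away from the clean image
        loss = -F.cosine_similarity(adv_feat.flatten(1), clean_feat.flatten(1)).mean()
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)          # stay within the L_inf perturbation budget
            delta.grad.zero_()
    return (images + delta).detach().clamp(0, 1)
```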
https://arxiv.org/abs/2309.11751
Referenceless metrics (e.g., CLIPScore) use pretrained vision-language models to assess image descriptions directly without costly ground-truth reference texts. Such methods can facilitate rapid progress, but only if they truly align with human preference judgments. In this paper, we introduce ContextRef, a benchmark for assessing referenceless metrics for such alignment. ContextRef has two components: human ratings along a variety of established quality dimensions, and ten diverse robustness checks designed to uncover fundamental weaknesses. A crucial aspect of ContextRef is that images and descriptions are presented in context, reflecting prior work showing that context is important for description quality. Using ContextRef, we assess a variety of pretrained models, scoring functions, and techniques for incorporating context. None of the methods is successful with ContextRef, but we show that careful fine-tuning yields substantial improvements. ContextRef remains a challenging benchmark though, in large part due to the challenge of context dependence.
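For concreteness, this is the kind of referenceless scoring function the benchmark probes, in the spirit of CLIPScore: a description is scored by its rescaled, clipped cosine similarity to the image in a pretrained joint embedding space. `encode_image`/`encode_text` stand for any CLIP-style encoders; the 2.5 rescaling follows Hessel et al. (2021).

```python
# CLIPScore-style referenceless scoring: score = w * max(cos(image, text), 0), with w = 2.5.
import torch
import torch.nn.functional as F

def clipscore(encode_image, encode_text, image, description, w=2.5):
    with torch.no_grad():
        v = F.normalize(encode_image(image), dim=-1)       # (1, D) image embedding
        c = F.normalize(encode_text(description), dim=-1)  # (1, D) text embedding
    cos = (v * c).sum(dim=-1)
    return (w * torch.clamp(cos, min=0.0)).item()          # no ground-truth captions needed
```

ContextRef's point is that such scores are computed per image-description pair, without access to the surrounding context, which is one reason they can diverge from human judgments.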
https://arxiv.org/abs/2309.11710
As 3D human pose estimation can now be achieved with very high accuracy in the supervised learning scenario, tackling the case where 3D pose annotations are not available has received increasing attention. In particular, several methods have proposed to learn image representations in a self-supervised fashion so as to disentangle the appearance information from the pose one. The methods then only need a small amount of supervised data to train a pose regressor using the pose-related latent vector as input, as it should be free of appearance information. In this paper, we carry out in-depth analysis to understand to what degree the state-of-the-art disentangled representation learning methods truly separate the appearance information from the pose one. First, we study disentanglement from the perspective of the self-supervised network, via diverse image synthesis experiments. Second, we investigate disentanglement with respect to the 3D pose regressor following an adversarial attack perspective. Specifically, we design an adversarial strategy focusing on generating natural appearance changes of the subject, and against which we could expect a disentangled network to be robust. Altogether, our analyses show that disentanglement in the three state-of-the-art disentangled representation learning frameworks is far from complete, and that their pose codes contain significant appearance information. We believe that our approach provides a valuable testbed to evaluate the degree of disentanglement of pose from appearance in self-supervised 3D human pose estimation.
https://arxiv.org/abs/2309.11667
While most modern video understanding models operate on short-range clips, real-world videos are often several minutes long with semantically consistent segments of variable length. A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length and aggregating the outputs. This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative. In this paper, we aim to provide a generic and adaptive sampling approach for long-form videos in lieu of the de facto uniform sampling. Viewing videos as semantically consistent segments, we formulate a task-agnostic, unsupervised, and scalable approach based on Kernel Temporal Segmentation (KTS) for sampling and tokenizing long videos. We evaluate our method on long-form video understanding tasks such as video classification and temporal action localization, showing consistent gains over existing approaches and achieving state-of-the-art performance on long-form video modeling.
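The adaptive sampling idea can be illustrated with the simplified stand-in below: instead of uniform fixed-length clips, detect boundaries where consecutive frame features change sharply and sample one representative index per resulting variable-length segment. This greedy heuristic is only a sketch of segment-based sampling; the paper uses Kernel Temporal Segmentation (KTS) proper, and the thresholding scheme here is an assumption.

```python
# Simplified segment-then-sample heuristic standing in for KTS-based adaptive sampling.
import numpy as np

def segment_and_sample(frame_features, min_seg_len=8, z=1.5):
    # frame_features: (T, D) per-frame embeddings from any visual encoder
    f = frame_features / (np.linalg.norm(frame_features, axis=1, keepdims=True) + 1e-8)
    dist = 1.0 - np.sum(f[1:] * f[:-1], axis=1)          # cosine distance between neighbours
    thresh = dist.mean() + z * dist.std()                # adaptive change-point threshold
    boundaries = [0]
    for t, d in enumerate(dist, start=1):
        if d > thresh and t - boundaries[-1] >= min_seg_len:
            boundaries.append(t)
    boundaries.append(len(f))
    # one representative index per segment, instead of uniformly spaced fixed-length clips
    return [(lo + hi) // 2 for lo, hi in zip(boundaries[:-1], boundaries[1:])]

samples = segment_and_sample(np.random.randn(300, 512))
print(samples)  # a handful of indices, one per detected segment
```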
https://arxiv.org/abs/2309.11569
The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning community, the present audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative and automatic audio caption generation pipeline based on a series of public tools or APIs, and construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on our dataset and show performance improvements on various downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at this https URL.
https://arxiv.org/abs/2309.11500
We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.
https://arxiv.org/abs/2309.11419
Modern smart glasses leverage advanced audio sensing and machine learning technologies to offer real-time transcribing and captioning services, considerably enriching human experiences in daily communications. However, such systems frequently encounter challenges related to environmental noise, degrading speech recognition and speaker change detection. To improve voice quality, this work investigates directional source separation using a multi-microphone array. We first explore multiple beamformers to assist source separation modeling by strengthening the directional properties of speech signals. In addition to relying on predetermined beamformers, we investigate neural beamforming in multi-channel source separation, demonstrating that automatically learning directional characteristics effectively improves separation quality. We further compare ASR performance on the separated outputs with that on the noisy inputs. Our results show that directional source separation benefits ASR for the wearer but not for the conversation partner. Lastly, we perform joint training of the directional source separation and ASR models, achieving the best overall ASR performance.
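To make the "predetermined beamformer" idea concrete, below is a minimal delay-and-sum beamformer: signals from a known microphone geometry are time-aligned toward a chosen look direction and averaged, which strengthens speech arriving from that direction. The far-field assumption, integer-sample delays, sample rate, and geometry are illustrative; neural beamforming replaces such fixed delays and weights with learned ones.

```python
# Minimal delay-and-sum beamformer for a multi-microphone array (far-field, integer delays).
import numpy as np

def delay_and_sum(mic_signals, mic_positions, direction, fs=16000, c=343.0):
    # mic_signals: (M, T) array, mic_positions: (M, 3) in metres,
    # direction: 3-vector pointing from the array toward the desired source
    direction = np.asarray(direction, dtype=float)
    direction /= np.linalg.norm(direction)
    # far-field arrival-time differences, converted to sample delays
    delays = mic_positions @ direction / c * fs
    delays -= delays.min()                       # make all delays non-negative
    out = np.zeros(mic_signals.shape[1])
    for sig, d in zip(mic_signals, np.round(delays).astype(int)):
        out += np.roll(sig, d)                   # delay early arrivals to align them (fractional delays omitted)
    return out / len(mic_signals)

# usage (illustrative): y = delay_and_sum(x, pos, direction=[1.0, 0.0, 0.0])  # steer toward the wearer
```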
https://arxiv.org/abs/2309.10993
Despite an exciting new wave of multimodal machine learning models, current approaches still struggle to interpret the complex contextual relationships between the different modalities present in videos. Going beyond existing methods that emphasize simple activities or objects, we propose a new model-agnostic approach for generating detailed textual descriptions that captures multimodal video information. Our method leverages the extensive knowledge learnt by large language models, such as GPT-3.5 or Llama2, to reason about textual descriptions of the visual and aural modalities, obtained from BLIP-2, Whisper and ImageBind. Without needing additional finetuning of video-text models or datasets, we demonstrate that available LLMs have the ability to use these multimodal textual descriptions as proxies for ``sight'' or ``hearing'' and perform zero-shot multimodal classification of videos in-context. Our evaluations on popular action recognition benchmarks, such as UCF-101 or Kinetics, show these context-rich descriptions can be successfully used in video understanding tasks. This method points towards a promising new research direction in multimodal classification, demonstrating how an interplay between textual, visual and auditory machine learning models can enable more holistic video understanding.
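A sketch of the model-agnostic recipe described above: per-frame captions (e.g., from BLIP-2) and an audio transcript or description (e.g., from Whisper or ImageBind-based tagging) are folded into a single text prompt, and an instruction-following LLM classifies the video in-context. `llm_complete` is a placeholder for any completion API or local model; the prompt wording is an assumption.

```python
# Zero-shot video classification via textual proxies for "sight" and "hearing".
def build_prompt(frame_captions, audio_description, labels):
    visual = "\n".join(f"- t={i}: {c}" for i, c in enumerate(frame_captions))
    return (
        "You are given textual descriptions of a video.\n"
        f"Visual descriptions over time:\n{visual}\n"
        f"Audio description: {audio_description}\n"
        f"Choose the single most likely action from this list: {', '.join(labels)}.\n"
        "Answer with the label only."
    )

def classify_video(llm_complete, frame_captions, audio_description, labels):
    prompt = build_prompt(frame_captions, audio_description, labels)
    answer = llm_complete(prompt).strip()
    # fall back to simple matching in case the LLM adds extra words
    return next((l for l in labels if l.lower() in answer.lower()), answer)
```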
https://arxiv.org/abs/2309.10783
Adversarial purification using generative models demonstrates strong adversarial defense performance. These methods are classifier- and attack-agnostic, making them versatile but often computationally intensive. Recent strides in diffusion and score networks have improved image generation and, by extension, adversarial purification. Another highly efficient class of adversarial defense methods, known as adversarial training, requires specific knowledge of attack vectors and must be trained extensively on adversarial examples. To overcome these limitations, we introduce a new framework, namely Language Guided Adversarial Purification (LGAP), utilizing pre-trained diffusion models and caption generators to defend against adversarial attacks. Given an input image, our method first generates a caption, which is then used to guide the adversarial purification process through a diffusion network. Our approach has been evaluated against strong adversarial attacks, proving its effectiveness in enhancing adversarial robustness. Our results indicate that LGAP outperforms most existing adversarial defense techniques without requiring specialized network training. This underscores the generalizability of models trained on large datasets, highlighting a promising direction for further research.
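A hedged sketch of the two-step idea: caption the (possibly adversarial) input, then use the caption to guide a diffusion-based clean-up of the image before it reaches the classifier. Here an off-the-shelf BLIP captioner and a Stable Diffusion img2img pipeline stand in for the paper's components; the model IDs, strength, and guidance values are illustrative assumptions.

```python
# Caption-guided purification with off-the-shelf components (stand-ins, not the paper's models).
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
cap_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
cap_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)
purifier = StableDiffusionImg2ImgPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

def language_guided_purify(pil_image, strength=0.3, guidance_scale=7.5):
    # 1) describe the input image
    inputs = cap_proc(images=pil_image, return_tensors="pt").to(device)
    caption = cap_proc.decode(cap_model.generate(**inputs)[0], skip_special_tokens=True)
    # 2) lightly re-generate the image conditioned on that caption; a small strength keeps
    #    the content while washing out high-frequency adversarial noise
    return purifier(prompt=caption, image=pil_image, strength=strength,
                    guidance_scale=guidance_scale).images[0]

# the purified image is then passed to the unmodified downstream classifier
```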
https://arxiv.org/abs/2309.10348
Dynamic magnetic resonance imaging (DMRI) is an effective imaging tool for diagnosis tasks that require motion tracking of a certain anatomy. To speed up DMRI acquisition, k-space measurements are commonly undersampled along the spatial or spatial-temporal domains. The difficulty of recovering useful information increases with increasing undersampling ratios. Compressed sensing was invented for this purpose and remained the most popular method until deep learning (DL) based DMRI reconstruction methods emerged in the past decade. Nevertheless, existing DL networks are still limited in long-range sequential dependency understanding and computational efficiency and are not fully automated. Considering the success of the Transformer's positional embedding and "swin window" self-attention mechanism in the vision community, especially in natural video understanding, we propose a novel architecture named Reconstruction Swin Transformer (RST) for 4D MRI. RST inherits the backbone design of the Video Swin Transformer, with a novel reconstruction head introduced to restore pixel-wise intensity. A convolutional network called SADXNet is used for rapid initialization of 2D MR frames before RST learning, effectively reducing model complexity, GPU hardware demand, and training time. Experimental results on the cardiac 4D MR dataset further substantiate the superiority of RST, achieving the lowest RMSE of 0.0286 +/- 0.0199 and 1 - SSIM of 0.0872 +/- 0.0783 on 9-times-accelerated validation sequences.
https://arxiv.org/abs/2309.10227
Offline handwriting recognition (HWR) has improved significantly with the advent of deep learning architectures in recent years. Nevertheless, it remains a challenging problem and practical applications often rely on post-processing techniques for restricting the predicted words via lexicons or language models. Despite their enhanced performance, such systems are less usable in contexts where out-of-vocabulary words are anticipated, e.g. for detecting misspelled words in school assessments. To that end, we introduce the task of comparing a handwriting image to text. To solve the problem, we propose an unrestricted binary classifier, consisting of a HWR feature extractor and a multimodal classification head which convolves the feature extractor output with the vector representation of the input text. Our model's classification head is trained entirely on synthetic data created using a state-of-the-art generative adversarial network. We demonstrate that, while maintaining high recall, the classifier can be calibrated to achieve an average precision increase of 19.5% compared to addressing the task by directly using state-of-the-art HWR models. Such massive performance gains can lead to significant productivity increases in applications utilizing human-in-the-loop automation.
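An illustrative sketch of the classification head described above: the feature sequence extracted from the handwriting image is convolved with a kernel derived from the candidate text's vector representation, and the result is pooled into a single match/no-match logit. The dimensions and the way the text kernel is formed are assumptions, not the paper's exact head.

```python
# Multimodal match head: convolve HWR features with a text-conditioned dynamic kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageTextMatcher(nn.Module):
    def __init__(self, feat_dim=256, text_dim=128, kernel_width=5):
        super().__init__()
        self.kernel_width = kernel_width
        # map the text vector to a depthwise 1D kernel applied over the image feature sequence
        self.to_kernel = nn.Linear(text_dim, feat_dim * kernel_width)
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, hwr_feats, text_vec):
        # hwr_feats: (B, feat_dim, W) sequence from the HWR feature extractor; text_vec: (B, text_dim)
        B, C, W = hwr_feats.shape
        kernels = self.to_kernel(text_vec).view(B * C, 1, self.kernel_width)
        x = hwr_feats.reshape(1, B * C, W)             # grouped-conv trick: one group per sample-channel
        x = F.conv1d(x, kernels, groups=B * C, padding=self.kernel_width // 2)
        x = x.view(B, C, W).mean(dim=2)                # pool over width
        return self.classifier(x).squeeze(1)           # logit: does the text match the handwriting?

m = ImageTextMatcher()
logit = m(torch.randn(4, 256, 32), torch.randn(4, 128))
print(logit.shape)  # torch.Size([4])
```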
https://arxiv.org/abs/2309.10158
We present RECAP (REtrieval-Augmented Audio CAPtioning), a novel and effective audio captioning system that generates captions conditioned on an input audio and other captions similar to the audio retrieved from a datastore. Additionally, our proposed method can transfer to any domain without the need for any additional fine-tuning. To generate a caption for an audio sample, we leverage an audio-text model CLAP to retrieve captions similar to it from a replaceable datastore, which are then used to construct a prompt. Next, we feed this prompt to a GPT-2 decoder and introduce cross-attention layers between the CLAP encoder and GPT-2 to condition the audio for caption generation. Experiments on two benchmark datasets, Clotho and AudioCaps, show that RECAP achieves competitive performance in in-domain settings and significant improvements in out-of-domain settings. Additionally, due to its capability to exploit a large text-captions-only datastore in a \textit{training-free} fashion, RECAP shows unique capabilities of captioning novel audio events never seen during training and compositional audios with multiple events. To promote research in this space, we also release 150,000+ new weakly labeled captions for AudioSet, AudioCaps, and Clotho.
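The retrieval step and prompt construction can be sketched as follows. The datastore is a matrix of precomputed CLAP text embeddings with their captions, and `encode_audio` stands for the CLAP audio encoder; the prompt format and k are illustrative. In RECAP the prompt then conditions a GPT-2 decoder that also cross-attends to the CLAP audio embedding.

```python
# Retrieval-augmented prompting: nearest datastore captions (by CLAP similarity) form the prompt.
import torch
import torch.nn.functional as F

def retrieve_similar_captions(encode_audio, audio, datastore_embs, datastore_caps, k=4):
    with torch.no_grad():
        q = F.normalize(encode_audio(audio), dim=-1)    # (1, D) query audio embedding
        d = F.normalize(datastore_embs, dim=-1)         # (N, D) caption embeddings
        topk = (q @ d.t()).squeeze(0).topk(k).indices   # nearest captions by cosine similarity
    return [datastore_caps[i] for i in topk.tolist()]

def build_prompt(similar_captions):
    context = " ".join(f"Similar caption: {c}." for c in similar_captions)
    return f"{context} This audio can be described as"
```

Swapping the datastore (e.g., Clotho to AudioCaps, or a text-only in-domain corpus) is what gives the training-free transfer to new domains described in the abstract.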
https://arxiv.org/abs/2309.09836
Data-driven approaches hold promise for audio captioning. However, the development of audio captioning methods can be biased due to the limited availability and quality of text-audio data. This paper proposes a SynthAC framework, which leverages recent advances in audio generative models and commonly available text corpus to create synthetic text-audio pairs, thereby enhancing text-audio representation. Specifically, the text-to-audio generation model, i.e., AudioLDM, is used to generate synthetic audio signals with captions from an image captioning dataset. Our SynthAC expands the availability of well-annotated captions from the text-vision domain to audio captioning, thus enhancing text-audio representation by learning relations within synthetic text-audio pairs. Experiments demonstrate that our SynthAC framework can benefit audio captioning models by incorporating well-annotated text corpus from the text-vision domain, offering a promising solution to the challenge caused by data scarcity. Furthermore, SynthAC can be easily adapted to various state-of-the-art methods, leading to substantial performance improvements.
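A sketch of the synthetic-pair generation step: captions taken from an image-captioning corpus are fed to an off-the-shelf text-to-audio model (AudioLDM) to produce audio, yielding synthetic text-audio pairs for captioning training. The checkpoint id, inference steps, clip length, and output sample rate below are illustrative assumptions; the paper's generation settings may differ.

```python
# Generating synthetic text-audio pairs from captions with a text-to-audio diffusion model.
import soundfile as sf
from diffusers import AudioLDMPipeline

pipe = AudioLDMPipeline.from_pretrained("cvssp/audioldm-s-full-v2")

def synthesize_pairs(captions, out_prefix="synthac", seconds=5.0):
    pairs = []
    for i, caption in enumerate(captions):
        audio = pipe(caption, num_inference_steps=50, audio_length_in_s=seconds).audios[0]
        path = f"{out_prefix}_{i:06d}.wav"
        sf.write(path, audio, samplerate=16000)   # assumed 16 kHz output
        pairs.append({"audio": path, "caption": caption})
    return pairs  # used alongside (or to pre-train before) the real text-audio data
```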
https://arxiv.org/abs/2309.09705
As the most critical components in a sentence, subject, predicate and object require special attention in the video captioning task. To implement this idea, we design a novel framework, named COllaborative three-Stream Transformers (COST), to model the three parts separately and complement each other for better representation. Specifically, COST is formed by three branches of transformers to exploit the visual-linguistic interactions of different granularities in spatial-temporal domain between videos and text, detected objects and text, and actions and text. Meanwhile, we propose a cross-granularity attention module to align the interactions modeled by the three branches of transformers, then the three branches of transformers can support each other to exploit the most discriminative semantic information of different granularities for accurate predictions of captions. The whole model is trained in an end-to-end fashion. Extensive experiments conducted on three large-scale challenging datasets, i.e., YouCookII, ActivityNet Captions and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods.
https://arxiv.org/abs/2309.09611
The excellent text-to-image synthesis capability of diffusion models has driven progress in synthesizing coherent visual stories. The current state-of-the-art method combines the features of historical captions, historical frames, and the current captions as conditions for generating the current frame. However, this method treats every historical frame and caption as contributing equally, connecting them in order with equal weights and ignoring that not all historical conditions are associated with the generation of the current frame. To address this issue, we propose Causal-Story. This model incorporates a local causal attention mechanism that considers the causal relationship between previous captions, frames, and current captions. By assigning weights based on this relationship, Causal-Story generates the current frame, thereby improving the global consistency of story generation. We evaluated our model on the PororoSV and FlintstonesSV datasets and obtained state-of-the-art FID scores, and the generated frames also demonstrate better storytelling in visuals. The source code of Causal-Story can be obtained from this https URL.
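A generic sketch of the local causal attention idea: when generating frame t, the model attends only to conditions up to step t (a causal mask), and the contribution of each historical caption/frame is re-weighted by a learned relevance score rather than being fixed and equal. This is an illustration of the mechanism, not the paper's exact layer.

```python
# Causal attention over story history with a learned per-condition relevance bias.
import torch
import torch.nn as nn

class LocalCausalAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.relevance = nn.Linear(d_model, 1)      # scores how related a past condition is

    def forward(self, frame_queries, history):
        # frame_queries: (B, T, d), one query per frame step; history: (B, T, d),
        # where history[:, j] is the caption/frame condition of step j
        B, T, _ = history.shape
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)   # block future steps
        bias = self.relevance(history).transpose(1, 2)                        # (B, 1, T) relevance
        attn_mask = bias.expand(B, T, T).masked_fill(causal.unsqueeze(0).expand(B, T, T),
                                                     float("-inf"))
        attn_mask = attn_mask.repeat_interleave(self.attn.num_heads, dim=0)   # (B*heads, T, T)
        out, _ = self.attn(frame_queries, history, history, attn_mask=attn_mask)
        return out

layer = LocalCausalAttention()
y = layer(torch.randn(2, 6, 512), torch.randn(2, 6, 512))
print(y.shape)  # torch.Size([2, 6, 512])
```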
https://arxiv.org/abs/2309.09553