Personalized outfit recommendation remains a complex challenge, demanding both fashion compatibility understanding and trend awareness. This paper presents a novel framework that harnesses the expressive power of large language models (LLMs) for this task, mitigating their "black box" and static nature through fine-tuning and direct feedback integration. We bridge the visual-textual gap in item descriptions by employing image captioning with a Multimodal Large Language Model (MLLM). This enables the LLM to extract style and color characteristics from human-curated fashion images, forming the basis for personalized recommendations. The LLM is efficiently fine-tuned on the open-source Polyvore dataset of curated fashion images, optimizing its ability to recommend stylish outfits. A direct preference mechanism using negative examples is employed to enhance the LLM's decision-making process. This creates a self-enhancing AI feedback loop that continuously refines recommendations in line with seasonal fashion trends. Our framework is evaluated on the Polyvore dataset, demonstrating its effectiveness in two key tasks: fill-in-the-blank and complementary item retrieval. These evaluations underline the framework's ability to generate stylish, trend-aligned outfit suggestions that continuously improve through direct feedback. The evaluation results demonstrate that our proposed framework significantly outperforms the base LLM, creating more cohesive outfits. The improved performance in these tasks underscores the framework's potential to enhance the shopping experience with accurate suggestions, proving its effectiveness over vanilla LLM-based outfit generation.
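The direct preference mechanism described above is not specified in detail here; as a rough illustration of preference optimization with negative outfit examples, the sketch below implements a standard DPO-style pairwise loss. The tensor names, the `beta` value, and the assumption that per-outfit log-probabilities are already available are all illustrative, not the paper's exact formulation.

```python
# Minimal sketch of a DPO-style preference loss over outfit completions.
# Assumes summed token log-probabilities of a "chosen" (human-curated) outfit
# and a "rejected" (negative) outfit under both the fine-tuned policy and a
# frozen reference model. Illustrative only; the paper's objective may differ.
import torch
import torch.nn.functional as F

def dpo_preference_loss(policy_chosen_logps, policy_rejected_logps,
                        ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Minimized when the policy favors the curated outfit over the negative one.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with random log-probabilities for a batch of four prompts.
loss = dpo_preference_loss(*[torch.randn(4) for _ in range(4)])
print(loss.item())
```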
https://arxiv.org/abs/2409.12150
Audio-text contrastive models have become a powerful approach in music representation learning. Despite their empirical success, however, little is known about the influence of key design choices on the quality of music-text representations learnt through this framework. In this work, we expose these design choices within the constraints of limited data and computation budgets, and establish a more solid understanding of their impact grounded in empirical observations along three axes: the choice of base encoders, the level of curation in training data, and the use of text augmentation. We find that data curation is the single most important factor for music-text contrastive training in resource-constrained scenarios. Motivated by this insight, we introduce two novel techniques, Augmented View Dropout and TextSwap, which increase the diversity and descriptiveness of text inputs seen in training. Through our experiments we demonstrate that these are effective at boosting performance across different pre-training regimes, model architectures, and downstream data distributions, without incurring higher computational costs or requiring additional training data.
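Augmented View Dropout and TextSwap are not defined in the abstract; the sketch below only illustrates the general idea of diversifying text inputs for contrastive training by randomly dropping caption views and swapping descriptors. The tag vocabulary and the policies are assumptions for the example, not the paper's definitions.

```python
# Illustrative text-side augmentation for audio-text contrastive training.
# The swap vocabulary and dropout policy are made up for this example.
import random

TAG_SWAPS = {
    "guitar": ["electric guitar", "acoustic guitar"],
    "upbeat": ["energetic", "lively"],
}

def augmented_view_dropout(views, keep_prob=0.7):
    """Randomly drop some text views (caption, tag string, metadata),
    always keeping at least one so every clip still has paired text."""
    kept = [v for v in views if random.random() < keep_prob]
    return kept or [random.choice(views)]

def text_swap(caption):
    """Swap known descriptors for alternatives to diversify wording."""
    for word, alternatives in TAG_SWAPS.items():
        if word in caption and random.random() < 0.5:
            caption = caption.replace(word, random.choice(alternatives))
    return caption

views = ["an upbeat rock track with guitar", "genre: rock, mood: upbeat"]
print([text_swap(v) for v in augmented_view_dropout(views)])
```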
https://arxiv.org/abs/2409.11498
Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute) and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2m) rather than a position described using natural language (e.g., "next to me"). To address these gaps, we present ELSA, a spatially aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open-vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio, using contrastive learning. ELSA is competitive with the state of the art for both semantic retrieval and 3D source localization. In particular, ELSA achieves a mean audio-to-text and text-to-audio R@1 that is 2.8% above the baseline, and a mean absolute error in 3D source localization that is 11.6° lower than the baseline.
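The spatial augmentation of captions can be pictured with a small helper that turns a simulated source direction into the kind of natural-language phrase the abstract mentions; the angle bins and wording below are assumptions for illustration, not ELSA's actual caption templates.

```python
# Map a source azimuth (degrees; 0 = straight ahead, increasing to the
# listener's left) to a coarse spatial phrase. Bins and phrasing are assumed.
def azimuth_to_phrase(azimuth_deg: float) -> str:
    a = azimuth_deg % 360
    if a <= 30 or a >= 330:
        return "in front of me"
    if 30 < a < 150:
        return "to my left"
    if 150 <= a <= 210:
        return "behind me"
    return "to my right"

print(azimuth_to_phrase(185))  # -> "behind me"
```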
https://arxiv.org/abs/2409.11369
Reconstructing 3D visuals from functional Magnetic Resonance Imaging (fMRI) data, introduced as Recon3DMind in our conference work, is of significant interest to both cognitive neuroscience and computer vision. To advance this task, we present the fMRI-3D dataset, which includes data from 15 participants and showcases a total of 4768 3D objects. The dataset comprises two components: fMRI-Shape, previously introduced and accessible at this https URL, and fMRI-Objaverse, proposed in this paper and available at this https URL. fMRI-Objaverse includes data from 5 subjects, 4 of whom are also part of the Core set in fMRI-Shape, with each subject viewing 3142 3D objects across 117 categories, all accompanied by text captions. This significantly enhances the diversity and potential applications of the dataset. Additionally, we propose MinD-3D, a novel framework designed to decode 3D visual information from fMRI signals. The framework first extracts and aggregates features from fMRI data using a neuro-fusion encoder, then employs a feature-bridge diffusion model to generate visual features, and finally reconstructs the 3D object using a generative transformer decoder. We establish new benchmarks by designing metrics at both semantic and structural levels to evaluate model performance. Furthermore, we assess our model's effectiveness in an Out-of-Distribution setting and analyze the attribution of the extracted features and the visual ROIs in fMRI signals. Our experiments demonstrate that MinD-3D not only reconstructs 3D objects with high semantic and spatial accuracy but also deepens our understanding of how the human brain processes 3D visual information. Project page at: this https URL.
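The three-stage decoding pipeline can be summarized with a small skeleton showing only the data flow (fMRI signals → fused features → visual features → 3D tokens); every submodule below is a simple stand-in, not the actual neuro-fusion encoder, diffusion model, or transformer decoder.

```python
# Data-flow skeleton of the decoding pipeline; all submodules are stand-ins.
import torch
import torch.nn as nn

class MinD3DStyleSkeleton(nn.Module):
    def __init__(self, fmri_dim=4096, feat_dim=512, num_tokens=256):
        super().__init__()
        self.neuro_fusion_encoder = nn.Sequential(nn.Linear(fmri_dim, feat_dim), nn.GELU())
        self.feature_bridge = nn.Linear(feat_dim, feat_dim)   # stand-in for the diffusion model
        self.shape_decoder = nn.Linear(feat_dim, num_tokens)  # stand-in for the generative transformer

    def forward(self, fmri):
        fused = self.neuro_fusion_encoder(fmri)   # extract and aggregate fMRI features
        visual = self.feature_bridge(fused)       # generate visual features
        return self.shape_decoder(visual)         # predict tokens describing the 3D object

print(MinD3DStyleSkeleton()(torch.randn(2, 4096)).shape)  # torch.Size([2, 256])
```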
https://arxiv.org/abs/2409.11315
Music-text multimodal systems have enabled new approaches to Music Information Research (MIR) applications such as audio-to-text and text-to-audio retrieval, text-based song generation, and music captioning. Despite the reported success, little effort has been put into evaluating the musical knowledge of Large Language Models (LLM). In this paper, we demonstrate that LLMs suffer from 1) prompt sensitivity, 2) inability to model negation (e.g. 'rock song without guitar'), and 3) sensitivity towards the presence of specific words. We quantified these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. We leveraged the Audioset ontology to generate triplets consisting of an anchor, a positive (relevant) label, and a negative (less relevant) label for the genre and instruments sub-tree. We evaluated the triplet-based musical knowledge for six general-purpose Transformer-based models. The triplets obtained through this methodology required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
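The triplet-based accuracy can be computed with a few lines: given an embedding function for label strings, count how often the anchor is closer to the positive label than to the negative one. The `embed` argument below is a placeholder for whichever model is being probed; the dummy embedder in the usage example is random and only shows the interface.

```python
# Triplet-based accuracy over (anchor, positive, negative) label strings.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def triplet_accuracy(triplets, embed):
    """Fraction of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = 0
    for anchor, positive, negative in triplets:
        ea, ep, en = embed(anchor), embed(positive), embed(negative)
        correct += cosine(ea, ep) > cosine(ea, en)
    return correct / len(triplets)

# Usage with a random placeholder embedder; swap in the model under test.
rng = np.random.default_rng(0)
dummy_embed = lambda text: rng.standard_normal(16)
triplets = [("rock", "heavy metal", "string quartet")]
print(triplet_accuracy(triplets, dummy_embed))
```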
https://arxiv.org/abs/2409.11449
Audio language models can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio language models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although these pre-trained components were developed to support multiple languages, audio-language models are trained predominantly on English data, which may limit their usability to only English instructions or English speech inputs. First, this paper examines the performance of existing audio language models in an underserved language, using Thai as an example. This paper demonstrates that, despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities in low-resource languages. Second, this paper studies data mixture for developing audio language models that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixture for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio language models by a considerable margin, and it is comparable to the state-of-the-art Gemini-1.5-Pro in both English and Thai.
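The data-mixture study can be pictured as weighted sampling over task and language buckets during instruction tuning; the sketch below shows a generic mixture sampler in which the bucket names and weights are placeholders, not the ratios investigated in the paper.

```python
# Generic weighted sampler over data buckets for instruction tuning.
# Bucket names and weights are hypothetical placeholders.
import random

def sample_batch(buckets, weights, batch_size):
    names = list(buckets)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        bucket = random.choices(names, weights=probs, k=1)[0]
        batch.append(random.choice(buckets[bucket]))
    return batch

buckets = {
    "en_audio_instructions": ["<english audio-instruction example>"],
    "th_audio_instructions": ["<thai audio-instruction example>"],
    "th_speech_following": ["<thai speech instruction-following example>"],
}
weights = {"en_audio_instructions": 0.4, "th_audio_instructions": 0.4, "th_speech_following": 0.2}
print(sample_batch(buckets, weights, batch_size=4))
```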
https://arxiv.org/abs/2409.10999
Exploring the narratives conveyed by fine-art paintings is a challenge in image captioning, where the goal is to generate descriptions that not only precisely represent the visual content but also offer an in-depth interpretation of the artwork's meaning. The task is particularly complex for artwork images due to their diverse interpretations and varied aesthetic principles across different artistic schools and styles. In response to this, we present KALE (Knowledge-Augmented vision-Language model for artwork Elaborations), a novel approach that enhances existing vision-language models by integrating artwork metadata as additional knowledge. KALE incorporates the metadata in two ways: firstly as direct textual input, and secondly through a multimodal heterogeneous knowledge graph. To optimize the learning of graph representations, we introduce a new cross-modal alignment loss that maximizes the similarity between the image and its corresponding metadata. Experimental results demonstrate that KALE achieves strong performance (when evaluated with CIDEr, in particular) over existing state-of-the-art work across several artwork datasets. Source code of the project is available at this https URL.
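A common way to realize a loss that maximizes image-metadata similarity is a symmetric InfoNCE objective over the batch; the sketch below assumes precomputed embeddings of matching shape and is not necessarily KALE's exact formulation.

```python
# Symmetric contrastive alignment between image and metadata embeddings.
# Shapes are assumed (batch x dim); KALE's exact loss may differ.
import torch
import torch.nn.functional as F

def cross_modal_alignment_loss(img_emb, meta_emb, tau=0.07):
    img = F.normalize(img_emb, dim=-1)
    meta = F.normalize(meta_emb, dim=-1)
    logits = img @ meta.t() / tau        # pairwise similarities
    targets = torch.arange(img.size(0))  # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

print(cross_modal_alignment_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```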
https://arxiv.org/abs/2409.10921
Egocentric videos provide a unique perspective into individuals' daily experiences, yet their unstructured nature presents challenges for perception. In this paper, we introduce AMEGO, a novel approach aimed at enhancing the comprehension of very-long egocentric videos. Inspired by humans' ability to retain information from a single viewing, AMEGO focuses on constructing a self-contained representation from one egocentric video, capturing key locations and object interactions. This representation is semantic-free and facilitates multiple queries without the need to reprocess the entire visual content. Additionally, to evaluate our understanding of very-long egocentric videos, we introduce the new Active Memories Benchmark (AMB), composed of more than 20K highly challenging visual queries from EPIC-KITCHENS. These queries cover different levels of video reasoning (sequencing, concurrency and temporal grounding) to assess detailed video understanding capabilities. We showcase improved performance of AMEGO on AMB, surpassing other video QA baselines by a substantial margin.
https://arxiv.org/abs/2409.10917
Implicit Neural Representation (INR), leveraging a neural network to transform coordinate input into corresponding attributes, has recently driven significant advances in several vision-related domains. However, the performance of INR is heavily influenced by the choice of the nonlinear activation function used in its multilayer perceptron (MLP) architecture. Multiple nonlinearities have been investigated; yet, current INRs face limitations in capturing high-frequency components, representing diverse signal types, and handling inverse problems. We have identified that these problems can be greatly alleviated by introducing a paradigm shift in INRs. We find that an architecture with learnable activations in initial layers can represent fine details in the underlying signals. Specifically, we propose SL$^{2}$A-INR, a hybrid network for INR with a single-layer learnable activation function, boosting the effectiveness of traditional ReLU-based MLPs. Our method performs superior across diverse tasks, including image representation, 3D shape reconstructions, inpainting, single image super-resolution, CT reconstruction, and novel view synthesis. Through comprehensive experiments, SL$^{2}$A-INR sets new benchmarks in accuracy, quality, and convergence rates for INR.
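To make the single learnable-activation layer concrete, the sketch below builds an INR whose first layer applies a learnable mixture of sinusoidal basis functions before an otherwise plain ReLU MLP; the choice of basis is an assumption for illustration and may not match SL$^{2}$A-INR's actual parameterization.

```python
# Minimal INR with a learnable activation in the first layer only.
# The activation is a learnable mixture of sines (an assumed basis).
import torch
import torch.nn as nn

class LearnableActivation(nn.Module):
    def __init__(self, num_basis=8):
        super().__init__()
        self.freqs = nn.Parameter(2.0 ** torch.arange(num_basis).float())
        self.coeffs = nn.Parameter(torch.randn(num_basis) / num_basis)

    def forward(self, x):
        # sum_k c_k * sin(f_k * x), applied element-wise
        return torch.sin(x.unsqueeze(-1) * self.freqs).mul(self.coeffs).sum(-1)

class SingleLayerLearnableINR(nn.Module):
    def __init__(self, in_dim=2, hidden=256, out_dim=3, depth=4):
        super().__init__()
        self.first = nn.Linear(in_dim, hidden)
        self.learnable_act = LearnableActivation()
        layers = []
        for _ in range(depth - 1):
            layers += [nn.Linear(hidden, hidden), nn.ReLU()]
        self.mlp = nn.Sequential(*layers, nn.Linear(hidden, out_dim))

    def forward(self, coords):
        return self.mlp(self.learnable_act(self.first(coords)))

rgb = SingleLayerLearnableINR()(torch.rand(1024, 2))  # 1024 pixel coordinates -> RGB
print(rgb.shape)  # torch.Size([1024, 3])
```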
https://arxiv.org/abs/2409.10836
Latent diffusion models have shown promising results in text-to-audio (T2A) generation tasks, yet previous models have encountered difficulties in generation quality, computational cost, diffusion sampling, and data preparation. In this paper, we introduce EzAudio, a transformer-based T2A diffusion model, to handle these challenges. Our approach includes several key innovations: (1) We build the T2A model on the latent space of a 1D waveform Variational Autoencoder (VAE), avoiding the complexities of handling 2D spectrogram representations and the need for an additional neural vocoder. (2) We design an optimized diffusion transformer architecture specifically tailored for audio latent representations and diffusion modeling, which enhances convergence speed, training stability, and memory usage, making the training process easier and more efficient. (3) To tackle data scarcity, we adopt a data-efficient training strategy that leverages unlabeled data for learning acoustic dependencies, audio caption data annotated by audio-language models for text-to-audio alignment learning, and human-labeled data for fine-tuning. (4) We introduce a classifier-free guidance (CFG) rescaling method that achieves strong prompt alignment while preserving audio quality at larger CFG scores, eliminating the need to search for the optimal CFG score to balance this trade-off. EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations, delivering realistic listening experiences while maintaining a streamlined model structure, low training costs, and an easy-to-follow training pipeline. Code, data, and pre-trained models are released at: this https URL.
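The CFG rescaling in (4) is in the spirit of the widely used rescale trick that pulls the guided prediction back towards the statistics of the conditional prediction; a generic version is sketched below, and EzAudio's exact method may differ.

```python
# Classifier-free guidance with rescaling for one diffusion denoising step.
# Generic CFG-rescale formulation; not necessarily EzAudio's exact method.
import torch

def cfg_rescaled(pred_cond, pred_uncond, guidance_scale=7.0, rescale=0.7):
    guided = pred_uncond + guidance_scale * (pred_cond - pred_uncond)
    dims = tuple(range(1, pred_cond.dim()))
    std_cond = pred_cond.std(dim=dims, keepdim=True)
    std_guided = guided.std(dim=dims, keepdim=True)
    # Re-match the conditional prediction's scale, then blend with the raw guess.
    rescaled = guided * (std_cond / (std_guided + 1e-8))
    return rescale * rescaled + (1.0 - rescale) * guided

print(cfg_rescaled(torch.randn(2, 8, 128), torch.randn(2, 8, 128)).shape)
```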
https://arxiv.org/abs/2409.10819
Vision language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1 (e) shows a beer with a feather-like texture. This requires advanced reasoning to deduce that this atypical representation signifies the beer's lightness. We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs' understanding of atypicality in persuasive images. We evaluate how well VLMs use atypicality to infer an ad's message and test their reasoning abilities by employing semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements. Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.
https://arxiv.org/abs/2409.10719
We introduce Playground v3 (PGv3), our latest text-to-image model that achieves state-of-the-art (SoTA) performance across multiple testing benchmarks, excels in graphic design abilities and introduces new capabilities. Unlike traditional text-to-image generative models that rely on pre-trained language models like T5 or CLIP text encoders, our approach fully integrates Large Language Models (LLMs) with a novel structure that leverages text conditions exclusively from a decoder-only LLM. Additionally, to enhance image captioning quality, we developed an in-house captioner capable of generating captions with varying levels of detail, enriching the diversity of text structures. We also introduce a new benchmark, CapsBench, to evaluate detailed image captioning performance. Experimental results demonstrate that PGv3 excels in text prompt adherence, complex reasoning, and accurate text rendering. User preference studies indicate the super-human graphic design ability of our model for common design applications, such as stickers, posters, and logo designs. Furthermore, PGv3 introduces new capabilities, including precise RGB color control and robust multilingual understanding.
https://arxiv.org/abs/2409.10695
Text-To-Music (TTM) models have recently revolutionized the automatic music generation research field, reaching performance superior to all previous state-of-the-art models while lowering the technical proficiency needed to use them. For these reasons, they have quickly been adopted for commercial uses and music production practices. This widespread diffusion of TTMs raises several concerns regarding copyright violation and rightful attribution, calling for serious consideration by the audio forensics community. In this paper, we tackle the problem of detection and attribution of TTM-generated data. We propose FakeMusicCaps, a dataset that contains several versions of the music-caption pairs dataset MusicCaps re-generated via several state-of-the-art TTM techniques. We evaluate the proposed dataset by performing initial experiments on the detection and attribution of TTM-generated audio.
https://arxiv.org/abs/2409.10684
Video annotation is a critical and time-consuming task in computer vision research and applications. This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process. Our approach uses Hierarchical Stochastic Neighbor Embedding (HSNE) to create a multi-scale representation of video features, allowing annotators to efficiently explore and label large video datasets. We demonstrate significant improvements in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in clicks required for annotating over 12 hours of video. Our experiments on multiple datasets show the effectiveness and robustness of our pipeline across various scenarios. Moreover, we investigate the optimal configuration of HSNE parameters for different datasets. Our work provides a promising direction for scaling up video annotation efforts in the era of video understanding.
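The core of the pipeline — embedding pre-extracted per-frame features into a low-dimensional map so similar frames can be labeled in bulk — can be sketched as below; scikit-learn's flat t-SNE is used here only as a stand-in for the hierarchical HSNE used in the paper, and the feature array is a random placeholder.

```python
# Embed pre-extracted per-frame features for cluster-based bulk annotation.
# t-SNE is a stand-in for HSNE; the features below are a random placeholder.
import numpy as np
from sklearn.manifold import TSNE

frame_features = np.random.rand(2000, 512).astype(np.float32)
embedding = TSNE(n_components=2, perplexity=30).fit_transform(frame_features)
# An annotation UI would display this 2D map; selecting a cluster assigns the
# same temporal label to every frame inside the selection.
print(embedding.shape)  # (2000, 2)
```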
https://arxiv.org/abs/2409.10641
Contrastive pretraining can substantially increase model generalisation and downstream performance. However, the quality of the learned representations is highly dependent on the data augmentation strategy applied to generate positive pairs. Positive contrastive pairs should preserve semantic meaning while discarding unwanted variations related to the data acquisition domain. Traditional contrastive pipelines attempt to simulate domain shifts through pre-defined generic image transformations. However, these do not always mimic realistic and relevant domain variations for medical imaging such as scanner differences. To tackle this issue, we herein introduce counterfactual contrastive learning, a novel framework leveraging recent advances in causal image synthesis to create contrastive positive pairs that faithfully capture relevant domain variations. Our method, evaluated across five datasets encompassing both chest radiography and mammography data, for two established contrastive objectives (SimCLR and DINO-v2), outperforms standard contrastive learning in terms of robustness to acquisition shift. Notably, counterfactual contrastive learning achieves superior downstream performance on both in-distribution and on external datasets, especially for images acquired with scanners under-represented in the training set. Further experiments show that the proposed framework extends beyond acquisition shifts, with models trained with counterfactual contrastive learning substantially improving subgroup performance across biological sex.
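The key change over standard contrastive pipelines is where the positive view comes from; the sketch below assumes access to a counterfactual generator (here a trivial placeholder) that re-renders an image under a different acquisition domain and uses that counterfactual as the positive in an NT-Xent loss.

```python
# Counterfactual positives for contrastive pre-training (SimCLR-style loss).
# `generate_counterfactual` is a trivial placeholder for a pretrained causal
# image-synthesis model; the loss and encoder below are generic.
import torch
import torch.nn.functional as F

def generate_counterfactual(images, target_scanner):
    # Placeholder: simulate an acquisition-style change with an intensity shift.
    return (images * 0.9 + 0.05).clamp(0, 1)

def nt_xent(z1, z2, tau=0.1):
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def counterfactual_contrastive_step(encoder, images, target_scanner=1):
    # Positive view = the same image re-rendered for a different scanner, so the
    # encoder learns to ignore acquisition shift while keeping anatomy.
    return nt_xent(encoder(images), encoder(generate_counterfactual(images, target_scanner)))

encoder = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 128))
print(counterfactual_contrastive_step(encoder, torch.rand(8, 1, 32, 32)).item())
```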
https://arxiv.org/abs/2409.10365
The SoccerNet 2024 challenges represent the fourth annual video understanding challenges organized by the SoccerNet team. These challenges aim to advance research across multiple themes in football, including broadcast video understanding, field understanding, and player understanding. This year, the challenges encompass four vision-based tasks. (1) Ball Action Spotting, focusing on precisely localizing when and which soccer actions related to the ball occur, (2) Dense Video Captioning, focusing on describing the broadcast with natural language and anchored timestamps, (3) Multi-View Foul Recognition, a novel task focusing on analyzing multiple viewpoints of a potential foul incident to classify whether a foul occurred and assess its severity, (4) Game State Reconstruction, another novel task focusing on reconstructing the game state from broadcast videos onto a 2D top-view map of the field. Detailed information about the tasks, challenges, and leaderboards can be found at this https URL, with baselines and development kits available at this https URL.
https://arxiv.org/abs/2409.10587
Singing voice synthesis and conversion have emerged as significant subdomains of voice generation, creating substantial demand for prompt-conditioned generation. Unlike common voice data, generating a singing voice requires an understanding of various associated vocal and musical characteristics, such as the vocal tone of the singer or emotional expressions. However, existing open-source audio-text datasets for voice generation tend to capture only a very limited range of attributes, often missing musical characteristics of the audio. To fill this gap, we introduce S2Cap, an audio-text pair dataset with a diverse set of attributes. S2Cap consists of pairs of textual prompts and music audio samples with a wide range of vocal and musical attributes, including pitch, volume, tempo, mood, singer's gender and age, and musical genre and emotional expression. Utilizing S2Cap, we suggest an effective novel baseline algorithm for singing style captioning, a task we propose here as the counterpart of voice generation: generating text descriptions of vocal characteristics. First, to mitigate the misalignment between the audio encoder and the text decoder, we present a novel mechanism called CRESCENDO, which utilizes positive-pair similarity learning to synchronize the embedding space of a pretrained audio encoder with that of a text encoder. We additionally supervise the model using the singer's voice, demixed from the accompaniment. This supervision allows the model to more accurately capture vocal characteristics, leading to improved singing style captions that better reflect the style of the singer. The dataset and code are available at this https URL.
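CRESCENDO is described only as positive-pair similarity learning that synchronizes the audio encoder's embedding space with a text encoder's; the sketch below shows that kind of alignment objective with an assumed projection head and precomputed embeddings, not the paper's exact design.

```python
# Positive-pair similarity learning: pull projected audio embeddings towards
# their paired text embeddings. Dimensions and the projection head are assumed.
import torch
import torch.nn.functional as F

class AudioToTextAligner(torch.nn.Module):
    def __init__(self, audio_dim=768, text_dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(audio_dim, text_dim)  # map audio into the text space

    def forward(self, audio_emb, text_emb):
        a = F.normalize(self.proj(audio_emb), dim=-1)
        t = F.normalize(text_emb, dim=-1)
        return (1.0 - (a * t).sum(-1)).mean()  # cosine distance of positive pairs

print(AudioToTextAligner()(torch.randn(4, 768), torch.randn(4, 512)).item())
```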
https://arxiv.org/abs/2409.09866
Vision-language models (VLMs) such as CLIP are trained via contrastive learning between text and image pairs, resulting in aligned image and text embeddings that are useful for many downstream tasks. A notable drawback of CLIP, however, is that the resulting embedding space seems to lack some of the structure of its purely text-based alternatives. For instance, while text embeddings have long been noted to satisfy analogies in embedding space via vector arithmetic, CLIP has no such property. In this paper, we propose an approach to natively train CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that differences in image embedding space correspond to text descriptions of the image differences, which we synthetically generate with large language models on image-caption paired datasets. We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute (e.g., elephants are larger than cats), which is useful in retrieval or constructing attribute-based classifiers, and improved zero-shot classification performance on many downstream image classification tasks. In addition, our approach enables a new mechanism for inference that we refer to as comparative prompting, where we leverage prior knowledge of text descriptions of differences between classes of interest, achieving even larger performance gains in classification. Finally, we illustrate that the resulting embeddings obey a larger degree of geometric properties in embedding space, such as in text-to-image generation.
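The fine-tuning idea can be pictured as aligning the difference of two image embeddings with the embedding of a text describing that difference; the minimal sketch below uses a simple cosine objective over precomputed embeddings and is not necessarily the paper's exact loss.

```python
# Align (embedding(image_A) - embedding(image_B)) with the embedding of a text
# describing how A differs from B. Simple cosine objective; illustrative only.
import torch
import torch.nn.functional as F

def difference_alignment_loss(img_a_emb, img_b_emb, diff_text_emb):
    diff = F.normalize(img_a_emb - img_b_emb, dim=-1)
    text = F.normalize(diff_text_emb, dim=-1)
    return (1.0 - (diff * text).sum(-1)).mean()

print(difference_alignment_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 512)).item())
```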
https://arxiv.org/abs/2409.09721
Recently, biological perception has been a powerful tool for handling the camouflaged object detection (COD) task. However, most existing methods are heavily dependent on the local spatial information of diverse scales from convolutional operations to optimize initial features. A commonly neglected point in these methods is the long-range dependencies between feature pixels from different scale spaces that can help the model build a global structure of the object, inducing a more precise image representation. In this paper, we propose a novel Global-Local Collaborative Optimization Network, called GLCONet. Technically, we first design a collaborative optimization strategy from the perspective of multi-source perception to simultaneously model the local details and global long-range relationships, which can provide features with abundant discriminative information to boost the accuracy in detecting camouflaged objects. Furthermore, we introduce an adjacent reverse decoder that contains cross-layer aggregation and reverse optimization to integrate complementary information from different levels for generating high-quality representations. Extensive experiments demonstrate that the proposed GLCONet method with different backbones can effectively activate potentially significant pixels in an image, outperforming twenty state-of-the-art methods on three public COD datasets. The source code is available at this https URL.
https://arxiv.org/abs/2409.09588
The success of Vision Language Models (VLMs) on various vision-language tasks heavily relies on pre-training with large scale web-crawled datasets. However, the noisy and incomplete nature of web data makes dataset scale crucial for performance, rendering end-to-end training increasingly prohibitive. In this paper, we propose NEVLP, a noise-robust framework for efficient vision-language pre-training that requires less pre-training data. Specifically, we bridge the modality gap between a frozen image encoder and a large language model with a transformer and introduce two innovative learning strategies: noise-adaptive learning and concept-enhanced learning to mitigate the impact of noise. In noise-adaptive learning, we estimate the noise probability of each image-text pair based on the transformer's memorization effect and employ noise-adaptive regularization on image-text contrastive learning to condition cross-modal alignment. In concept-enhanced learning, we enrich incomplete text by incorporating visual concepts (objects in the image) to provide prior information about existing objects for image-text matching and image-grounded text generation, thereby mitigating text incompletion. Our framework effectively utilizes noisy web data and achieves state-of-the-art performance with less pre-training data across a wide range of vision-language tasks, including image-text retrieval, image captioning, and visual question answering.
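Noise-adaptive learning can be sketched as estimating a per-pair noise probability and down-weighting the contrastive term accordingly; the proxy below (each pair's loss relative to the batch average) is an illustrative stand-in for the paper's memorization-based estimate.

```python
# Noise-adaptive weighting of an image-text contrastive loss. The per-pair loss
# is used as a crude noise proxy; the paper's memorization-based estimate differs.
import torch
import torch.nn.functional as F

def noise_adaptive_contrastive(img_emb, txt_emb, tau=0.07):
    img, txt = F.normalize(img_emb, dim=-1), F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau
    targets = torch.arange(img.size(0))
    per_pair = 0.5 * (F.cross_entropy(logits, targets, reduction="none") +
                      F.cross_entropy(logits.t(), targets, reduction="none"))
    with torch.no_grad():
        # Higher loss -> more likely a mismatched web pair -> lower weight.
        noise_prob = torch.sigmoid(per_pair - per_pair.mean())
        weights = 1.0 - noise_prob
    return (weights * per_pair).sum() / weights.sum()

print(noise_adaptive_contrastive(torch.randn(16, 256), torch.randn(16, 256)).item())
```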
https://arxiv.org/abs/2409.09582