In this paper, we explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks. We hypothesize that the latent representation learned by a pre-trained generative T2V model encapsulates rich semantics and coherent temporal correspondences, thereby naturally facilitating video understanding. Our hypothesis is validated through the classic referring video object segmentation (R-VOS) task. We introduce a novel framework, termed ``VD-IT'', with dedicatedly designed components built upon a fixed pre-trained T2V model. Specifically, VD-IT uses textual information as a conditional input, ensuring semantic consistency across time for precise temporal instance matching. It further incorporates image tokens as supplementary textual inputs, enriching the feature set to generate detailed and nuanced masks. Besides, instead of using standard Gaussian noise, we propose to predict video-specific noise with an extra noise-prediction module, which helps preserve feature fidelity and elevates segmentation quality. Through extensive experiments, we surprisingly observe that fixed generative T2V diffusion models, unlike commonly used video backbones (e.g., Video Swin Transformer) pre-trained with discriminative image/video pretext tasks, exhibit better potential to maintain semantic alignment and temporal consistency. On existing standard benchmarks, our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods. The code will be available at \url{this https URL}
https://arxiv.org/abs/2403.12042
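A minimal sketch of the ideas described above, assuming a diffusers-style interface for a frozen T2V backbone; the module names, the feature-return flag, and the way predicted noise is injected are illustrative placeholders rather than the released VD-IT implementation.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Small trainable head that predicts video-specific noise from the latents."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Sequential(nn.Conv3d(channels, channels, 3, padding=1), nn.SiLU(),
                                 nn.Conv3d(channels, channels, 3, padding=1))

    def forward(self, latents):                      # latents: (B, C, T, H, W)
        return self.net(latents)

def extract_features(frozen_unet, vae, text_encoder, noise_head, video, prompt, t=50):
    latents = vae.encode(video)                      # per-frame video latents
    noisy = latents + noise_head(latents)            # predicted noise replaces N(0, I)
    text_emb = text_encoder(prompt)                  # text condition keeps semantics aligned over time
    with torch.no_grad():                            # the generative T2V backbone stays frozen
        feats = frozen_unet(noisy, timestep=t, encoder_hidden_states=text_emb,
                            return_features=True)
    return feats                                     # consumed by a segmentation decoder
```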
We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, which allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications. First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that a $\textit{localize-then-describe}$ approach with FlexCap can be better at open-ended object detection than a $\textit{describe-then-localize}$ approach with other VLMs. We highlight a novel characteristic of FlexCap, which is its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: this https URL.
https://arxiv.org/abs/2403.12026
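To illustrate what length conditioning means in practice, here is a hypothetical sketch of a FlexCap-style call: the model receives an image, a bounding box, and a target length, and the length token controls how detailed the description is. The prompt format and `generate` signature are assumptions for illustration, not the actual interface.

```python
def describe_region(flexcap_model, image, box, num_words):
    # box = (x1, y1, x2, y2), normalized to [0, 1]; num_words conditions verbosity
    prefix = (f"<box>{box[0]:.2f},{box[1]:.2f},{box[2]:.2f},{box[3]:.2f}</box> "
              f"<len={num_words}>")
    return flexcap_model.generate(image=image, prefix=prefix)

# The same region can be described at different information densities, e.g.:
#   describe_region(m, img, (0.10, 0.20, 0.50, 0.90), num_words=2)   # a short label
#   describe_region(m, img, (0.10, 0.20, 0.50, 0.90), num_words=15)  # a detailed caption
```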
As a cross-modal task, visual storytelling aims to automatically generate a story for an ordered image sequence. Different from the image captioning task, visual storytelling requires not only modeling the relationships between objects in the image but also mining the connections between adjacent images. Recent approaches primarily utilize either end-to-end frameworks or multi-stage frameworks to generate relevant stories, but they usually overlook latent topic information. In this paper, in order to generate a more coherent and relevant story, we propose a novel method, Topic Aware Reinforcement Network for VIsual StoryTelling (TARN-VIST). In particular, we pre-extract the topic information of stories from both visual and linguistic perspectives. Then we apply two topic-consistent reinforcement learning rewards to identify the discrepancy between the generated story and the human-labeled story, so as to refine the whole generation process. Extensive experimental results on the VIST dataset and human evaluation demonstrate that our proposed model outperforms most of the competitive models across multiple evaluation metrics.
https://arxiv.org/abs/2403.11550
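A minimal sketch of a topic-consistency reward of the kind described above: it scores how close the topic representation of a generated story is to that of the human-written reference, and can be added to the RL objective. The choice of topic encoder (LDA topic vectors, sentence embeddings, etc.) is an assumption; the paper's exact formulation may differ.

```python
import numpy as np

def topic_consistency_reward(topic_encoder, generated_story, reference_story):
    g = topic_encoder(generated_story)    # topic vector of the generated story
    r = topic_encoder(reference_story)    # topic vector of the human-labeled story
    # cosine similarity in topic space serves as the reward signal
    return float(np.dot(g, r) / (np.linalg.norm(g) * np.linalg.norm(r) + 1e-8))
```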
We explore how reconciling several foundation models (large language models and vision-language models) with a novel unified memory mechanism could tackle the challenging video understanding problem, especially capturing the long-term temporal relations in lengthy videos. In particular, the proposed multimodal agent VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, it employs tools including video segment localization and object memory querying, along with other visual foundation models, to interactively solve the task, utilizing the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average increase of 6.6% on NExT-QA and 26.0% on EgoSchema over baselines, closing the gap between open-sourced models and private counterparts including Gemini 1.5 Pro.
https://arxiv.org/abs/2403.11481
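A sketch of the unified memory described above: generic temporal event descriptions plus object-centric tracking states, each exposed as a tool the LLM agent can call. The field names and tool signatures are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class VideoMemory:
    events: list = field(default_factory=list)    # [(t_start, t_end, caption), ...]
    objects: dict = field(default_factory=dict)   # {track_id: [(t, bbox, label), ...]}

    def locate_segment(self, query, text_sim):
        """Tool: return the temporal event whose caption best matches the query."""
        return max(self.events, key=lambda e: text_sim(query, e[2]))

    def query_object(self, label):
        """Tool: return all tracks whose states carry the requested object label."""
        return {tid: states for tid, states in self.objects.items()
                if any(s[2] == label for s in states)}
```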
This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a hybrid 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and ego-centric 3D information. This combination is pivotal for interactive planning, where scene-level data supports global planning and ego-centric data is important for localization. Notably, we use ego-centric 3D frame features for feature alignment, an efficient technique that enhances the model's ability to align features of small objects within the scene. Our experiments with Scene-LLM demonstrate its strong capabilities in dense captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
https://arxiv.org/abs/2403.11401
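A minimal sketch of the projection step described above: a learned layer maps the hybrid 3D visual features into the frozen LLM's token-embedding space so they can be interleaved with text tokens. Dimensions and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SceneProjector(nn.Module):
    def __init__(self, feat_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, scene_feats, ego_feats):
        # scene_feats: (N_scene, feat_dim) scene-level 3D features
        # ego_feats:   (N_ego, feat_dim) ego-centric frame features
        tokens = self.proj(torch.cat([scene_feats, ego_feats], dim=0))
        return tokens    # consumed by the LLM as additional "visual tokens"
```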
Two approaches have emerged to input images into large language models (LLMs). The first is to caption images into natural language. The second is to map image feature embeddings into the domain of the LLM and pass the mapped embeddings directly to the LLM. The majority of recent few-shot multimodal work reports performance using architectures that employ variations of one of these two approaches, but overlooks an important comparison between them. We design a controlled and focused experiment to compare these two approaches to few-shot visual question answering (VQA) with LLMs. Our findings indicate that for Flan-T5 XL, a 3B-parameter LLM, connecting visual embeddings directly to the LLM embedding space does not guarantee improved performance over using image captions. In the zero-shot regime, we find using textual image captions is better. In the few-shot regimes, how the in-context examples are selected determines which is better.
https://arxiv.org/abs/2403.11317
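A sketch of the two interfaces being compared: (a) caption the image and place the text in the LLM prompt, versus (b) map image embeddings into the LLM's input space. The captioner, encoder, mapper, and LLM call signatures are stand-ins for illustration, not a specific library API.

```python
def vqa_via_captions(captioner, llm, image, question, examples=()):
    # (a) textual interface: in-context examples and the query image are all captions
    prompt = "".join(f"Caption: {c}\nQ: {q}\nA: {a}\n" for c, q, a in examples)
    prompt += f"Caption: {captioner(image)}\nQ: {question}\nA:"
    return llm.generate(prompt)

def vqa_via_embeddings(encoder, mapper, llm, image, question):
    # (b) embedding interface: image features are projected into the LLM token space
    visual_tokens = mapper(encoder(image))
    return llm.generate_with_prefix(visual_tokens, f"Q: {question}\nA:")
```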
Data visualization serves as a critical means for presenting data and mining its valuable insights. The task of chart summarization, through natural language processing techniques, facilitates in-depth data analysis of charts. However, existing approaches still show notable deficiencies in visual-language matching and reasoning ability. To address these limitations, this study constructs a large-scale dataset of comprehensive chart-caption pairs with fine-tuning instructions for each chart. Thanks to the broad coverage of topics and visual styles within this dataset, a better matching degree can be achieved from the perspective of the training data. Moreover, we propose an innovative chart summarization method, ChartThinker, which synthesizes deep analysis based on chains of thought and strategies of context retrieval, aiming to improve the logical coherence and accuracy of the generated summaries. Built upon the curated datasets, our trained model consistently exhibits superior performance in chart summarization tasks, surpassing 8 state-of-the-art models over 7 evaluation metrics. Our dataset and codes are publicly accessible.
https://arxiv.org/abs/2403.11236
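A sketch of chain-of-thought summarization with context retrieval in the spirit described above: retrieve similar chart-summary pairs as context, ask the model to reason step by step, then emit the final summary. The prompt wording and retriever interface are assumptions.

```python
def summarize_chart(model, retriever, chart_table, k=3):
    demos = retriever.top_k(chart_table, k)    # similar chart-caption pairs as context
    context = "\n\n".join(f"Chart: {c}\nSummary: {s}" for c, s in demos)
    prompt = (f"{context}\n\nChart: {chart_table}\n"
              "First describe the axes, trends, and notable extremes step by step, "
              "then write a concise, faithful summary.")
    return model.generate(prompt)
```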
In this work, we show that synthetic data created by generative models is complementary to computer graphics (CG) rendered data for achieving remarkable generalization performance on diverse real-world scenes for 3D human pose and shape estimation (HPS). Specifically, we propose an effective approach based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. We first collect a large-scale human-centric dataset with comprehensive annotations, e.g., text captions and surface normal images. Then, we train a customized ControlNet model upon this dataset to generate diverse human images and initial ground-truth labels. At the core of this step is that we can easily obtain numerous surface normal images from a 3D human parametric model, e.g., SMPL-X, by rendering the 3D mesh onto the image plane. As there exists inevitable noise in the initial labels, we then apply an off-the-shelf foundation segmentation model, i.e., SAM, to filter negative data samples. Our data generation pipeline is flexible and customizable to facilitate different real-world tasks, e.g., ego-centric scenes and perspective-distortion scenes. The generated dataset comprises 0.79M images with corresponding 3D annotations, covering versatile viewpoints, scenes, and human identities. We train various HPS regressors on top of the generated data and evaluate them on a wide range of benchmarks (3DPW, RICH, EgoBody, AGORA, SSP-3D) to verify the effectiveness of the generated data. By exclusively employing generative models, we generate large-scale in-the-wild human images and high-quality annotations, eliminating the need for real-world data collection.
https://arxiv.org/abs/2403.11111
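A sketch of the generation pipeline described above: render surface normals from sampled SMPL-X meshes, condition a ControlNet-style generator on them to synthesize images with paired 3D labels, then use SAM to discard samples whose person mask disagrees with the rendered body. All module interfaces are placeholders; masks are assumed to be boolean arrays.

```python
def iou(a, b):
    return (a & b).sum() / float((a | b).sum() + 1e-8)

def generate_sample(smplx_model, renderer, controlnet, sam, caption, params, iou_thr=0.5):
    mesh = smplx_model(**params)                   # sampled pose/shape -> 3D mesh
    normal_map = renderer.render_normals(mesh)     # control signal and geometry GT
    image = controlnet(prompt=caption, control=normal_map)
    pred_mask = sam.segment_person(image)          # off-the-shelf segmentation check
    gt_mask = renderer.render_silhouette(mesh)
    if iou(pred_mask, gt_mask) < iou_thr:          # noisy label -> filter the sample
        return None
    return image, params                           # image paired with 3D annotation
```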
Image-text retrieval (ITR) plays a significant role in making informed decisions for various remote sensing (RS) applications. Nonetheless, creating ITR datasets containing vision and language modalities requires not only a large geo-spatial sampling area but also varying categories and detailed descriptions. To this end, we introduce an image caption dataset LuojiaHOG, which is geospatial-aware, label-extension-friendly, and comprehensively captioned. LuojiaHOG involves hierarchical spatial sampling, a classification system extensible to Open Geospatial Consortium (OGC) standards, and detailed caption generation. In addition, we propose a CLIP-based Image Semantic Enhancement Network (CISEN) to promote sophisticated ITR. CISEN consists of two components, namely dual-path knowledge transfer and progressive cross-modal feature fusion. Comprehensive statistics on LuojiaHOG reveal its richness in sampling diversity, label quantity, and description granularity. The evaluation on LuojiaHOG is conducted across various state-of-the-art ITR models, including ALBEF, ALIGN, CLIP, FILIP, Wukong, GeoRSCLIP, and CISEN. We use second- and third-level labels to evaluate these vision-language models through adapter-tuning, and CISEN demonstrates superior performance. For instance, it achieves the highest scores, with WMAP@5 of 88.47\% and 87.28\% on the third-level ITR tasks, respectively. In particular, CISEN exhibits an improvement of approximately 1.3\% and 0.9\% in terms of WMAP@5 compared to its baseline. These findings highlight CISEN's advances in accurately retrieving pertinent information across images and text. LuojiaHOG and CISEN can serve as a foundational resource for future RS image-text alignment research, facilitating a wide range of vision-language applications.
https://arxiv.org/abs/2403.10887
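A minimal sketch of the adapter-tuning protocol used in the evaluation above: a small trainable adapter sits on top of frozen CLIP-style features and is optimized for retrieval while the backbone stays fixed. The dimensions and residual ratio are illustrative assumptions, not the values used in the paper.

```python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=512, hidden=128, alpha=0.2):
        super().__init__()
        self.alpha = alpha
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):
        # residual blend of adapted and original (frozen) features
        return self.alpha * self.net(x) + (1 - self.alpha) * x

# One adapter per modality is trained on image/text features from the frozen
# encoders; retrieval uses cosine similarity between the adapted features.
```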
Internet memes have become a powerful means for individuals to express emotions, thoughts, and perspectives on social media. While often considered a source of humor and entertainment, memes can also disseminate hateful content targeting individuals or communities. Most existing research focuses on the negative aspects of memes in high-resource languages, overlooking the distinctive challenges associated with low-resource languages like Bengali (also known as Bangla). Furthermore, while previous work on Bengali memes has focused on detecting hateful memes, there has been no work on detecting their targeted entities. To bridge this gap and facilitate research in this arena, we introduce a novel multimodal dataset for Bengali, BHM (Bengali Hateful Memes). The dataset consists of 7,148 memes with Bengali as well as code-mixed captions, tailored for two tasks: (i) detecting hateful memes, and (ii) detecting the social entities they target (i.e., Individual, Organization, Community, and Society). To solve these tasks, we propose DORA (Dual cO attention fRAmework), a multimodal deep neural network that systematically extracts the significant modality features from the memes and jointly evaluates them with the modality-specific features to understand the context better. Our experiments show that DORA generalizes to other low-resource hateful meme datasets and outperforms several state-of-the-art baselines.
https://arxiv.org/abs/2403.10829
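A sketch of a dual co-attention block of the kind DORA describes: text queries attend over image features, image queries attend over text features, and both attended streams are fused with the modality-specific features. Hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn

class DualCoAttention(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.t2i = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.i2t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(4 * dim, dim)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, Lt, dim); image_feats: (B, Li, dim)
        t_att, _ = self.t2i(text_feats, image_feats, image_feats)   # text attends to image
        i_att, _ = self.i2t(image_feats, text_feats, text_feats)    # image attends to text
        pooled = torch.cat([t_att.mean(1), i_att.mean(1),
                            text_feats.mean(1), image_feats.mean(1)], dim=-1)
        return self.fuse(pooled)    # joint representation for the classification heads
```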
Long-form video understanding represents a significant challenge within computer vision, demanding a model capable of reasoning over long multi-modal sequences. Motivated by the human cognitive process for long-form video understanding, we emphasize interactive reasoning and planning over the ability to process lengthy visual inputs. We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to iteratively identify and compile crucial information to answer a question, with vision-language foundation models serving as tools to translate and retrieve visual information. Evaluated on the challenging EgoSchema and NExT-QA benchmarks, VideoAgent achieves 54.1% and 71.3% zero-shot accuracy with only 8.4 and 8.2 frames used on average. These results demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods, highlighting the potential of agent-based approaches in advancing long-form video understanding.
https://arxiv.org/abs/2403.10517
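A sketch of the iterative loop described above: the LLM agent looks at a few captioned frames, decides whether it can answer, and otherwise asks a retrieval tool for more relevant frames. All tool interfaces are assumptions for illustration.

```python
def answer_question(llm, captioner, retriever, video, question, max_rounds=5):
    frames = retriever.uniform_sample(video, n=5)          # cheap initial glance
    for _ in range(max_rounds):
        captions = [captioner(f) for f in frames]
        decision = llm.decide(question, captions)          # answer, or state what is missing
        if decision.confident:
            return decision.answer
        frames += retriever.search(video, decision.missing_info, n=3)
    return llm.decide(question, [captioner(f) for f in frames]).answer
```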
Pre-training image representations from the raw text about images enables zero-shot vision transfer to downstream tasks. Through pre-training on millions of samples collected from the internet, multimodal foundation models, such as CLIP, produce state-of-the-art zero-shot results that often reach competitiveness with fully supervised methods without the need for task-specific training. Besides the encouraging performance on classification accuracy, it is reported that these models close the robustness gap by matching the performance of supervised models trained on ImageNet under natural distribution shift. Because robustness is critical to real-world applications, especially safety-critical ones, in this paper we present a comprehensive evaluation based on a large-scale robustness benchmark covering 7 natural and 3 synthetic distribution shifts, and 11 adversarial attacks. We use CLIP as a pilot study. We show that CLIP leads to a significant robustness drop compared to supervised ImageNet models on our benchmark, especially under synthetic distribution shift and adversarial attacks. Furthermore, data overlap analysis suggests that the observed robustness under natural distribution shifts could be attributed, at least in part, to data overlap. In summary, our evaluation shows that a comprehensive evaluation of robustness is necessary and that there is a significant need to improve the robustness of zero-shot multimodal models.
https://arxiv.org/abs/2403.10499
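A minimal sketch of the kind of zero-shot evaluation run in such robustness studies: classify a distribution-shifted test set with CLIP text prompts and compare accuracy against the in-distribution number. This uses the openai/CLIP package; the dataloader is assumed to yield CLIP-preprocessed image tensors and integer labels aligned with `class_names`.

```python
import torch
import clip

def zero_shot_accuracy(class_names, loader, device="cuda"):
    model, _ = clip.load("ViT-B/32", device=device)
    text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    correct = total = 0
    with torch.no_grad():
        text_feats = model.encode_text(text)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        for images, labels in loader:          # e.g., ImageNet-R or ImageNet-Sketch
            img_feats = model.encode_image(images.to(device))
            img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
            pred = (img_feats @ text_feats.T).argmax(dim=-1).cpu()
            correct += (pred == labels).sum().item()
            total += labels.numel()
    return correct / total
```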
Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that convert low-resolution mel-spectrograms to audio, upsamples to high-resolution audio via bandwidth expansion, and upmixes to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find our approach yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at this https URL.
https://arxiv.org/abs/2403.10493
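A sketch of the three-stage cascade described above, with each stage represented by a placeholder module standing in for the corresponding GAN generator.

```python
import torch.nn as nn

class MusicHiFiCascade(nn.Module):
    def __init__(self, mel_to_wav, bandwidth_ext, mono_to_stereo):
        super().__init__()
        self.mel_to_wav = mel_to_wav          # stage 1: low-res mel-spectrogram -> mono audio
        self.bandwidth_ext = bandwidth_ext    # stage 2: low sample rate -> high sample rate
        self.mono_to_stereo = mono_to_stereo  # stage 3: downmix-compatible stereo upmix

    def forward(self, mel):
        mono_lo = self.mel_to_wav(mel)
        mono_hi = self.bandwidth_ext(mono_lo)
        stereo = self.mono_to_stereo(mono_hi)   # (left, right); their downmix stays close to mono_hi
        return stereo
```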
Mitigating hallucinations of Large Multi-modal Models(LMMs) is crucial to enhance their reliability for general-purpose assistants. This paper shows that such hallucinations of LMMs can be significantly exacerbated by preceding user-system dialogues. To precisely measure this, we first present an evaluation benchmark by extending popular multi-modal benchmark datasets with prepended hallucinatory dialogues generated by our novel Adversarial Question Generator, which can automatically generate image-related yet adversarial dialogues by adopting adversarial attacks on LMMs. On our benchmark, the zero-shot performance of state-of-the-art LMMs dropped significantly for both the VQA and Captioning tasks. Next, we further reveal this hallucination is mainly due to the prediction bias toward preceding dialogues rather than visual content. To reduce this bias, we propose Adversarial Instruction Tuning that robustly fine-tunes LMMs on augmented multi-modal instruction-following datasets with hallucinatory dialogues. Extensive experiments show that our proposed approach successfully reduces dialogue hallucination while maintaining or even improving performance.
https://arxiv.org/abs/2403.10492
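A sketch of how a benchmark sample of this kind can be constructed: an adversarial, image-related dialogue is prepended to the original VQA sample, and the model is then scored on the original question. The generator interface and prompt format are assumptions.

```python
def build_hallucination_sample(adv_generator, image, question, answer):
    adv_dialogue = adv_generator(image, question)   # misleading but image-related turns
    prompt = ""
    for user_turn, system_turn in adv_dialogue:
        prompt += f"USER: {user_turn}\nASSISTANT: {system_turn}\n"
    prompt += f"USER: {question}\nASSISTANT:"
    return {"image": image, "prompt": prompt, "answer": answer}

# Adversarial instruction tuning then fine-tunes the LMM on instruction data
# augmented with such dialogues, so answers must rely on the visual content
# rather than on the preceding conversation.
```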
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require many trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features, but also finds the proper locations for placing the multi-head self-attention module. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of light-weight convolution layers and memory-efficient self-attention layers across branches at different resolutions and fuses them to high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{this https URL}.
https://arxiv.org/abs/2403.10413
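A minimal sketch of selecting Pareto-optimal architectures under the two objectives mentioned above (lower latency, higher mIoU) from the candidates produced by a supernet search.

```python
def pareto_front(candidates):
    """candidates: dicts with 'latency' (lower is better) and 'miou' (higher is better)."""
    front = []
    for c in candidates:
        dominated = any(o["latency"] <= c["latency"] and o["miou"] >= c["miou"]
                        and (o["latency"] < c["latency"] or o["miou"] > c["miou"])
                        for o in candidates)
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda a: a["latency"])
```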
In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our focus is on mining and capturing cross-task relationships. Existing solutions typically rely on learning global image representations for global cross-task image matching, imposing constraints that, unfortunately, sacrifice the finer structures within the images. Attempting local matching as a remedy faces hurdles due to the lack of precise region supervision, making local alignment a challenging endeavor. The introduction of Segment Anything Model (SAM) sheds light on addressing local alignment challenges by providing free and high-quality solutions for region detection. Leveraging SAM-detected regions, the subsequent challenge lies in aligning the representations within these regions. Diverging from conventional methods that directly learn a monolithic image representation, our proposal involves modeling region-wise representations using Gaussian Distributions. Aligning these distributions between corresponding regions from different tasks imparts higher flexibility and capacity to capture intra-region structures, accommodating a broader range of tasks. This innovative approach significantly enhances our ability to effectively capture cross-task relationships, resulting in improved overall performance in partially supervised multi-task dense prediction scenarios. Extensive experiments conducted on two widely used benchmarks underscore the superior effectiveness of our proposed method, showcasing state-of-the-art performance even when compared to fully supervised methods.
https://arxiv.org/abs/2403.10252
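A sketch of the region-wise idea above: within each SAM-detected region, per-task features are summarized as a diagonal Gaussian, and corresponding regions from two task branches are aligned via a distance between those Gaussians. The KL form used here is one possible choice; the paper's exact divergence may differ.

```python
import torch

def region_gaussian(feats, mask):
    # feats: (C, H, W) feature map; mask: (H, W) boolean SAM region
    x = feats[:, mask]                        # (C, N) features inside the region
    return x.mean(dim=1), x.var(dim=1) + 1e-6

def diag_gaussian_kl(mu_p, var_p, mu_q, var_q):
    return 0.5 * (torch.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q - 1).sum()

def region_alignment_loss(feats_a, feats_b, masks):
    loss = 0.0
    for m in masks:                           # same SAM regions across task branches
        mu_a, var_a = region_gaussian(feats_a, m)
        mu_b, var_b = region_gaussian(feats_b, m)
        loss = loss + diag_gaussian_kl(mu_a, var_a, mu_b, var_b)
    return loss / len(masks)
```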
Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos. However, they perform almost the same as random on grounding text queries in long and complicated videos, having little ability to understand and reason about temporal information, which is the most fundamental difference between videos and images. In this paper, we propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner. To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans, with which we introduce two new time-aware training objectives to video-text LLMs. We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives. Extensive experiments show that HawkEye is better at temporal video grounding and comparable on other video-text tasks with existing video-text LLMs, which verifies its superior video-text multi-modal understanding abilities.
https://arxiv.org/abs/2403.10228
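One possible coarse-grained segment representation of the kind described above, sketched below: instead of exact timestamps, a segment is verbalized as a coarse portion of the video, which is easier for an LLM to emit and parse in a text-to-text setting. The exact vocabulary is an assumption, not the paper's.

```python
def segment_to_text(start, end, duration, bins=4):
    names = ["the beginning", "the first half", "the second half", "the end"]
    lo = min(int(bins * start / duration), bins - 1)
    hi = min(int(bins * end / duration), bins - 1)
    return names[lo] if lo == hi else f"from {names[lo]} to {names[hi]}"

# segment_to_text(2.0, 10.0, 60.0)   -> "the beginning"
# segment_to_text(5.0, 40.0, 60.0)   -> "from the beginning to the second half"
```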
Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to boost more intricate alignment. Extensive experiments validate the effectiveness of our approach, outperforming previous methods by significant margins of at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.
https://arxiv.org/abs/2403.10146
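A sketch of the cross-modal similarity consistency idea above: intra-modal similarities (e.g., text-text) act as soft targets for the cross-modal similarity matrix, instead of purely binary contrastive labels. The exact loss form and temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_consistency_loss(audio_emb, text_emb, tau=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    cross = a @ t.T / tau                              # audio-to-text similarities
    soft_target = F.softmax((t @ t.T) / tau, dim=-1)   # text-text relations as soft labels
    return F.kl_div(F.log_softmax(cross, dim=-1), soft_target, reduction="batchmean")
```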
Understanding videos is one of the fundamental directions in computer vision research, with extensive efforts dedicated to exploring various architectures such as RNN, 3D CNN, and Transformers. The newly proposed architecture of state space model, e.g., Mamba, shows promising traits to extend its success in long sequence modeling to video modeling. To assess whether Mamba can be a viable alternative to Transformers in the video understanding domain, in this work, we conduct a comprehensive set of studies, probing different roles Mamba can play in modeling videos, while investigating diverse tasks where Mamba could exhibit superiority. We categorize Mamba into four roles for modeling videos, deriving a Video Mamba Suite composed of 14 models/modules, and evaluating them on 12 video understanding tasks. Our extensive experiments reveal the strong potential of Mamba on both video-only and video-language tasks while showing promising efficiency-performance trade-offs. We hope this work could provide valuable data points and insights for future research on video understanding. Code is public: this https URL.
https://arxiv.org/abs/2403.09626
In this work, we discuss building performant Multimodal Large Language Models (MLLMs). In particular, we study the importance of various architecture components and data choices. Through careful and comprehensive ablations of the image encoder, the vision-language connector, and various pre-training data choices, we identified several crucial design lessons. For example, we demonstrate that for large-scale multimodal pre-training, using a careful mix of image-caption, interleaved image-text, and text-only data is crucial for achieving state-of-the-art (SOTA) few-shot results across multiple benchmarks, compared to other published pre-training results. Further, we show that the image encoder, together with the image resolution and the image token count, has a substantial impact, while the vision-language connector design is of comparatively negligible importance. By scaling up the presented recipe, we build MM1, a family of multimodal models of up to 30B parameters, consisting of both dense models and mixture-of-experts (MoE) variants, that are SOTA in pre-training metrics and achieve competitive performance after supervised fine-tuning on a range of established multimodal benchmarks. Thanks to large-scale pre-training, MM1 enjoys appealing properties such as enhanced in-context learning and multi-image reasoning, enabling few-shot chain-of-thought prompting.
https://arxiv.org/abs/2403.09611
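A sketch of the kind of pre-training mixture discussed above: batches are drawn from captioned image-text pairs, interleaved image-text documents, and text-only documents according to fixed mixture weights. The weights shown are placeholders, not the ratios reported for MM1.

```python
import random

def sample_batch(caption_data, interleaved_data, text_data, batch_size=8,
                 weights=(0.45, 0.45, 0.10)):
    sources = [caption_data, interleaved_data, text_data]
    batch = []
    for _ in range(batch_size):
        src = random.choices(sources, weights=weights, k=1)[0]
        batch.append(random.choice(src))
    return batch
```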