This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions that outperformed the baseline on the test set; 10 of them submitted a report describing their approach, highlighting the widespread use of foundation models such as Depth Anything at the core of their methods. The challenge winners drastically improved the 3D F-Score, from 17.51% to 23.72%.
This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, with complex scenes in natural and indoor settings. As in previous editions, methods may use any form of supervision, i.e. supervised or self-supervised. A total of 19 submissions outperformed the baseline on the test set; 10 of them submitted a report describing their approach, highlighting the widespread use of foundation models such as Depth Anything at the core of their methods. The challenge winners drastically improved the 3D F-Score, from 17.51% to 23.72%.
https://arxiv.org/abs/2404.16831
Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism called "Adding Attributes to Prompt Learning", AAPL, we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
Recent large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building on this, studies such as CoOp and CoCoOp have proposed prompt learning, in which the context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes remains marginal, and data augmentation has frequently been used in traditional zero-shot learning to tackle this problem. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, harming generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in the learnable prompts. Through our novel mechanism, "Adding Attributes to Prompt Learning" (AAPL), we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. Experiments across 11 datasets show that, overall, AAPL performs favorably against existing methods on few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
https://arxiv.org/abs/2404.16804
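For readers unfamiliar with the prompt-learning setup that AAPL builds on, the sketch below shows a minimal CoOp-style learnable context in PyTorch: a small set of context vectors is prepended to each class-name embedding and optimized while the VLM encoders stay frozen. The tensor shapes and the stand-in class embeddings are illustrative assumptions, not AAPL's actual implementation (which additionally applies adversarial token embedding to the augmentation-induced bias).

```python
import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    """CoOp-style prompt: n_ctx learnable vectors shared across all classes."""
    def __init__(self, n_classes: int, n_ctx: int = 4, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)        # learnable context
        self.cls_embed = nn.Parameter(torch.randn(n_classes, 1, dim),  # stand-in for frozen
                                      requires_grad=False)              # class-name embeddings

    def forward(self):
        n_cls = self.cls_embed.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)   # (n_cls, n_ctx, dim)
        # The concatenated prompts would normally be fed to a frozen text encoder
        # and trained with the usual image-text classification loss.
        return torch.cat([ctx, self.cls_embed], dim=1)      # (n_cls, n_ctx + 1, dim)

prompts = LearnableContext(n_classes=10)()
print(prompts.shape)  # torch.Size([10, 5, 512])
```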
Vision-language models enable open-world classification of objects without the need for any retraining. While this zero-shot paradigm marks a significant advance, even today's best models exhibit skewed performance when objects are dissimilar from their typical depiction. Real-world objects such as pears appear in a variety of forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a single vector based on the class label. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find our method consistently outperforms standard zero-shot classification over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, as well as finer-grained datasets where intra-class diversity may be less prevalent. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find our method scales efficiently to a large number of attributes to account for diversity -- leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into the promise of zero-shot classification beyond a single class vector for capturing diversity in the world, and building transparent AI systems without compromising performance.
Vision-language models enable open-world classification of objects without any retraining. While this zero-shot paradigm marks significant progress, even the best current models exhibit skewed performance when objects are dissimilar from their typical depiction. Real-world objects such as pears appear in many forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a single vector based on the class label. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find that our method consistently outperforms standard zero-shot classification over a large suite of datasets covering hierarchies, diverse object states, and real-world geographic diversity. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find that our method scales efficiently to a large number of attributes to account for diversity, leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst-class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into zero-shot classification beyond a single class vector for capturing diversity in the world, and into building transparent AI systems without compromising performance.
https://arxiv.org/abs/2404.16717
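A minimal sketch of the general idea behind this paper: score an image against several attribute-conditioned text embeddings per class and aggregate, rather than against one class-name vector. The stub embedding function, the hand-written attribute prompts, and the max-aggregation rule are assumptions for illustration; the paper infers the attributes itself and studies its own aggregation and trade-offs.

```python
import numpy as np

rng = np.random.default_rng(0)
embed = lambda texts: rng.normal(size=(len(texts), 64))  # placeholder for a real text encoder

class_attributes = {                        # hypothetical inferred attributes per class
    "pear":  ["a whole pear", "a diced pear in a bowl", "a pear on a table"],
    "apple": ["a whole apple", "sliced apple pieces", "an apple in a basket"],
}

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def classify(image_feat):
    image_feat = normalize(image_feat)
    scores = {}
    for cls, prompts in class_attributes.items():
        text_feats = normalize(embed(prompts))
        sims = text_feats @ image_feat            # one similarity per attribute prompt
        scores[cls] = sims.max()                  # aggregate over the class's attribute vectors
    return max(scores, key=scores.get), scores

pred, scores = classify(rng.normal(size=64))
print(pred, scores)
```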
Visual Instruction Tuning represents a novel learning paradigm involving the fine-tuning of pre-trained language models using task-specific instructions. This paradigm shows promising zero-shot results in various natural language processing tasks but is still unexplored in vision emotion understanding. In this work, we focus on enhancing the model's proficiency in understanding and adhering to instructions related to emotional contexts. Initially, we identify key visual clues critical to visual emotion recognition. Subsequently, we introduce a novel GPT-assisted pipeline for generating emotion visual instruction data, effectively addressing the scarcity of annotated instruction data in this domain. Expanding on the groundwork established by InstructBLIP, our proposed EmoVIT architecture incorporates emotion-specific instruction data, leveraging the powerful capabilities of Large Language Models to enhance performance. Through extensive experiments, our model showcases its proficiency in emotion classification, adeptness in affective reasoning, and competence in comprehending humor. The comparative analysis provides a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs, providing valuable insights and opening avenues for future exploration in this domain. Our code is available at \url{this https URL}.
Visual Instruction Tuning is a novel learning paradigm that fine-tunes pre-trained language models using task-specific instructions. It shows promising zero-shot results on various natural language processing tasks but remains unexplored for visual emotion understanding. In this work, we focus on enhancing the model's ability to understand and follow instructions related to emotional contexts. We first identify key visual clues critical to visual emotion recognition. We then introduce a novel GPT-assisted pipeline for generating emotion visual instruction data, effectively addressing the scarcity of annotated instruction data in this domain. Building on the groundwork established by InstructBLIP, our proposed EmoVIT architecture incorporates emotion-specific instruction data and leverages the powerful capabilities of Large Language Models to enhance performance. Extensive experiments show that our model excels at emotion classification, affective reasoning, and humor comprehension. The comparative analysis provides a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs, offering valuable insights and opening avenues for future exploration in this domain. Our code is available at \url{this https URL}.
https://arxiv.org/abs/2404.16670
Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their large number of parameters and high inference time. While existing approaches have scaled down the entire CLIP architecture, we focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance. However, we find that this approach surprisingly fails in true zero-shot settings when using contrastive losses. We identify the exploitation of spurious features as being responsible for poor generalization between synthetic and real data. However, by using the image feature-based L2 distillation loss, we mitigate these problems and train students that achieve zero-shot performance which on four domain-specific datasets is on-par with a ViT-B/32 teacher model trained on DataCompXL, while featuring up to 92% fewer parameters.
Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their large number of parameters and high inference time limit their applicability in resource-constrained environments. While existing approaches scale down the entire CLIP architecture, we focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. Using synthetic data has shown promise for distilling representations from larger teachers, yielding strong few-shot and linear-probe performance. However, we find that this approach surprisingly fails in true zero-shot settings when contrastive losses are used. We identify the exploitation of spurious features as responsible for the poor generalization between synthetic and real data. By using an image-feature-based L2 distillation loss instead, we mitigate these problems and train students whose zero-shot performance on four domain-specific datasets is on par with a ViT-B/32 teacher model trained on DataCompXL, while using up to 92% fewer parameters.
https://arxiv.org/abs/2404.16637
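The contrast this abstract draws is between contrastive distillation and a plain L2 loss on image features. Below is a hedged sketch of the latter; the feature dimensions, the projection layer, and the normalization are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn

teacher_dim, student_dim = 512, 256
project = nn.Linear(student_dim, teacher_dim)   # map student features into the teacher's space

def l2_distillation_loss(student_feats, teacher_feats):
    # Normalize both embeddings, then penalize their squared L2 distance.
    s = nn.functional.normalize(project(student_feats), dim=-1)
    t = nn.functional.normalize(teacher_feats, dim=-1).detach()   # teacher stays frozen
    return ((s - t) ** 2).sum(dim=-1).mean()

student_feats = torch.randn(8, student_dim, requires_grad=True)   # stand-ins for encoder outputs
teacher_feats = torch.randn(8, teacher_dim)
loss = l2_distillation_loss(student_feats, teacher_feats)
loss.backward()
print(loss.item())
```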
Unsupervised cross-lingual transfer involves transferring knowledge between languages without explicit supervision. Although numerous studies have been conducted to improve performance in such tasks by focusing on cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches are limited as they only incorporate syntactic or lexical information. Since each type of information offers unique advantages and no previous attempts have combined both, we attempt to explore the potential of this approach. In this paper, we present a novel framework called "Lexicon-Syntax Enhanced Multilingual BERT" that combines both lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capabilities. The code-switching technique is used to implicitly teach the model lexical alignment information, while a syntactic-based graph attention network is designed to help the model encode syntactic structure. To integrate both types of knowledge, we input code-switched sequences into both the syntactic module and the mBERT base model simultaneously. Our extensive experimental results demonstrate this framework can consistently outperform all baselines of zero-shot cross-lingual transfer, with gains of 1.0–3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks. Keywords: cross-lingual transfer, lexicon, syntax, code-switching, graph attention network
Unsupervised cross-lingual transfer involves transferring knowledge between languages without any explicit supervision. Although numerous studies have sought to improve performance on such tasks by exploiting cross-lingual knowledge, particularly lexical and syntactic knowledge, current approaches remain limited because they incorporate only syntactic or only lexical information. Since each type of information offers unique advantages and no previous work has combined the two, we explore the potential of doing so. In this paper, we present a novel framework called "Lexicon-Syntax Enhanced Multilingual BERT" that combines lexical and syntactic knowledge. Specifically, we use Multilingual BERT (mBERT) as the base model and employ two techniques to enhance its learning capability: a code-switching technique implicitly teaches the model lexical alignment information, while a syntax-based graph attention network helps the model encode syntactic structure. To integrate both types of knowledge, we feed code-switched sequences into the syntactic module and the mBERT base model simultaneously. Extensive experiments show that this framework consistently outperforms all zero-shot cross-lingual transfer baselines, with gains of 1.0–3.7 points on text classification, named entity recognition (NER), and semantic parsing tasks. Keywords: cross-lingual transfer, lexicon, syntax, code-switching, graph attention network
https://arxiv.org/abs/2404.16627
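Code-switching, as used in this line of work, simply replaces some source-language tokens with their translations from a bilingual lexicon, so the model sees mixed-language sequences and implicitly learns lexical alignment. A toy sketch follows; the lexicon and the replacement rate are made up, and the paper additionally routes such sequences through its syntactic graph-attention module.

```python
import random

lexicon = {"the": "le", "cat": "chat", "sat": "assis", "mat": "tapis"}  # toy EN->FR lexicon

def code_switch(tokens, lexicon, p=0.5, seed=0):
    """Replace each token with its lexicon translation with probability p."""
    rng = random.Random(seed)
    return [lexicon[tok] if tok in lexicon and rng.random() < p else tok
            for tok in tokens]

print(code_switch("the cat sat on the mat".split(), lexicon))
# e.g. ['le', 'cat', 'assis', 'on', 'the', 'tapis']
```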
Low-shot counters estimate the number of objects corresponding to a selected category, based on only a few or no exemplars annotated in the image. The current state-of-the-art estimates the total counts as the sum over the object location density map, but does not provide individual object locations and sizes, which are crucial for many applications. This is addressed by detection-based counters, which, however, fall behind in the total count accuracy. Furthermore, both approaches tend to overestimate the counts in the presence of other object classes due to many false positives. We propose DAVE, a low-shot counter based on a detect-and-verify paradigm, that avoids the aforementioned issues by first generating a high-recall detection set and then verifying the detections to identify and remove the outliers. This jointly increases the recall and precision, leading to accurate counts. DAVE outperforms the top density-based counters by ~20% in the total count MAE, it outperforms the most recent detection-based counter by ~20% in detection quality and sets a new state-of-the-art in zero-shot as well as text-prompt-based counting.
Low-shot counters estimate the number of objects of a selected category based on only a few annotated exemplars in the image, or none at all. The current state of the art estimates the total count as the sum over an object-location density map, but does not provide individual object locations and sizes, which are crucial for many applications. Detection-based counters address this, yet they fall behind in total-count accuracy. Furthermore, both approaches tend to overestimate the counts in the presence of other object classes because of many false positives. We propose DAVE, a low-shot counter based on a detect-and-verify paradigm that avoids these issues by first generating a high-recall detection set and then verifying the detections to identify and remove outliers. This jointly increases recall and precision, leading to accurate counts. DAVE outperforms the top density-based counters by about 20% in total-count MAE, outperforms the most recent detection-based counter by about 20% in detection quality, and sets a new state of the art in zero-shot as well as text-prompt-based counting.
https://arxiv.org/abs/2404.16622
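The detect-and-verify idea can be illustrated with a simple verification stage: start from a high-recall set of candidate detections, then discard candidates whose appearance features are outliers with respect to the majority. The cosine-to-prototype rule below is an illustrative stand-in, not DAVE's actual verification step.

```python
import numpy as np

def verify(candidate_feats, threshold=0.6):
    """Keep detections whose features agree with the prototype of all candidates."""
    feats = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    prototype = feats.mean(axis=0)
    prototype /= np.linalg.norm(prototype)
    sims = feats @ prototype                  # cosine similarity to the prototype
    keep = sims >= threshold                  # outliers (likely other classes) are dropped
    return keep, int(keep.sum())

rng = np.random.default_rng(0)
inliers = rng.normal(loc=1.0, scale=0.1, size=(30, 16))   # similar-looking target objects
outliers = rng.normal(loc=0.0, scale=1.0, size=(5, 16))   # false positives from other classes
keep, count = verify(np.vstack([inliers, outliers]))
print(count, "verified detections out of", keep.size)
```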
Recent advances in Vision and Language Models (VLMs) have improved open-world 3D representation, facilitating 3D zero-shot capability in unseen categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and corresponding texts. However, the limited color and texture variations in CAD images can compromise the alignment robustness. Furthermore, the volume discrepancy between pre-training datasets of the 3D encoder and VLM leads to sub-optimal 2D to 3D knowledge transfer. To overcome these issues, we propose OpenDlign, a novel framework for learning open-world 3D representations, that leverages depth-aligned images generated from point cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. OpenDlign also optimizes depth map projection and integrates depth-specific text prompts, improving 2D VLM knowledge adaptation for 3D learning with efficient fine-tuning. Experimental results show that OpenDlign significantly outperforms existing benchmarks in zero-shot and few-shot 3D tasks, exceeding prior scores by 8.0% on ModelNet40 and 16.4% on OmniObject3D with just 6 million tuned parameters. Moreover, integrating generated depth-aligned images into existing 3D learning pipelines consistently improves their performance.
Recent advances in Vision and Language Models (VLMs) have improved open-world 3D representation, enabling 3D zero-shot capability on unseen categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and corresponding texts. However, the limited color and texture variation in CAD images can compromise the robustness of this alignment. Furthermore, the volume discrepancy between the pre-training datasets of the 3D encoder and the VLM leads to sub-optimal 2D-to-3D knowledge transfer. To overcome these issues, we propose OpenDlign, a novel framework for learning open-world 3D representations that leverages depth-aligned images generated from point-cloud-projected depth maps. Unlike CAD-rendered images, the generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. OpenDlign also optimizes depth map projection and integrates depth-specific text prompts, improving the adaptation of 2D VLM knowledge to 3D learning with efficient fine-tuning. Experimental results show that OpenDlign significantly outperforms existing benchmarks on zero-shot and few-shot 3D tasks, exceeding prior scores by 8.0% on ModelNet40 and 16.4% on OmniObject3D with only 6 million tuned parameters. Moreover, integrating the generated depth-aligned images into existing 3D learning pipelines consistently improves their performance.
https://arxiv.org/abs/2404.16538
Instruction tuning has shown its ability to not only enhance zero-shot generalization across various tasks but also its effectiveness in improving the performance of specific tasks. A crucial aspect in instruction tuning for a particular task is a strategic selection of related tasks that offer meaningful supervision, thereby enhancing efficiency and preventing performance degradation from irrelevant tasks. Our research reveals that leveraging instruction information \textit{alone} enables the identification of pertinent tasks for instruction tuning. This approach is notably simpler compared to traditional methods that necessitate complex measurements of pairwise transferability between tasks or the creation of data samples for the target task. Furthermore, by additionally learning the unique instructional template style of the meta-dataset, we observe an improvement in task selection accuracy, which contributes to enhanced overall performance. Experimental results demonstrate that training on a small set of tasks, chosen solely based on the instructions, leads to substantial performance improvements on benchmarks like P3, Big-Bench, NIV2, and Big-Bench Hard. Significantly, these improvements exceed those achieved by prior task selection methods, highlighting the efficacy of our approach.
Instruction tuning has been shown not only to enhance zero-shot generalization across various tasks but also to improve performance on specific tasks. A crucial aspect of instruction tuning for a particular task is the strategic selection of related tasks that offer meaningful supervision, thereby enhancing efficiency and preventing performance degradation from irrelevant tasks. Our research reveals that leveraging instruction information alone suffices to identify pertinent tasks for instruction tuning. This approach is notably simpler than traditional methods that require complex measurements of pairwise transferability between tasks or the creation of data samples for the target task. Furthermore, by additionally learning the unique instruction-template style of the meta-dataset, we observe an improvement in task-selection accuracy, which contributes to enhanced overall performance. Experimental results demonstrate that training on a small set of tasks, chosen solely based on the instructions, leads to substantial performance improvements on benchmarks such as P3, Big-Bench, NIV2, and Big-Bench Hard. Significantly, these improvements exceed those achieved by prior task-selection methods, highlighting the efficacy of our approach.
https://arxiv.org/abs/2404.16418
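The core selection step can be approximated by embedding each source task's instruction, embedding the target task's instruction, and picking the most similar source tasks. The TF-IDF similarity below is only a stand-in for whatever instruction encoder the paper actually uses, and the toy instructions are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_instructions = {
    "nli":       "Given a premise and a hypothesis, decide whether the hypothesis follows.",
    "sentiment": "Classify the sentiment of the review as positive or negative.",
    "qa":        "Answer the question using the provided passage.",
    "summarize": "Write a short summary of the article.",
}
target_instruction = "Decide whether the second sentence is entailed by the first sentence."

names = list(source_instructions)
vec = TfidfVectorizer().fit(list(source_instructions.values()) + [target_instruction])
src = vec.transform(source_instructions.values())
tgt = vec.transform([target_instruction])

scores = cosine_similarity(tgt, src).ravel()
top_k = [names[i] for i in scores.argsort()[::-1][:2]]
print(top_k)   # source tasks whose instructions look most like the target's, e.g. ['nli', ...]
```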
Many image retrieval studies use metric learning to train an image encoder. However, metric learning cannot handle differences in users' preferences, and requires data to train an image encoder. To overcome these limitations, we revisit relevance feedback, a classic technique for interactive retrieval systems, and propose an interactive CLIP-based image retrieval system with relevance feedback. Our retrieval system first executes the retrieval, collects each user's unique preferences through binary feedback, and returns images the user prefers. Even when users have various preferences, our retrieval system learns each user's preference through the feedback and adapts to the preference. Moreover, our retrieval system leverages CLIP's zero-shot transferability and achieves high accuracy without training. We empirically show that our retrieval system competes well with state-of-the-art metric learning in category-based image retrieval, despite not training image encoders specifically for each dataset. Furthermore, we set up two additional experimental settings where users have various preferences: one-label-based image retrieval and conditioned image retrieval. In both cases, our retrieval system effectively adapts to each user's preferences, resulting in improved accuracy compared to image retrieval without feedback. Overall, our work highlights the potential benefits of integrating CLIP with classic relevance feedback techniques to enhance image retrieval.
Many image retrieval studies use metric learning to train an image encoder. However, metric learning cannot handle differences in users' preferences and requires data to train the encoder. To overcome these limitations, we revisit relevance feedback, a classic technique for interactive retrieval systems, and propose an interactive CLIP-based image retrieval system with relevance feedback. Our system first executes a retrieval, collects each user's unique preferences through binary feedback, and returns images the user prefers. Even when users have diverse preferences, the system learns each user's preference through the feedback and adapts to it. Moreover, the system leverages CLIP's zero-shot transferability and achieves high accuracy without any training. We empirically show that our retrieval system competes well with state-of-the-art metric learning on category-based image retrieval, despite not training image encoders specifically for each dataset. We further set up two additional experimental settings in which users have various preferences: one-label-based image retrieval and conditioned image retrieval. In both cases, the system effectively adapts to each user's preferences, improving accuracy over image retrieval without feedback. Overall, our work highlights the potential benefits of integrating CLIP with classic relevance feedback techniques to enhance image retrieval.
https://arxiv.org/abs/2404.16398
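A minimal sketch of how binary relevance feedback can steer a CLIP-style retrieval query: the query embedding is pushed toward images the user marked relevant and away from those marked irrelevant. This is a generic Rocchio-style update over placeholder embeddings, not necessarily the paper's exact feedback rule.

```python
import numpy as np

def update_query(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update of a query embedding from binary feedback."""
    q = alpha * query
    if len(relevant):
        q += beta * np.mean(relevant, axis=0)
    if len(irrelevant):
        q -= gamma * np.mean(irrelevant, axis=0)
    return q / np.linalg.norm(q)

def retrieve(query, gallery, k=5):
    sims = gallery @ query                       # cosine similarity (embeddings pre-normalized)
    return np.argsort(sims)[::-1][:k]

rng = np.random.default_rng(0)
gallery = rng.normal(size=(100, 32))             # stand-in for CLIP image embeddings
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query = gallery[0] + 0.1 * rng.normal(size=32)   # stand-in for a CLIP text/query embedding
query /= np.linalg.norm(query)

top = retrieve(query, gallery)
liked, disliked = gallery[top[:2]], gallery[top[2:]]   # binary feedback on the first results
query = update_query(query, liked, disliked)
print(retrieve(query, gallery))
```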
Zero-shot learning has consistently yielded remarkable progress via modeling nuanced one-to-one visual-attribute correlation. Existing studies resort to refining a uniform mapping function to align and correlate the sample regions and subattributes, ignoring two crucial issues: 1) the inherent asymmetry of attributes; and 2) the unutilized channel information. This paper addresses these issues by introducing a simple yet effective approach, dubbed Dual Expert Distillation Network (DEDN), where two experts are dedicated to coarse- and fine-grained visual-attribute modeling, respectively. Concretely, one coarse expert, namely cExp, has a complete perceptual scope to coordinate visual-attribute similarity metrics across dimensions, and moreover, another fine expert, namely fExp, consists of multiple specialized subnetworks, each corresponding to an exclusive set of attributes. Two experts cooperatively distill from each other to reach a mutual agreement during training. Meanwhile, we further equip DEDN with a newly designed backbone network, i.e., Dual Attention Network (DAN), which incorporates both region and channel attention information to fully exploit and leverage visual semantic knowledge. Experiments on various benchmark datasets indicate a new state-of-the-art.
Zero-shot learning has consistently made remarkable progress by modeling nuanced one-to-one visual-attribute correlations. Existing studies refine a uniform mapping function to align and correlate sample regions and sub-attributes, ignoring two crucial issues: (1) the inherent asymmetry of attributes and (2) unutilized channel information. This paper addresses these issues with a simple yet effective approach, the Dual Expert Distillation Network (DEDN), in which two experts are dedicated to coarse- and fine-grained visual-attribute modeling, respectively. Concretely, the coarse expert, cExp, has a complete perceptual scope for coordinating visual-attribute similarity metrics across dimensions, while the fine expert, fExp, consists of multiple specialized subnetworks, each corresponding to an exclusive set of attributes. The two experts distill from each other cooperatively to reach a mutual agreement during training. Meanwhile, we equip DEDN with a newly designed backbone, the Dual Attention Network (DAN), which incorporates both region and channel attention to fully exploit visual semantic knowledge. Experiments on various benchmark datasets indicate a new state of the art.
https://arxiv.org/abs/2404.16348
Despite the remarkable success of deep learning in medical imaging analysis, medical image segmentation remains challenging due to the scarcity of high-quality labeled images for supervision. Further, the significant domain gap between natural and medical images in general and ultrasound images in particular hinders fine-tuning models trained on natural images to the task at hand. In this work, we address the performance degradation of segmentation models in low-data regimes and propose a prompt-less segmentation method harnessing the ability of segmentation foundation models to segment abstract shapes. We do that via our novel prompt point generation algorithm which uses coarse semantic segmentation masks as input and a zero-shot prompt-able foundation model as an optimization target. We demonstrate our method on a segmentation findings task (pathologic anomalies) in ultrasound images. Our method's advantages are brought to light in varying degrees of low-data regime experiments on a small-scale musculoskeletal ultrasound images dataset, yielding a larger performance gain as the training set size decreases.
Despite the remarkable success of deep learning in medical imaging analysis, medical image segmentation remains challenging because high-quality labeled images for supervision are scarce. Moreover, the significant domain gap between natural and medical images in general, and ultrasound images in particular, hinders fine-tuning models trained on natural images for the task at hand. In this work, we address the performance degradation of segmentation models in low-data regimes and propose a prompt-less segmentation method that harnesses the ability of segmentation foundation models to segment abstract shapes. We do so via a novel prompt point generation algorithm that takes coarse semantic segmentation masks as input and uses a zero-shot promptable foundation model as an optimization target. We demonstrate the method on a segmentation findings task (pathologic anomalies) in ultrasound images. The method's advantages emerge in experiments at varying degrees of low-data regime on a small-scale musculoskeletal ultrasound image dataset, yielding larger performance gains as the training set size decreases.
https://arxiv.org/abs/2404.16325
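The prompt-less idea reduces to deriving point prompts automatically from a coarse semantic mask and passing them to a promptable segmentation foundation model (e.g., a SAM-like model). The uniform sampling below is only a simplified stand-in; the paper's algorithm additionally optimizes the points against the foundation model's output.

```python
import numpy as np

def sample_prompt_points(coarse_mask, n_pos=3, n_neg=3, seed=0):
    """Sample positive points inside a coarse mask and negative points outside it."""
    rng = np.random.default_rng(seed)
    pos_idx = np.argwhere(coarse_mask > 0)
    neg_idx = np.argwhere(coarse_mask == 0)
    pos = pos_idx[rng.choice(len(pos_idx), size=min(n_pos, len(pos_idx)), replace=False)]
    neg = neg_idx[rng.choice(len(neg_idx), size=min(n_neg, len(neg_idx)), replace=False)]
    points = np.vstack([pos, neg])                       # (y, x) coordinates
    labels = np.array([1] * len(pos) + [0] * len(neg))   # 1 = foreground, 0 = background
    return points, labels                                # ready to pass to a promptable segmenter

mask = np.zeros((64, 64), dtype=np.uint8)
mask[20:40, 25:45] = 1                                   # toy coarse segmentation
points, labels = sample_prompt_points(mask)
print(points, labels)
```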
Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks such as video infilling and prediction when provided with more images. Its autoregressive design also supports long video generation.
Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that lets a pretrained text-to-video (T2V) diffusion model be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or external modules. Our approach uses a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize the video frame by frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize the Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks, such as video infilling and prediction, when given more images. Its autoregressive design also supports long video generation.
https://arxiv.org/abs/2404.16306
Electronic health records (EHRs), even though a boon for healthcare practitioners, are growing more convoluted and longer every day. Sifting through these lengthy EHRs is taxing and becomes a cumbersome part of physician-patient interaction. Several approaches have been proposed to help alleviate this prevalent issue either via summarization or sectioning; however, only a few approaches have truly been helpful in the past. With the rise of automated methods, machine learning (ML) has shown promise in solving the task of identifying relevant sections in EHRs. However, most ML methods rely on labeled data, which is difficult to get in healthcare. Large language models (LLMs), on the other hand, have performed impressive feats in natural language processing (NLP), and in a zero-shot manner, i.e. without any labeled data. To that end, we propose using LLMs to identify relevant section headers. We find that GPT-4 can effectively solve the task in both zero- and few-shot settings as well as segment dramatically better than state-of-the-art methods. Additionally, we also annotate a much harder real-world dataset and find that GPT-4 struggles to perform well, alluding to further research and harder benchmarks.
Electronic health records (EHRs), although a boon for healthcare practitioners, are growing more convoluted and longer every day. Sifting through these lengthy EHRs is taxing and has become a cumbersome part of physician-patient interaction. Several approaches, based on summarization or sectioning, have been proposed to alleviate this issue, but only a few have truly been helpful. With the rise of automated methods, machine learning (ML) has shown promise for identifying relevant sections in EHRs. However, most ML methods rely on labeled data, which is difficult to obtain in healthcare. Large language models (LLMs), on the other hand, have achieved impressive feats in natural language processing (NLP), and in a zero-shot manner, i.e. without any labeled data. We therefore propose using LLMs to identify relevant section headers. We find that GPT-4 effectively solves the task in both zero- and few-shot settings and segments dramatically better than state-of-the-art methods. In addition, we annotate a much harder real-world dataset and find that GPT-4 struggles on it, pointing to further research and harder benchmarks.
https://arxiv.org/abs/2404.16294
Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.
Recent dataset deduplication techniques have shown that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance loss compared to training on the original dataset. These results are based on pruning commonly used image-caption datasets collected from the web -- datasets known to harbor harmful social biases that may then be codified in trained models. In this work, we evaluate how deduplication affects the prevalence of these biases in the resulting trained models, and we introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M, we find that our proposed FairDeDup algorithm consistently yields better fairness metrics than SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks.
https://arxiv.org/abs/2404.16123
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, being less sensitive to false negative noises in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less (<35%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
The success of contrastive language-image pretraining (CLIP) relies on the supervision provided by image-caption pairs, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster and is therefore less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined by the correlation between task metadata and cluster conditions. To estimate this correlation precisely, the samples in a cluster should be semantically similar, while the number of data experts should remain reasonable for training and inference. We therefore consider the ontology in human language and propose using fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models of OpenAI CLIP and OpenCLIP on zero-shot image classification, at less than 35% of the training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
https://arxiv.org/abs/2404.16030
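At inference, MoDE combines per-expert outputs with weights derived from how well the task metadata matches each expert's data cluster. Below is a hedged sketch with made-up embeddings and a simple cosine-plus-softmax weighting; the paper derives its weights from the correlation between task metadata and fine-grained cluster centers, which this stand-in only approximates.

```python
import numpy as np

def softmax(x, temp=0.1):
    z = (x - x.max()) / temp
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n_experts, n_classes, dim = 4, 10, 64

cluster_centers = rng.normal(size=(n_experts, dim))      # one center per data expert
task_metadata = rng.normal(size=(n_classes, dim))        # e.g. embedded class names of the task
expert_logits = rng.normal(size=(n_experts, n_classes))  # each expert's zero-shot logits

# Weight each expert by the similarity of the task metadata to its cluster center.
task_embed = task_metadata.mean(axis=0)
sims = cluster_centers @ task_embed / (
    np.linalg.norm(cluster_centers, axis=1) * np.linalg.norm(task_embed))
weights = softmax(sims)

ensembled = (weights[:, None] * expert_logits).sum(axis=0)   # (n_classes,)
print(weights, ensembled.argmax())
```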
Recent advances in generative AI have led to the development of techniques to generate visually realistic synthetic video. While a number of techniques have been developed to detect AI-generated synthetic images, in this paper we show that synthetic image detectors are unable to detect synthetic videos. We demonstrate that this is because synthetic video generators introduce substantially different traces than those left by image generators. Despite this, we show that synthetic video traces can be learned, and used to perform reliable synthetic video detection or generator source attribution even after H.264 re-compression. Furthermore, we demonstrate that while detecting videos from new generators through zero-shot transferability is challenging, accurate detection of videos from a new generator can be achieved through few-shot learning.
Recent advances in generative AI have led to techniques for generating visually realistic synthetic video. Although many methods have been developed to detect AI-generated synthetic images, we show in this paper that synthetic image detectors cannot detect synthetic videos. We demonstrate that this is because synthetic video generators introduce traces substantially different from those left by image generators. Nevertheless, we show that synthetic video traces can be learned and used to perform reliable synthetic video detection or generator source attribution, even after H.264 re-compression. Furthermore, we demonstrate that while detecting videos from new generators through zero-shot transferability is challenging, accurate detection of videos from a new generator can be achieved through few-shot learning.
https://arxiv.org/abs/2404.15955
A model's capacity to generalize its knowledge to interpret unseen inputs with different characteristics is crucial to build robust and reliable machine learning systems. Language model evaluation tasks lack information metrics about model generalization and their applicability in a new setting is measured using task and language-specific downstream performance, which is often lacking in many languages and tasks. In this paper, we explore a set of efficient and reliable measures that could aid in computing more information related to the generalization capability of language models in cross-lingual zero-shot settings. In addition to traditional measures such as variance in parameters after training and distance from initialization, we also measure the effectiveness of sharpness in loss landscape in capturing the success in cross-lingual transfer and propose a novel and stable algorithm to reliably compute the sharpness of a model optimum that correlates to generalization.
A model's capacity to generalize its knowledge to interpret unseen inputs with different characteristics is crucial for building robust and reliable machine learning systems. Language model evaluation tasks lack informative metrics of model generalization, and their applicability in new settings is measured via task- and language-specific downstream performance, which is often lacking for many languages and tasks. In this paper, we explore a set of efficient and reliable measures that could aid in computing more information about the generalization capability of language models in cross-lingual zero-shot settings. In addition to traditional measures such as the variance in parameters after training and the distance from initialization, we also measure how effectively the sharpness of the loss landscape captures success in cross-lingual transfer, and we propose a novel, stable algorithm to reliably compute the sharpness of a model optimum, which correlates with generalization.
https://arxiv.org/abs/2404.15928
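A common, simple proxy for the sharpness of an optimum is the average increase in loss under small random parameter perturbations. The paper proposes its own, more stable algorithm, so the sketch below is only a generic baseline illustrating the quantity being measured, with an arbitrary toy model and radius.

```python
import torch
import torch.nn as nn

def sharpness(model, loss_fn, data, target, radius=0.01, n_samples=20):
    """Average loss increase when parameters are perturbed within a small radius."""
    base = loss_fn(model(data), target).item()
    params = list(model.parameters())
    originals = [p.detach().clone() for p in params]
    increases = []
    for _ in range(n_samples):
        with torch.no_grad():
            for p, o in zip(params, originals):
                p.copy_(o + radius * torch.randn_like(o))   # random perturbation around the optimum
            increases.append(loss_fn(model(data), target).item() - base)
    with torch.no_grad():
        for p, o in zip(params, originals):
            p.copy_(o)                                       # restore the original parameters
    return sum(increases) / n_samples

model = nn.Linear(16, 2)                                     # toy stand-in for a trained model
data, target = torch.randn(32, 16), torch.randint(0, 2, (32,))
print(sharpness(model, nn.CrossEntropyLoss(), data, target))
```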
Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically-generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback remains low ultimately.
Individual feedback can help students improve their essay-writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically generated essay feedback may serve as an alternative, guiding students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text, yet their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how, and to what extent, automated essay scoring (AES) can benefit the quality of the generated feedback. We evaluate both the AES performance that LLMs achieve with prompting alone and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback ultimately remains low.
https://arxiv.org/abs/2404.15845
Benefiting from strong generalization ability, pre-trained vision language models (VLMs), e.g., CLIP, have been widely utilized in zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify salient activity (verb) in the image, but also to detect all semantic roles that participate in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task suffers from several limitations, e.g., it struggles to distinguish ambiguous verb concepts, accurately localize roles with fixed verb-centric template input, and achieve context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby enhancing precise semantic role localization; and 3) noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Our extensive validations on the SWiG dataset demonstrate LEX's effectiveness and interoperability in zero-shot GSR.
Benefiting from their strong generalization ability, pre-trained vision-language models (VLMs) such as CLIP have been widely used for zero-shot scene understanding. Unlike simple recognition tasks, grounded situation recognition (GSR) requires the model not only to classify the salient activity (verb) in an image, but also to detect all semantic roles participating in the action. This complex task usually involves three steps: verb recognition, semantic role grounding, and noun recognition. Directly employing class-based prompts with VLMs and grounding models for this task has several limitations: it struggles to distinguish ambiguous verb concepts, to accurately localize roles with fixed verb-centric template inputs, and to make context-aware noun predictions. In this paper, we argue that these limitations stem from the model's poor understanding of verb/noun classes. To this end, we introduce a new approach for zero-shot GSR via a Language EXplainer (LEX), which significantly boosts the model's comprehensive capabilities through three explainers: 1) a verb explainer, which generates general verb-centric descriptions to enhance the discriminability of different verb classes; 2) a grounding explainer, which rephrases verb-centric templates for clearer understanding, thereby improving precise semantic role localization; and 3) a noun explainer, which creates scene-specific noun descriptions to ensure context-aware noun recognition. By equipping each step of the GSR process with an auxiliary explainer, LEX facilitates complex scene understanding in real-world scenarios. Extensive validation on the SWiG dataset demonstrates LEX's effectiveness and interoperability for zero-shot GSR.
https://arxiv.org/abs/2404.15785