The core problem in zero-shot open-vocabulary detection is how to align visual and text features so that the detector performs well on unseen classes. Previous approaches train the feature pyramid and detection head from scratch, which breaks the vision-text feature alignment established during pretraining, and struggles to prevent the language model from forgetting unseen classes. We propose three methods to alleviate these issues. Firstly, a simple scheme is used to augment the text embeddings which prevents overfitting to a small number of classes seen during training, while simultaneously saving memory and computation. Secondly, the feature pyramid network and the detection head are modified to include trainable gated shortcuts, which encourages vision-text feature alignment and guarantees it at the start of detection training. Finally, a self-training approach is used to leverage a larger corpus of image-text pairs, thus improving detection performance on classes with no human-annotated bounding boxes. Our three methods are evaluated on the zero-shot version of the LVIS benchmark, each of them showing clear and significant benefits. Our final network achieves the new state-of-the-art on the mAP-all metric and demonstrates competitive performance for mAP-rare, as well as superior transfer to COCO and Objects365.
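As a rough illustration of the second idea, here is a minimal PyTorch-style sketch (my own, not the authors' code) of a trainable gated shortcut: a zero-initialized gate makes the added branch an identity mapping at the start of detection training, so the pretrained vision-text alignment is preserved initially and only gradually modified.

```python
import torch
import torch.nn as nn

class GatedShortcutBlock(nn.Module):
    """Residual block whose learned branch is gated by a zero-initialized scalar."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        # Gate starts at zero -> output == input at initialization.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + torch.tanh(self.gate) * self.conv(x)
```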
https://arxiv.org/abs/2303.13518
This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self-supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of images). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.0%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images.
https://arxiv.org/abs/2303.13496
Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, to generalize to different data distributions and tasks with unseen semantic forms, and to ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), which are key to disambiguating objects in complex 3D scenes. The modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance in data-efficiency and generalization settings, and demonstrates zero-shot transfer to an unseen 3D question-answering task.
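To make the high-arity idea concrete, below is a toy sketch (hand-written geometry, not NS3D's learned neural modules) of a ternary "between" module that scores each candidate object against two anchor objects — the kind of relation among more than two objects that purely binary relation modules cannot express.

```python
import torch

def between_score(candidates: torch.Tensor, anchor_a: torch.Tensor, anchor_b: torch.Tensor) -> torch.Tensor:
    # candidates: (n, 3) object centers; anchor_a, anchor_b: (3,) anchor centers
    ab = anchor_b - anchor_a
    t = ((candidates - anchor_a) @ ab) / (ab @ ab)          # projection onto the anchor segment
    closest = anchor_a + t.clamp(0, 1).unsqueeze(-1) * ab   # nearest point on the segment
    dist = (candidates - closest).norm(dim=-1)
    return torch.softmax(-dist, dim=0)                      # distribution over candidate objects
```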
https://arxiv.org/abs/2303.13483
The field of vision and language has witnessed a proliferation of pre-trained foundation models. Most existing methods are independently pre-trained with a contrastive objective like CLIP, an image-to-text generative objective like PaLI, or a text-to-image generative objective like Parti. However, the three objectives can be pre-trained on the same data, image-text pairs, and intuitively they complement each other, as contrasting provides global alignment capacity while generation grants fine-grained understanding. In this work, we present a Contrastive Bi-directional Image-Text generation model (CoBIT), which attempts to unify the three pre-training objectives in one framework. Specifically, CoBIT employs a novel unicoder-decoder structure, consisting of an image unicoder, a text unicoder and a cross-modal decoder. The image/text unicoders can switch between encoding and decoding in different tasks, enabling flexibility and shared knowledge that benefits both image-to-text and text-to-image generation. CoBIT achieves superior performance in image understanding, image-text understanding (retrieval, captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios: for instance, 82.7% accuracy in zero-shot ImageNet classification, a 9.37 FID score in zero-shot text-to-image generation, and 44.8 CIDEr in zero-shot captioning.
https://arxiv.org/abs/2303.13455
In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that just by factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that overshoots all prior arts by a large margin (24.8%) - a great testimony to studying the CLIP and ZS-SBIR synergy. Moving onto the fine-grained setup is however trickier, and requires a deeper dive into this synergy. For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure the relative separation between sketches and photos is uniform across categories, which is not the case for the gold-standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains in the region of 26.9% over the previous state-of-the-art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise in tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Code and models will be made available.
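The patch-shuffling idea can be sketched roughly as follows (the grid size and implementation details are assumptions, not the paper's exact recipe): both images of a sketch-photo pair are split into a grid of patches and permuted with the same random order, so matching has to rely on patch-level structural correspondence.

```python
import torch

def paired_patch_shuffle(sketch: torch.Tensor, photo: torch.Tensor, grid: int = 4):
    # sketch, photo: (C, H, W) tensors with H and W divisible by `grid`
    C, H, W = sketch.shape
    ph, pw = H // grid, W // grid

    def to_patches(img):
        return (img.reshape(C, grid, ph, grid, pw)
                   .permute(1, 3, 0, 2, 4)
                   .reshape(grid * grid, C, ph, pw))

    def from_patches(patches):
        return (patches.reshape(grid, grid, C, ph, pw)
                       .permute(2, 0, 3, 1, 4)
                       .reshape(C, H, W))

    perm = torch.randperm(grid * grid)   # one permutation, shared by the pair
    return from_patches(to_patches(sketch)[perm]), from_patches(to_patches(photo)[perm])
```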
https://arxiv.org/abs/2303.13440
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code will be open sourced at: this https URL .
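A minimal sketch of the cross-frame attention modification (my own simplification of the idea, not the released implementation): each frame keeps its own queries, but keys and values are taken from the first frame, which anchors the appearance and identity of the foreground object across frames.

```python
import torch

def cross_frame_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (frames, tokens, dim) attention projections of the frame latents
    k0 = k[:1].expand_as(k)   # keys of frame 0, broadcast to every frame
    v0 = v[:1].expand_as(v)   # values of frame 0
    attn = torch.softmax(q @ k0.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v0
```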
https://arxiv.org/abs/2303.13439
Automated diagnosis prediction from medical images is a valuable resource to support clinical decision-making. However, such systems usually need to be trained on large amounts of annotated data, which often is scarce in the medical domain. Zero-shot methods address this challenge by allowing a flexible adaption to new settings with different clinical findings without relying on labeled data. Further, to integrate automated diagnosis in the clinical workflow, methods should be transparent and explainable, increasing medical professionals' trust and facilitating correctness verification. In this work, we introduce Xplainer, a novel framework for explainable zero-shot diagnosis in the clinical setting. Xplainer adapts the classification-by-description approach of contrastive vision-language models to the multi-label medical diagnosis task. Specifically, instead of directly predicting a diagnosis, we prompt the model to classify the existence of descriptive observations, which a radiologist would look for on an X-Ray scan, and use the descriptor probabilities to estimate the likelihood of a diagnosis. Our model is explainable by design, as the final diagnosis prediction is directly based on the prediction of the underlying descriptors. We evaluate Xplainer on two chest X-ray datasets, CheXpert and ChestX-ray14, and demonstrate its effectiveness in improving the performance and explainability of zero-shot diagnosis. Our results suggest that Xplainer provides a more detailed understanding of the decision-making process and can be a valuable tool for clinical diagnosis.
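A minimal sketch of the classification-by-description step, assuming CLIP-style L2-normalized embeddings (the descriptor prompts, the temperature, and the averaging rule are illustrative assumptions, not Xplainer's exact choices): the diagnosis score is derived from descriptor probabilities rather than predicted directly, which is what makes the prediction explainable.

```python
import torch

def diagnosis_probability(image_emb: torch.Tensor, descriptor_embs: torch.Tensor, temperature: float = 100.0):
    # image_emb: (d,) X-ray embedding; descriptor_embs: (n_descriptors, d) text embeddings
    logits = temperature * descriptor_embs @ image_emb   # (n_descriptors,)
    descriptor_probs = torch.sigmoid(logits)             # P(descriptor present in the scan)
    return descriptor_probs.mean(), descriptor_probs     # diagnosis score + per-descriptor explanation
```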
https://arxiv.org/abs/2303.13391
Label scarcity is a bottleneck for improving task performance in specialised domains. We propose a novel compositional transfer learning framework (DoT5 - domain compositional zero-shot T5) for zero-shot domain transfer. Without access to in-domain labels, DoT5 jointly learns domain knowledge (from MLM of unlabelled in-domain free text) and task knowledge (from task training on more readily available general-domain data) in a multi-task manner. To improve the transferability of task training, we design a strategy named NLGU: we simultaneously train NLG for in-domain label-to-data generation which enables data augmentation for self-finetuning and NLU for label prediction. We evaluate DoT5 on the biomedical domain and the resource-lean subdomain of radiology, focusing on NLI, text summarisation and embedding learning. DoT5 demonstrates the effectiveness of compositional transfer learning through multi-task learning. In particular, DoT5 outperforms the current SOTA in zero-shot transfer by over 7 absolute points in accuracy on RadNLI. We validate DoT5 with ablations and a case study demonstrating its ability to solve challenging NLI examples requiring in-domain expertise.
https://arxiv.org/abs/2303.13386
Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer from long-tail distribution issues: tail predicates are more costly to train and harder to distinguish because they have far fewer annotations than frequent predicates. Existing re-balancing strategies try to handle this via prior rules but are still confined to pre-defined conditions, which are not scalable across various models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthens existing SGG models to tackle the long-tailed problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction.
https://arxiv.org/abs/2303.13233
Few-shot object detection (FSOD) aims to expand an object detector for novel categories given only a few instances for training. The few training samples restrict the performance of FSOD model. Recent text-to-image generation models have shown promising results in generating high-quality images. How applicable these synthetic images are for FSOD tasks remains under-explored. This work extensively studies how synthetic images generated from state-of-the-art text-to-image generators benefit FSOD tasks. We focus on two perspectives: (1) How to use synthetic data for FSOD? (2) How to find representative samples from the large-scale synthetic dataset? We design a copy-paste-based pipeline for using synthetic data. Specifically, saliency object detection is applied to the original generated image, and the minimum enclosing box is used for cropping the main object based on the saliency map. After that, the cropped object is randomly pasted on the image, which comes from the base dataset. We also study the influence of the input text of text-to-image generator and the number of synthetic images used. To construct a representative synthetic training dataset, we maximize the diversity of the selected images via a sample-based and cluster-based method. However, the severe problem of high false positives (FP) ratio of novel categories in FSOD can not be solved by using synthetic data. We propose integrating CLIP, a zero-shot recognition model, into the FSOD pipeline, which can filter 90% of FP by defining a threshold for the similarity score between the detected object and the text of the predicted category. Extensive experiments on PASCAL VOC and MS COCO validate the effectiveness of our method, in which performance gain is up to 21.9% compared to the few-shot baseline.
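The CLIP-based false-positive filter can be sketched as follows (the threshold value and the prompt embeddings are assumptions): a detection for a novel category is kept only when the CLIP similarity between the cropped box and the text of its predicted category passes a threshold.

```python
import torch

def filter_false_positives(box_embs: torch.Tensor, class_text_embs: torch.Tensor,
                           pred_labels: torch.Tensor, threshold: float = 0.25) -> torch.Tensor:
    # box_embs: (n_boxes, d) CLIP image embeddings of cropped detections (L2-normalized)
    # class_text_embs: (n_classes, d) CLIP text embeddings of category prompts (L2-normalized)
    # pred_labels: (n_boxes,) long tensor of predicted class indices
    sims = (box_embs * class_text_embs[pred_labels]).sum(dim=-1)   # cosine similarity per box
    return sims >= threshold                                       # boolean keep mask
```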
https://arxiv.org/abs/2303.13221
In Task Oriented Dialogue (TOD) systems, detecting and inducing new intents are two main challenges in applying the system to the real world. In this paper, we suggest a semantic multi-view model to resolve these two challenges: (1) SBERT for General Embedding (GE), (2) Multi Domain Batch (MDB) for dialogue domain knowledge, and (3) Proxy Gradient Transfer (PGT) for cluster-specialized semantics. MDB feeds diverse dialogue datasets to the model at once to tackle the multi-domain problem by learning multiple domains' knowledge. We introduce a novel method, PGT, which employs a Siamese network to fine-tune the model directly with a clustering method. Using PGT, our model can learn how to cluster dialogue utterances. Experimental results demonstrate that our multi-view model with MDB and PGT significantly improves Open Intent Induction performance compared to baseline systems.
https://arxiv.org/abs/2303.13099
Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representation and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), which is a top-down modulated ViT model that variationally approximates AbS, and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness.
https://arxiv.org/abs/2303.13043
Salient Span Masking (SSM) has shown itself to be an effective strategy to improve closed-book question answering performance. SSM extends general masked language model pretraining by creating additional unsupervised training sentences that mask a single entity or date span, thus oversampling factual information. Despite the success of this paradigm, the span types and sampling strategies are relatively arbitrary and not widely studied for other tasks. Thus, we investigate SSM from the perspective of temporal tasks, where learning a good representation of various temporal expressions is important. To that end, we introduce Temporal Span Masking (TSM) intermediate training. First, we find that SSM alone improves the downstream performance on three temporal tasks by an avg. +5.8 points. Further, we are able to achieve additional improvements (avg. +0.29 points) by adding the TSM task. These comprise the new best reported results on the targeted tasks. Our analysis suggests that the effectiveness of SSM stems from the sentences chosen in the training data rather than the mask choice: sentences with entities frequently also contain temporal expressions. Nonetheless, the additional targeted spans of TSM can still improve performance, especially in a zero-shot context.
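A rough sketch of how a Temporal Span Masking training example might be constructed (the regex is a crude stand-in for a proper temporal tagger, and the T5-style mask token is an assumption): one temporal expression per sentence is replaced by the mask token to build the intermediate training data.

```python
import re

TEMPORAL = re.compile(
    r"\b(\d{4}|\d{1,2}\s+(January|February|March|April|May|June|July|"
    r"August|September|October|November|December)(\s+\d{4})?)\b")

def temporal_span_mask(sentence: str, mask_token: str = "<extra_id_0>"):
    match = TEMPORAL.search(sentence)
    if match is None:
        return None   # no temporal span found -> skip this sentence
    return sentence[:match.start()] + mask_token + sentence[match.end():]

# temporal_span_mask("The treaty was signed on 12 June 1987 in Berlin.")
# -> "The treaty was signed on <extra_id_0> in Berlin."
```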
https://arxiv.org/abs/2303.12860
The task of repository-level code completion is to continue writing the unfinished code based on a broader context of the repository. However, it is difficult for automated code completion tools to utilize the useful information scattered across different files. We propose RepoCoder, a simple, generic, and effective framework to address the challenge. It streamlines the repository-level code completion process by incorporating a similarity-based retriever and a pre-trained code language model, which allows for the effective utilization of repository-level information for code completion and grants the ability to generate code at various levels of granularity. Furthermore, RepoCoder utilizes a novel iterative retrieval-generation paradigm that bridges the gap between the retrieval context and the intended completion target. We also propose a new benchmark, RepoEval, which consists of the latest high-quality real-world repositories covering line, API invocation, and function body completion scenarios. We test the performance of RepoCoder by using various combinations of code retrievers and generators. Experimental results indicate that RepoCoder significantly improves the zero-shot code completion baseline by over 10% in all settings and consistently outperforms the vanilla retrieval-augmented code completion approach. Furthermore, we validate the effectiveness of RepoCoder through comprehensive analysis, providing valuable insights for future research.
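The iterative retrieval-generation loop can be sketched as follows (retrieve and generate are placeholders for any similarity-based retriever and pretrained code LM, and the prompt format is an assumption): the previous iteration's completion is appended to the retrieval query, so the retrieved context moves closer to the intended completion target.

```python
def repo_level_complete(unfinished_code: str, retrieve, generate, iterations: int = 2) -> str:
    # retrieve(query) -> list of similar code snippets from other files in the repository
    # generate(prompt) -> completion string produced by a pretrained code language model
    completion = ""
    for _ in range(iterations):
        query = unfinished_code + completion                 # bridge retrieval context and target
        snippets = retrieve(query)
        prompt = "\n".join(snippets) + "\n" + unfinished_code
        completion = generate(prompt)
    return completion
```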
https://arxiv.org/abs/2303.12570
Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited Text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP^2) to directly learn the transferable 3D point cloud representation in realistic scenarios with a novel proxy alignment mechanism. Specifically, we exploit naturally-existed correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn semantic and instance-level aligned point cloud representation. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, which boosts the state-of-the-art methods by large margins. Furthermore, we provide analyses of the capability of different representations in real scenarios and present the optional ensemble scheme.
https://arxiv.org/abs/2303.12417
The electrocardiogram (ECG) is one of the most commonly used non-invasive, convenient medical monitoring tools that assist in the clinical diagnosis of heart diseases. Recently, deep learning (DL) techniques, particularly self-supervised learning (SSL), have demonstrated great potential in the classification of ECG. SSL pre-training has achieved competitive performance with only a small amount of annotated data after fine-tuning. However, current SSL methods rely on the availability of annotated data and are unable to predict labels not existing in fine-tuning datasets. To address this challenge, we propose Multimodal ECG-Text Self-supervised pre-training (METS), the first work to utilize auto-generated clinical reports to guide ECG SSL pre-training. We use a trainable ECG encoder and a frozen language model to embed paired ECGs and automatically machine-generated clinical reports separately. The SSL objective aims to maximize the similarity between a paired ECG and its auto-generated report while minimizing the similarity between the ECG and other reports. In downstream classification tasks, METS achieves around 10% improvement in performance without using any annotated data via zero-shot classification, compared to other supervised and SSL baselines that rely on annotated data. Furthermore, METS achieves the highest recall and F1 scores on the MIT-BIH dataset, despite MIT-BIH containing different classes of ECG compared to the pre-training dataset. The extensive experiments have demonstrated the advantages of ECG-text multimodal self-supervised learning in terms of generalizability, effectiveness, and efficiency.
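The pre-training objective described above is essentially a paired contrastive loss; a minimal InfoNCE-style sketch (the exact formulation used by METS may differ) looks like this:

```python
import torch
import torch.nn.functional as F

def ecg_report_contrastive_loss(ecg_embs: torch.Tensor, report_embs: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    # ecg_embs, report_embs: (batch, d); row i of each side is a matched ECG-report pair
    ecg_embs = F.normalize(ecg_embs, dim=-1)
    report_embs = F.normalize(report_embs, dim=-1)
    logits = ecg_embs @ report_embs.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(ecg_embs.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)                  # pull diagonal pairs together, push others apart
```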
https://arxiv.org/abs/2303.12311
We present a cascaded diffusion model based on a part-level implicit 3D representation. Our model achieves state-of-the-art generation quality and also enables part-level shape editing and manipulation without any additional training in conditional setup. Diffusion models have demonstrated impressive capabilities in data generation as well as zero-shot completion and editing via a guided reverse process. Recent research on 3D diffusion models has focused on improving their generation capabilities with various data representations, while the absence of structural information has limited their capability in completion and editing tasks. We thus propose our novel diffusion model using a part-level implicit representation. To effectively learn diffusion with high-dimensional embedding vectors of parts, we propose a cascaded framework, learning diffusion first on a low-dimensional subspace encoding extrinsic parameters of parts and then on the other high-dimensional subspace encoding intrinsic attributes. In the experiments, we demonstrate the outperformance of our method compared with the previous ones both in generation and part-level completion and manipulation tasks.
https://arxiv.org/abs/2303.12236
While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data by a single model rather than a generation of one fixed modality conditioned on the other modality. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. Especially, we propose a generative VL transformer based on the non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). In comparison to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding by parallel token predictions in an iterative refinement, and extended editing capabilities such as image and text infilling. For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on the step-unrolled mask prediction and the selective prediction on the mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that our MAGVLT outperforms ARGVLT by a large margin even with significant inference speedup. Particularly, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation tasks from MS-COCO by one moderate-sized model (fewer than 500M parameters) even without the use of monomodal data and networks.
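For intuition, here is a minimal sketch of non-autoregressive mask-predict decoding with iterative refinement (the re-masking schedule is a generic assumption, not MAGVLT's exact one): all positions start masked, every step predicts all tokens in parallel, and the least confident predictions are re-masked for the next step.

```python
import torch

@torch.no_grad()
def mask_predict_decode(model, length: int, mask_id: int, steps: int = 8) -> torch.Tensor:
    # model(tokens) is assumed to return logits of shape (1, length, vocab)
    tokens = torch.full((1, length), mask_id)
    for step in range(steps):
        logits = model(tokens)
        probs, preds = logits.softmax(-1).max(-1)        # per-position confidence and argmax token
        tokens = preds
        n_mask = int(length * (1 - (step + 1) / steps))  # fewer masked positions each iteration
        if n_mask > 0:
            lowest = probs.topk(n_mask, largest=False).indices
            tokens[0, lowest[0]] = mask_id               # re-mask the least confident predictions
    return tokens
```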
https://arxiv.org/abs/2303.12208
Backdoor attacks inject poisoned data into the training set, resulting in misclassification of the poisoned samples during model inference. Defending against such attacks is challenging, especially in real-world black-box settings where only model predictions are available. In this paper, we propose a novel backdoor defense framework that can effectively defend against various attacks through zero-shot image purification (ZIP). Our proposed framework can be applied to black-box models without requiring any internal information about the poisoned model or any prior knowledge of the clean/poisoned samples. Our defense framework involves a two-step process. First, we apply a linear transformation on the poisoned image to destroy the trigger pattern. Then, we use a pre-trained diffusion model to recover the missing semantic information removed by the transformation. In particular, we design a new reverse process using the transformed image to guide the generation of high-fidelity purified images, which can be applied in zero-shot settings. We evaluate our ZIP backdoor defense framework on multiple datasets with different kinds of attacks. Experimental results demonstrate the superiority of our ZIP framework compared to state-of-the-art backdoor defense baselines. We believe that our results will provide valuable insights for future defense methods for black-box models.
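A minimal sketch of the two-step purification (the specific linear transform and the diffusion call are placeholders, not the paper's exact design): a linear degradation destroys the trigger pattern, and a pretrained diffusion model's guided reverse process restores the semantics that the degradation removed.

```python
import torch.nn.functional as F

def zero_shot_purify(image, diffusion_restore, scale: float = 0.25):
    # image: (1, C, H, W) possibly poisoned input
    # diffusion_restore: callable running the guided reverse process of a pretrained diffusion model
    _, _, h, w = image.shape
    degraded = F.interpolate(image, scale_factor=scale, mode="bilinear")   # linear transform breaks the trigger
    degraded = F.interpolate(degraded, size=(h, w), mode="bilinear")       # back to original resolution
    return diffusion_restore(degraded)                                     # recover semantic content
```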
https://arxiv.org/abs/2303.12175
The large-scale vision-language models (e.g., CLIP) are leveraged by different methods to detect unseen objects. However, most of these works require additional captions or images for training, which is not feasible in the context of zero-shot detection. In contrast, the distillation-based method is an extra-data-free method, but it has its limitations. Specifically, existing work creates distillation regions that are biased to the base categories, which limits the distillation of novel category information and harms the distillation efficiency. Furthermore, directly using the raw feature from CLIP for distillation neglects the domain gap between the training data of CLIP and the detection datasets, which makes it difficult to learn the mapping from the image region to the vision-language feature space - an essential component for detecting unseen objects. As a result, existing distillation-based methods require an excessively long training schedule. To solve these problems, we propose Efficient feature distillation for Zero-Shot Detection (EZSD). Firstly, EZSD adapts the CLIP's feature space to the target detection domain by re-normalizing CLIP to bridge the domain gap; Secondly, EZSD uses CLIP to generate distillation proposals with potential novel instances, to avoid the distillation being overly biased to the base categories. Finally, EZSD takes advantage of semantic meaning for regression to further improve the model performance. As a result, EZSD achieves state-of-the-art performance in the COCO zero-shot benchmark with a much shorter training schedule and outperforms previous work by 4% in LVIS overall setting with 1/10 training time.
https://arxiv.org/abs/2303.12145