LLM as Dataset Analyst: Subpopulation Structure Discovery with Large Language Model

Abstract

The distribution of subpopulations is an important property hidden within a dataset. Uncovering and analyzing this distribution gives a comprehensive understanding of the dataset and serves as a powerful tool for various downstream tasks, including Dataset Subpopulation Organization, Subpopulation Shift, and Slice Discovery. Despite its importance, to our knowledge no prior work has systematically explored the subpopulation distribution of datasets. To address this limitation and solve all the mentioned tasks in a unified way, we introduce the novel concept of subpopulation structures to represent, analyze, and utilize subpopulation distributions within datasets. To characterize the structures in an interpretable manner, we propose the Subpopulation Structure Discovery with Large Language Models (SSD-LLM) framework, which employs the world knowledge and instruction-following capabilities of Large Language Models (LLMs) to linguistically analyze informative image captions and summarize the structures. Furthermore, we propose complete workflows for downstream tasks, named Task-specific Tuning, showcasing the application of the discovered structure to a spectrum of subpopulation-related tasks, including dataset subpopulation organization, subpopulation shift, and slice discovery.

URL

https://arxiv.org/abs/2405.02363

PDF

https://arxiv.org/pdf/2405.02363.pdf
