Contextual Categorization Enhancement through LLMs Latent-Space

2024-04-25 09:20:51
Zineddine Bettouche, Anas Safi, Andreas Fischer

Abstract

Managing the semantic quality of the categorization in large textual datasets, such as Wikipedia, presents significant challenges in terms of complexity and cost. In this paper, we propose leveraging transformer models to distill semantic information from texts in the Wikipedia dataset and its associated categories into a latent space. We then explore different approaches based on these encodings to assess and enhance the semantic identity of the categories. Our graphical approach is powered by Convex Hull, while we utilize Hierarchical Navigable Small Worlds (HNSWs) for the hierarchical approach. As a solution to the information loss caused by the dimensionality reduction, we modulate the following mathematical solution: an exponential decay function driven by the Euclidean distances between the high-dimensional encodings of the textual categories. This function represents a filter built around a contextual category and retrieves items with a certain Reconsideration Probability (RP). Retrieving high-RP items serves as a tool for database administrators to improve data groupings by providing recommendations and identifying outliers within a contextual framework.
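The distance-driven filter described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formula, so everything here is an assumption made for illustration: the filter is centered on the mean of the category's encodings, the score is exp(-rate * d) with d the Euclidean distance to that center, and the names `centroid`, `decay_score`, `retrieve`, `rate`, and `threshold` are hypothetical. How this score maps onto the paper's Reconsideration Probability is likewise not specified in the abstract.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def decay_score(item, center, rate=1.0):
    """Assumed form: exp(-rate * Euclidean distance). Equals 1.0 at the center."""
    return math.exp(-rate * math.dist(item, center))

def retrieve(candidates, center, threshold, rate=1.0):
    """Return (name, score) pairs with score >= threshold, highest score first."""
    scored = [(name, decay_score(vec, center, rate)) for name, vec in candidates.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional vectors standing in for high-dimensional transformer encodings.
category_members = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [2.0, 2.0, 0.0]]
center = centroid(category_members)

# Hypothetical items to score against the category.
candidates = {"near": [1.0, 1.0, 1.0], "far": [1.0, 1.0, 6.0]}
hits = retrieve(candidates, center, threshold=0.1)
```

Because the distances are computed on the full encodings rather than on a reduced projection, a filter of this kind avoids the information loss the abstract attributes to dimensionality reduction. The `rate` parameter controls how quickly the score falls off with distance, and `threshold` sets how selective the retrieval is.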

URL

https://arxiv.org/abs/2404.16442

PDF

https://arxiv.org/pdf/2404.16442.pdf