Abstract
[Abridged Abstract] Recent technological advances underscore the dynamics of the labor market, with significant consequences for employment prospects and a growing volume of job vacancy data across platforms and languages. Aggregating such data could yield valuable insights into labor market demands and the emergence of new skills, and could facilitate job matching for various stakeholders. However, although such insights are prevalent in the private sector, transparent language technology systems and data for this domain are lacking. This thesis investigates Natural Language Processing (NLP) technology for extracting relevant information from job descriptions, and identifies three challenges: the scarcity of training data, the lack of standardized annotation guidelines, and the shortage of effective methods for extracting information from job ads. We frame the problem, obtain annotated data, and introduce extraction methodologies. Our contributions include job description datasets, a de-identification dataset, and a novel active learning algorithm for efficient model training. We propose skill extraction using weak supervision, a taxonomy-aware pre-training methodology that adapts multilingual language models to the job market domain, and a retrieval-augmented model that leverages multiple skill extraction datasets to enhance overall performance. Finally, we ground the extracted information within a designated taxonomy.
URL
https://arxiv.org/abs/2404.18977