Data-driven soft sensors (DDSS) have become mainstream methods for predicting key performance indicators in process industries. However, DDSS development requires complex, costly designs customized to each task in the modeling process. Moreover, DDSS are constrained to a single structured data modality, limiting their ability to incorporate additional contextual knowledge. Furthermore, the limited representation learning capacity of DDSS leads to weak predictive performance when data are scarce. To address these challenges, we propose a general framework named LLM-TKESS (large language model for text-based knowledge-embedded soft sensing), harnessing the powerful general problem-solving, cross-modal knowledge transfer, and few-shot capabilities of LLMs for enhanced soft sensing modeling. Specifically, an auxiliary variable series encoder (AVS Encoder) is proposed to unleash the LLM's potential for capturing temporal relationships within series and spatial semantic relationships among auxiliary variables. We then propose a two-stage fine-tuning alignment strategy: in the first stage, parameter-efficient fine-tuning with an autoregressive training objective rapidly adapts the LLM to process-variable data, yielding a soft sensing foundation model (SSFM); subsequently, by training adapters, we adapt the SSFM to various downstream tasks without modifying its architecture. Next, we propose two text-based knowledge-embedded soft sensors that integrate natural language modalities to overcome the limitations of purely structured data models. Furthermore, benefiting from the LLM's pre-existing world knowledge, our model demonstrates outstanding predictive capability under small-sample conditions. Using the thermal deformation of an air preheater rotor as a case study, we validate through extensive experiments that LLM-TKESS exhibits outstanding performance.
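To make the two-stage strategy concrete, here is a minimal PyTorch sketch of the general pattern the abstract describes: stage one trains low-rank updates on a frozen backbone with an autoregressive objective, and stage two freezes the result and trains a small residual adapter per downstream task. All module names, ranks, and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the two-stage adaptation idea; names and shapes are
# illustrative, not LLM-TKESS's actual code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update (stage 1)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # LLM backbone stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T

class TaskAdapter(nn.Module):
    """Small bottleneck adapter trained in stage 2 on top of the frozen SSFM."""
    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))  # residual adapter
```

In stage one only A and B receive gradients under a next-step prediction loss on process-variable series; in stage two those are frozen as well and only the TaskAdapter (plus a prediction head) trains, so the SSFM backbone never changes.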
https://arxiv.org/abs/2501.05075
Existing vehicle trajectory prediction models struggle with generalizability, prediction uncertainty, and complex interactions, often due to limitations such as architectures customized to a specific dataset and inefficient multimodal handling. We propose Perceiver with Register queries (PerReg+), a novel trajectory prediction framework that introduces: (1) Dual-Level Representation Learning via Self-Distillation (SD) and Masked Reconstruction (MR), capturing global context and fine-grained details; additionally, reconstructing segment-level trajectories and lane segments from masked inputs with query drop enables effective use of contextual information and improves generalization; (2) Enhanced Multimodality using register-based queries and pretraining, eliminating the need for clustering and suppression; and (3) Adaptive Prompt Tuning during fine-tuning, freezing the main architecture and optimizing a small number of prompts for efficient adaptation. PerReg+ sets a new state of the art on nuScenes [1], Argoverse 2 [2], and Waymo Open Motion Dataset (WOMD) [3]. Remarkably, our pretrained model reduces error by 6.8% on smaller datasets, and multi-dataset training enhances generalization. In cross-domain tests, PerReg+ reduces B-FDE by 11.8% compared to its non-pretrained variant.
https://arxiv.org/abs/2501.04815
Over the last decade, representation learning, which embeds complex information extracted from large amounts of data into dense vector spaces, has emerged as a key technique in machine learning. Among other applications, it has been a key building block for large language models and advanced computer vision systems based on contrastive learning. A core component of representation learning systems is the projection head, which maps the original embeddings into different, often compressed spaces, while preserving the similarity relationship between vectors. In this paper, we propose a quantum-inspired projection head that includes a corresponding quantum-inspired similarity metric. Specifically, we map classical embeddings onto quantum states in Hilbert space and introduce a quantum circuit-based projection head to reduce embedding dimensionality. To evaluate the effectiveness of this approach, we extended the BERT language model by integrating our projection head for embedding compression. We compared embeddings compressed with our quantum-inspired projection head against those compressed with a classical projection head on information retrieval tasks using the TREC 2019 and TREC 2020 Deep Learning benchmarks. The results demonstrate that our quantum-inspired method achieves competitive performance relative to the classical method while utilizing 32 times fewer parameters. Furthermore, when trained from scratch, it excels notably on smaller datasets. This work not only highlights the effectiveness of the quantum-inspired approach but also emphasizes the utility of efficient, ad hoc low-entanglement circuit simulations within neural networks as a powerful quantum-inspired technique.
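As a rough illustration of the idea (not the paper's actual circuit), the sketch below amplitude-encodes a 2^n-dimensional embedding as an n-qubit state, applies one parameterized Ry rotation per qubit (a low-entanglement, product circuit), and reads out the measurement probabilities of the first m qubits as the compressed embedding. The values of n and m and the gate choice are assumptions.

```python
# Illustrative low-entanglement "circuit as projection head"; the paper's
# actual circuit and similarity metric may differ.
import torch
import torch.nn as nn

class QuantumInspiredProjection(nn.Module):
    def __init__(self, n_qubits: int = 8, out_qubits: int = 5):
        super().__init__()                   # 2**8 = 256-dim in, 2**5 = 32-dim out
        self.n, self.m = n_qubits, out_qubits
        self.theta = nn.Parameter(torch.zeros(n_qubits))  # one Ry angle per qubit

    def forward(self, x):                        # x: (batch, 2**n)
        psi = x / x.norm(dim=-1, keepdim=True)   # amplitude encoding
        for k in range(self.n):                  # apply Ry(theta_k) on qubit k
            c, s = torch.cos(self.theta[k] / 2), torch.sin(self.theta[k] / 2)
            psi = psi.reshape(-1, 2 ** k, 2, 2 ** (self.n - k - 1))
            a, b = psi[:, :, 0, :], psi[:, :, 1, :]
            psi = torch.stack((c * a - s * b, s * a + c * b), dim=2)
        psi = psi.reshape(-1, 2 ** self.m, 2 ** (self.n - self.m))
        return (psi ** 2).sum(dim=-1)            # probabilities of first m qubits
```

With one angle per qubit the circuit has only n trainable parameters, the kind of drastic reduction the abstract reports; an overlap between two output distributions could then serve as the quantum-inspired similarity.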
https://arxiv.org/abs/2501.04591
Recent advancements in vision foundation models (VFMs) have revolutionized visual perception in 2D, yet their potential for 3D scene understanding, particularly in autonomous driving applications, remains underexplored. In this paper, we introduce LargeAD, a versatile and scalable framework designed for large-scale 3D pretraining across diverse real-world driving datasets. Our framework leverages VFMs to extract semantically rich superpixels from 2D images, which are aligned with LiDAR point clouds to generate high-quality contrastive samples. This alignment facilitates cross-modal representation learning, enhancing the semantic consistency between 2D and 3D data. We introduce several key innovations: i) VFM-driven superpixel generation for detailed semantic representation, ii) a VFM-assisted contrastive learning strategy to align multimodal features, iii) superpoint temporal consistency to maintain stable representations across time, and iv) multi-source data pretraining to generalize across various LiDAR configurations. Our approach delivers significant performance improvements over state-of-the-art methods in both linear probing and fine-tuning for LiDAR-based segmentation and object detection. Extensive experiments on eleven large-scale multimodal datasets highlight our superior performance, demonstrating the adaptability, efficiency, and robustness of our approach in real-world autonomous driving scenarios.
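A minimal sketch of the superpixel-level 2D-to-3D contrastive alignment described above: point features are mean-pooled per superpixel and matched to the corresponding VFM superpixel feature with an InfoNCE loss. Function names, shapes, and the temperature are assumptions for illustration.

```python
# Hypothetical superpixel-to-point contrastive alignment, not LargeAD's code.
import torch
import torch.nn.functional as F

def superpixel_contrast(point_feats, sp_ids, img_sp_feats, tau=0.07):
    # point_feats: (N, C) 3D features; sp_ids: (N,) superpixel index per point
    # img_sp_feats: (S, C) VFM feature per 2D superpixel
    S, C = img_sp_feats.shape
    pooled = torch.zeros(S, C).index_add_(0, sp_ids, point_feats)
    counts = torch.zeros(S).index_add_(0, sp_ids, torch.ones(len(sp_ids)))
    pooled = pooled / counts.clamp(min=1).unsqueeze(-1)      # mean per superpixel
    logits = F.normalize(pooled, dim=-1) @ F.normalize(img_sp_feats, dim=-1).T / tau
    return F.cross_entropy(logits, torch.arange(S))          # positives on diagonal
```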
https://arxiv.org/abs/2501.04005
LiDAR data pretraining offers a promising approach to leveraging large-scale, readily available datasets for enhanced data utilization. However, existing methods predominantly focus on sparse voxel representation, overlooking the complementary attributes provided by other LiDAR representations. In this work, we propose LiMoE, a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning to synergistically combine multiple representations, such as range images, sparse voxels, and raw points. Our approach consists of three stages: i) Image-to-LiDAR Pretraining, which transfers prior knowledge from images to point clouds across different representations; ii) Contrastive Mixture Learning (CML), which uses MoE to adaptively activate relevant attributes from each representation and distills these mixed features into a unified 3D network; iii) Semantic Mixture Supervision (SMS), which combines semantic logits from multiple representations to boost downstream segmentation performance. Extensive experiments across 11 large-scale LiDAR datasets demonstrate the effectiveness and superiority of our approach. The code and model checkpoints have been made publicly accessible.
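The CML stage lends itself to a compact sketch: a gating network weighs per-point features coming from the three representations, and the weighted mixture is what gets distilled into the unified 3D network. This is an illustrative reading of the abstract, with hypothetical names and shapes.

```python
# Minimal MoE-style fusion over three LiDAR representations (range image,
# sparse voxel, raw point features); illustrative only.
import torch
import torch.nn as nn

class RepresentationMoE(nn.Module):
    def __init__(self, dim: int, n_experts: int = 3):
        super().__init__()
        self.gate = nn.Linear(dim * n_experts, n_experts)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_experts)])

    def forward(self, feats):                # feats: list of (N, dim), one per representation
        w = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)  # (N, E) gates
        out = [proj(f) for proj, f in zip(self.experts, feats)]
        return sum(w[:, i:i + 1] * out[i] for i in range(len(out)))     # weighted mix
```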
https://arxiv.org/abs/2501.04004
This paper introduces SEMISE, a novel method for representation learning in medical imaging that combines self-supervised and supervised learning. By leveraging both labeled and augmented data, SEMISE addresses the challenge of data scarcity and enhances the encoder's ability to extract meaningful features. This integrated approach leads to more informative representations, improving performance on downstream tasks. As a result, our approach achieved a 12% improvement in classification and a 3% improvement in segmentation, outperforming existing methods. These results demonstrate the potential of SEMISE to advance medical image analysis and offer more accurate solutions for healthcare applications, particularly in contexts where labeled data is limited.
https://arxiv.org/abs/2501.03848
Despite the significant improvements achieved by large language models (LLMs) in English reasoning tasks, these models continue to struggle with multilingual reasoning. Recent studies leverage a full-parameter, two-stage training paradigm to teach models to first understand non-English questions and then reason. However, this method suffers from both substantial computational cost and catastrophic forgetting. The fundamental cause is that, with the primary goal of enhancing multilingual comprehension, an excessive number of irrelevant layers and parameters are tuned during the first stage. Given our finding that the representation learning of languages is largely confined to the lower layers, we propose an efficient multilingual reasoning alignment approach that precisely identifies and fine-tunes the layers responsible for handling multilingualism. Experimental results show that our method, SLAM, tunes only the feed-forward sub-layers of 6 layers, comprising 6.5-8% of all parameters within 7B and 13B LLMs, achieving superior average performance over all strong baselines across 10 languages. Meanwhile, SLAM involves only one training stage, reducing training time by a factor of 4.1-11.9 compared to the two-stage method.
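The core recipe is simple enough to sketch: freeze the whole model, then re-enable gradients only for the feed-forward (MLP) sub-layers of the lowest six transformer blocks. The attribute path below assumes a LLaMA-style module layout (model.model.layers[i].mlp), which may differ from SLAM's actual code.

```python
# Hedged sketch of selective FFN tuning; module names are an assumption.
def tune_lower_ffn_only(model, n_layers: int = 6):
    for p in model.parameters():
        p.requires_grad = False                   # freeze everything
    for layer in model.model.layers[:n_layers]:   # lowest layers handle multilinguality
        for p in layer.mlp.parameters():          # feed-forward sub-layer only
            p.requires_grad = True
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {100 * trainable / total:.1f}% of parameters")
```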
https://arxiv.org/abs/2501.03681
Contrastive learning has gained significant attention in short text clustering, yet it has an inherent drawback: samples from the same category may be mistakenly identified as negatives and separated in the feature space (false negative separation), which hinders the generation of superior representations. To generate more discriminative representations for efficient clustering, we propose a novel short text clustering method called Discriminative Representation Learning via Attention-Enhanced Contrastive Learning for Short Text Clustering (AECL). AECL consists of two modules: a pseudo-label generation module and a contrastive learning module. Both modules build a sample-level attention mechanism to capture similarity relationships between samples and aggregate cross-sample features to generate consistent representations. The former module uses the more discriminative consistent representations to produce reliable supervision information that assists clustering, while the latter explores similarity relationships and consistent representations to optimize the construction of positive samples and perform similarity-guided contrastive learning, effectively addressing the false negative separation issue. Experimental results demonstrate that the proposed AECL outperforms state-of-the-art methods. If the paper is accepted, we will open-source the code.
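The sample-level attention both modules share can be sketched as ordinary attention computed across the batch rather than across tokens, so each text's representation aggregates features from similar samples. A toy version, with hypothetical dimensions:

```python
# Illustrative sample-level attention over a batch of text embeddings.
import torch
import torch.nn as nn

class SampleLevelAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, z):                        # z: (batch, dim) text embeddings
        attn = torch.softmax(self.q(z) @ self.k(z).T / z.shape[-1] ** 0.5, dim=-1)
        return attn @ self.v(z)                  # aggregate cross-sample features
```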
https://arxiv.org/abs/2501.03584
Tabular data remains one of the most prevalent data types across a wide range of real-world applications, yet effective representation learning for this domain poses unique challenges due to its irregular patterns, heterogeneous feature distributions, and complex inter-column dependencies. This survey provides a comprehensive review of state-of-the-art techniques in tabular data representation learning, structured around three foundational design elements: training data, neural architectures, and learning objectives. Unlike prior surveys that focus primarily on either architecture design or learning strategies, we adopt a holistic perspective that emphasizes the universality and robustness of representation learning methods across diverse downstream tasks. We examine recent advances in data augmentation and generation, specialized neural network architectures tailored to tabular data, and innovative learning objectives that enhance representation quality. Additionally, we highlight the growing influence of self-supervised learning and the adaptation of transformer-based foundation models for tabular data. Our review is based on a systematic literature search using rigorous inclusion criteria, encompassing 127 papers published since 2020 in top-tier conferences and journals. Through detailed analysis and comparison, we identify emerging trends, critical gaps, and promising directions for future research, aiming to guide the development of more generalizable and effective tabular data representation methods.
https://arxiv.org/abs/2501.03540
Self-supervised learning (SSL) has significantly advanced image representation learning, yet efficiency challenges persist, particularly with adversarial training. Many SSL methods require extensive epochs to achieve convergence, a demand further amplified in adversarial settings. To address this inefficiency, we revisit the robust EMP-SSL framework, emphasizing the importance of increasing the number of crops per image to accelerate learning. Unlike traditional contrastive learning, robust EMP-SSL leverages multi-crop sampling, integrates an invariance term and regularization, and reduces training epochs, enhancing time efficiency. Evaluated with both standard linear classifiers and multi-patch embedding aggregation, robust EMP-SSL provides new insights into SSL evaluation strategies. Our results show that robust crop-based EMP-SSL not only accelerates convergence but also achieves a superior balance between clean accuracy and adversarial robustness, outperforming multi-crop embedding aggregation. Additionally, we extend this approach with free adversarial training in Multi-Crop SSL, introducing the Cost-Free Adversarial Multi-Crop Self-Supervised Learning (CF-AMC-SSL) method. CF-AMC-SSL demonstrates the effectiveness of free adversarial training in reducing training time while simultaneously improving clean accuracy and adversarial robustness. These findings underscore the potential of CF-AMC-SSL for practical SSL applications. Our code is publicly available at this https URL.
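For context, "free" adversarial training (Shafahi et al., 2019) reuses each backward pass to update both the network weights and the input perturbation, so robustness costs no extra epochs. A generic sketch follows, with the SSL loss and model left as placeholders and a fixed batch size assumed; it is not CF-AMC-SSL's exact procedure.

```python
# Generic free adversarial training loop; loss_fn stands in for the
# multi-crop SSL objective.
import torch

def free_adv_train(model, loss_fn, loader, optimizer, m=4, eps=8 / 255):
    delta = None
    for crops in loader:                          # assumes a fixed batch size
        if delta is None:
            delta = torch.zeros_like(crops, requires_grad=True)
        for _ in range(m):                        # m "free" replays per minibatch
            loss = loss_fn(model((crops + delta).clamp(0, 1)))
            optimizer.zero_grad()
            loss.backward()                       # one backward yields grads for
            optimizer.step()                      # ...the weights (updated here)
            with torch.no_grad():                 # ...and for delta (updated here)
                delta += eps * delta.grad.sign()
                delta.clamp_(-eps, eps)
            delta.grad.zero_()
```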
https://arxiv.org/abs/2501.03507
Self-supervised learning (SSL) has emerged as a crucial technique in image processing, encoding, and understanding, especially for developing today's vision foundation models that utilize large-scale datasets without annotations to enhance various downstream tasks. This study introduces a novel SSL approach, Information-Maximized Soft Variable Discretization (IMSVD), for image representation learning. Specifically, IMSVD softly discretizes each variable in the latent space, enabling the estimation of their probability distributions over training batches and allowing the learning process to be directly guided by information measures. Motivated by the MultiView assumption, we propose an information-theoretic objective function to learn transform-invariant, non-trivial, and redundancy-minimized representation features. We then derive a joint-cross entropy loss function for self-supervised image representation learning, which theoretically enjoys superiority over the existing methods in reducing feature redundancy. Notably, our non-contrastive IMSVD method statistically performs contrastive learning. Extensive experimental results demonstrate the effectiveness of IMSVD on various downstream tasks in terms of both accuracy and efficiency. Thanks to our variable discretization, the embedding features optimized by IMSVD offer unique explainability at the variable level. IMSVD has the potential to be adapted to other learning paradigms. Our code is publicly available at this https URL.
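The discretization step itself is easy to sketch: each latent coordinate gets a softmax assignment over K bin centers, batch-averaging those assignments yields per-variable distributions, and entropy terms on them can guide training. The bin parameterization and terms below are assumptions; IMSVD's joint-cross entropy objective is richer.

```python
# Illustrative soft variable discretization with batch-level statistics.
import torch

def soft_discretize(z, centers, temperature=0.1):
    # z: (batch, D) latents; centers: (K,) shared bin centers
    d = (z.unsqueeze(-1) - centers) ** 2            # (batch, D, K) squared distances
    q = torch.softmax(-d / temperature, dim=-1)     # soft bin assignments
    p = q.mean(dim=0)                               # (D, K) per-variable distribution
    marginal_entropy = -(p * (p + 1e-8).log()).sum(-1).mean()   # spread bins out
    assign_entropy = -(q * (q + 1e-8).log()).sum(-1).mean()     # keep assignments crisp
    return q, marginal_entropy, assign_entropy
```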
https://arxiv.org/abs/2501.03469
Foundation models, particularly those that incorporate Transformer architectures, have demonstrated exceptional performance in domains such as natural language processing and image processing. Adapting these models to structured data, like tables, however, introduces significant challenges. These difficulties are even more pronounced when addressing multi-table data linked via foreign keys, which is prevalent in the enterprise realm and crucial for empowering business use cases. Despite its substantial impact, research on such linked business tables within enterprise settings remains an important yet underexplored domain. To address this, we introduce a curated dataset sourced from an Enterprise Resource Planning (ERP) system, featuring extensive linked tables. This dataset is specifically designed to support research endeavors in table representation learning. By providing access to authentic enterprise data, our goal is to enhance the effectiveness and applicability of models in real-world business contexts.
https://arxiv.org/abs/2501.03413
This paper explores Masked Autoencoders (MAE) with Gaussian Splatting. While reconstructive self-supervised learning frameworks such as MAE learn good semantic abstractions, they are not trained for explicit spatial awareness. Our approach, named Gaussian Masked Autoencoder, or GMAE, aims to learn semantic abstractions and spatial understanding jointly. Like MAE, it reconstructs the image end-to-end in the pixel space, but beyond MAE, it also introduces an intermediate, 3D Gaussian-based representation and renders images via splatting. We show that GMAE can enable various zero-shot learning capabilities of spatial understanding (e.g., figure-ground segmentation, image layering, edge detection, etc.) while preserving the high-level semantics of self-supervised representation quality from MAE. To our knowledge, we are the first to employ Gaussian primitives in an image representation learning framework beyond optimization-based single-scene reconstructions. We believe GMAE will inspire further research in this direction and contribute to developing next-generation techniques for modeling high-fidelity visual data. More details at this https URL
https://arxiv.org/abs/2501.03229
Recent self-supervised learning (SSL) models trained on human-like egocentric visual inputs substantially underperform on image recognition tasks compared to humans. These models train on raw, uniform visual inputs collected from head-mounted cameras. This differs from humans: the anatomical structure of the retina and visual cortex preferentially amplifies central visual information, i.e., the area around the gaze location. This selective amplification likely aids in forming object-centered visual representations. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. We simulate 5 months of egocentric visual experience using the large-scale Ego4D dataset and generate gaze locations with a human gaze prediction model. To account for the importance of central vision in humans, we crop the visual area around the gaze location. Finally, we train a time-based SSL model on these modified inputs. Our experiments demonstrate that focusing on central vision leads to better object-centered representations. Our analysis shows that the SSL model leverages the temporal dynamics of the gaze movements to build stronger visual representations. Overall, our work marks a significant step toward bio-inspired learning of visual representations.
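The preprocessing at the heart of the study reduces to a crop around the predicted gaze point before the frames reach the time-based SSL model. A minimal version, where the window size and the border policy are assumptions:

```python
# Crop a fixed window around the gaze location; assumes the frame is larger
# than the window.
def gaze_crop(frame, gaze_xy, size=224):
    # frame: (C, H, W) tensor; gaze_xy: (x, y) pixel coordinates of gaze
    _, h, w = frame.shape
    half = size // 2
    x = min(max(int(gaze_xy[0]), half), w - half)  # keep window inside frame
    y = min(max(int(gaze_xy[1]), half), h - half)
    return frame[:, y - half:y + half, x - half:x + half]
```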
https://arxiv.org/abs/2501.02966
Recent successful self-supervised learning (SSL) methods model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods, and show that it outperforms previous methods on several datasets, including ImageNet-1K, where it achieves 71.5% Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful principle of unsupervised category learning.
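The central alignment term can be sketched directly: take the encoder's feature map before pooling, pool it into the global representation, and pull every local vector toward that global vector. This is only the alignment component under assumed shapes; the full CO-SSL family wraps it in instance discrimination.

```python
# Illustrative local-to-global alignment loss, not CO-SSL's full objective.
import torch.nn.functional as F

def local_global_alignment(feature_map):
    # feature_map: (B, C, H, W) from the encoder, before global pooling
    b, c, h, w = feature_map.shape
    local = feature_map.flatten(2).transpose(1, 2)        # (B, H*W, C) local vectors
    global_ = feature_map.mean(dim=(2, 3)).unsqueeze(1)   # (B, 1, C) pooled global
    return 1 - F.cosine_similarity(local, global_, dim=-1).mean()
```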
https://arxiv.org/abs/2501.02860
Accurate imputation of missing laboratory values in electronic health records (EHRs) is critical to enable robust clinical predictions and reduce biases in AI systems in healthcare. Existing methods, such as variational autoencoders (VAEs) and decision-tree-based approaches like XGBoost, struggle to model the complex temporal and contextual dependencies in EHR data, particularly for underrepresented groups. In this work, we propose Lab-MAE, a novel transformer-based masked autoencoder framework that leverages self-supervised learning for the imputation of continuous sequential lab values. Lab-MAE introduces a structured encoding scheme that jointly models laboratory test values and their corresponding timestamps, enabling explicit capture of temporal dependencies. Empirical evaluation on the MIMIC-IV dataset demonstrates that Lab-MAE significantly outperforms state-of-the-art baselines such as XGBoost across multiple metrics, including root mean square error (RMSE), R-squared (R2), and Wasserstein distance (WD). Notably, Lab-MAE achieves equitable performance across demographic groups of patients, advancing fairness in clinical predictions. We further investigate the role of follow-up laboratory values as potential shortcut features, revealing Lab-MAE's robustness in scenarios where such data is unavailable. The findings suggest that our transformer-based architecture, adapted to the characteristics of EHR data, offers a foundation for more accurate and fair clinical imputation models. In addition, we measure and compare the carbon footprint of Lab-MAE with the baseline XGBoost model, highlighting its environmental requirements.
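The structured encoding scheme can be illustrated as one embedding per lab token that sums a value projection and a timestamp projection, with masked values swapped for a learned mask token while their timestamps stay visible. Dimensions and names below are assumptions, not Lab-MAE's actual modules.

```python
# Hypothetical joint value-and-time token encoder for masked reconstruction.
import torch
import torch.nn as nn

class LabTokenEncoder(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.value_proj = nn.Linear(1, dim)   # continuous lab value
        self.time_proj = nn.Linear(1, dim)    # normalized timestamp (e.g., hours)
        self.mask_token = nn.Parameter(torch.zeros(dim))

    def forward(self, values, times, mask):
        # values, times: (B, T, 1); mask: (B, T) True where the value is masked
        val = self.value_proj(values)
        val = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(val), val)
        return val + self.time_proj(times)    # timestamp stays visible when masked
```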
https://arxiv.org/abs/2501.02648
Humans rely on high-level meta-representations to engage in abstract reasoning. In complex cognitive tasks, these meta-representations help individuals abstract general rules from experience. However, constructing such meta-representations from high-dimensional observations remains a longstanding challenge for reinforcement learning agents. For instance, a well-trained agent often fails to generalize to even minor variations of the same task, such as changes in background color, which humans handle with ease. In this paper, we build a bridge between meta-representation and generalization, showing that generalization performance benefits from meta-representation learning. We also hypothesize that deep mutual learning (DML) among agents can help them converge to meta-representations. Empirical results provide support for our theory and hypothesis. Overall, this work provides a new perspective on the generalization of deep reinforcement learning.
https://arxiv.org/abs/2501.02481
Remote sensing data is often distributed across multiple institutions, and due to privacy concerns and data-sharing restrictions, leveraging large-scale datasets in a centralized training framework is challenging. Federated learning offers a promising solution by enabling collaborative model training across distributed data sources without requiring data centralization. However, current Vision-Language Models (VLMs), which typically contain billions of parameters, pose significant communication challenges for traditional federated learning approaches based on model parameter updates, as they would incur substantial communication costs. In this paper, we propose FedRSCLIP, the first federated learning framework designed for remote sensing image classification based on a VLM, specifically CLIP. FedRSCLIP addresses the challenges of data heterogeneity and large-scale model transmission in federated environments by introducing Prompt Learning, which optimizes only a small set of tunable parameters. The framework introduces a dual-prompt mechanism, comprising Shared Prompts for global knowledge sharing and Private Prompts for client-specific adaptation. To maintain semantic coherence between shared and private prompts, we propose the Dual Prompt Alignment Constraint to balance global consistency and local adaptability across diverse client distributions. Additionally, to enhance cross-modal representation learning, we introduce the Cross-Modal Feature Alignment Constraint to align multimodal features between text and image prompts. To validate the effectiveness of our proposed model, we construct a Fed-RSIC dataset based on three existing remote sensing image classification datasets, specifically designed to simulate various federated learning configurations. Experimental results demonstrate the effectiveness and superiority of FedRSCLIP in remote sensing image classification.
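A minimal reading of the dual-prompt mechanism: each client prepends a server-aggregated shared prompt and a locally kept private prompt to the text tokens, and an alignment penalty keeps the two semantically close. Token counts, the dimension, and the penalty form are assumptions, not FedRSCLIP's exact design.

```python
# Hypothetical dual-prompt module; only these small tensors would travel
# between server and clients.
import torch
import torch.nn as nn

class DualPrompt(nn.Module):
    def __init__(self, n_tokens: int = 4, dim: int = 512):
        super().__init__()
        self.shared = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)   # aggregated by server
        self.private = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)  # stays on client

    def forward(self, text_tokens):          # text_tokens: (T, dim) class-name embedding
        return torch.cat([self.shared, self.private, text_tokens], dim=0)

    def alignment_loss(self):                # dual prompt alignment constraint
        return (self.shared - self.private).pow(2).mean()
```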
https://arxiv.org/abs/2501.02461
Meteorological factors (MF) are crucial in day-ahead load forecasting as they significantly influence the electricity consumption behaviors of consumers. Numerous studies have incorporated MF into the load forecasting model to achieve higher accuracy. Selecting MF from one representative location or the averaged MF as the inputs of the forecasting model is a common practice. However, the difference in MF collected in various locations within a region may be significant, which poses a challenge in selecting the appropriate MF from numerous locations. A representation learning framework is proposed to extract geo-distributed MF while considering their spatial relationships. In addition, this paper employs the Shapley value in the graph-based model to reveal connections between MF collected in different locations and loads. To reduce the computational complexity of calculating the Shapley value, an acceleration method is adopted based on Monte Carlo sampling and weighted linear regression. Experiments on two real-world datasets demonstrate that the proposed method improves the day-ahead forecasting accuracy, especially in extreme scenarios such as the "accumulation temperature effect" in summer and "sudden temperature change" in winter. We also find a significant correlation between the importance of MF in different locations and the corresponding area's GDP and mainstay industry.
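The acceleration described is in the spirit of KernelSHAP: sample feature coalitions by Monte Carlo, evaluate the model on masked inputs, and recover approximate Shapley values with a single weighted linear regression instead of exhaustive enumeration. A self-contained numpy sketch, with the baseline handling simplified; f, x, and baseline are placeholders for the forecasting model and a reference MF input.

```python
# Monte Carlo + weighted linear regression approximation of Shapley values.
import numpy as np
from math import comb

def approx_shapley(f, x, baseline, n_samples=2048):
    d = x.shape[0]
    Z = np.random.randint(0, 2, size=(n_samples, d))    # sampled coalitions
    Z[0], Z[1] = 0, 1                                   # anchor empty/full coalitions
    X = np.where(Z == 1, x, baseline)                   # absent features -> baseline
    y = np.array([f(row) for row in X])
    k = Z.sum(axis=1)
    w = np.array([(d - 1) / (comb(d, s) * s * (d - s)) if 0 < s < d else 1e6
                  for s in k])                          # Shapley kernel weights
    A = np.hstack([Z, np.ones((n_samples, 1))])         # last column: intercept
    Aw = A * w[:, None]
    phi = np.linalg.solve(A.T @ Aw, Aw.T @ y)           # weighted least squares
    return phi[:-1]                                     # one Shapley value per MF
```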
https://arxiv.org/abs/2501.02241
The advent of blockchain technology has facilitated the widespread adoption of smart contracts in the financial sector. However, current fraud detection methodologies exhibit limitations in capturing both global structural patterns within transaction networks and local semantic relationships embedded in transaction data. Most existing models focus on either structural information or semantic features individually, leading to suboptimal performance in detecting complex fraud. In this paper, we propose a dynamic feature fusion model that combines graph-based representation learning and semantic feature extraction for blockchain fraud detection. Specifically, we construct global graph representations to model account relationships and extract local contextual features from transaction data. A dynamic multimodal fusion mechanism is introduced to adaptively integrate these features, enabling the model to capture both structural and semantic fraud patterns effectively. We further develop a comprehensive data processing pipeline, including graph construction, temporal feature enhancement, and text preprocessing. Experimental results on large-scale real-world blockchain datasets demonstrate that our method outperforms existing benchmarks across accuracy, F1 score, and recall metrics. This work highlights the importance of integrating structural relationships and semantic similarities for robust fraud detection and offers a scalable solution for securing blockchain systems.
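The "dynamic multimodal fusion mechanism" suggests an input-dependent gate over the two streams; a minimal sketch under that assumption (the paper's actual mechanism may be more elaborate):

```python
# Illustrative gated fusion of graph (structural) and text (semantic) features.
import torch
import torch.nn as nn

class DynamicFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, graph_feat, text_feat):   # each: (N, dim) per transaction
        g = self.gate(torch.cat([graph_feat, text_feat], dim=-1))
        return g * graph_feat + (1 - g) * text_feat   # per-sample balance
```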
https://arxiv.org/abs/2501.02032