We present a hybrid-view-based knowledge distillation framework, termed HVDistill, to guide the feature learning of a point cloud neural network with a pre-trained image network in an unsupervised manner. By exploiting the geometric relationship between RGB cameras and LiDAR sensors, the correspondence between the two modalities based on both image-plane view and bird-eye view can be established, which facilitates representation learning. Specifically, the image-plane correspondences can be simply obtained by projecting the point clouds, while the bird-eye-view correspondences can be achieved by lifting pixels to the 3D space with the predicted depths under the supervision of projected point clouds. The image teacher networks provide rich semantics from the image-plane view and meanwhile acquire geometric information from the bird-eye view. Indeed, image features from the two views naturally complement each other and together can ameliorate the learned feature representation of the point cloud student networks. Moreover, with a self-supervised pre-trained 2D network, HVDistill requires neither 2D nor 3D annotations. We pre-train our model on the nuScenes dataset and transfer it to several downstream tasks on the nuScenes, SemanticKITTI, and KITTI datasets for evaluation. Extensive experimental results show that our method achieves consistent improvements over the baseline trained from scratch and significantly outperforms the existing schemes. Codes are available at git@github.com:zhangsha1024/HVDistill.git.
https://arxiv.org/abs/2403.11817
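The image-plane correspondences described above come from a standard pinhole projection of LiDAR points into the camera frame. A minimal sketch of that mapping (the function name, matrix conventions, and validity checks are assumptions for illustration, not HVDistill's actual code):

```python
import numpy as np

def project_points_to_image(points_lidar, T_cam_from_lidar, K, img_h, img_w):
    """Project LiDAR points into the image plane; return pixel coordinates,
    depths, and a mask for points inside the image with positive depth."""
    n = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((n, 1))])       # homogeneous coords
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]          # LiDAR -> camera frame
    depth = pts_cam[:, 2]
    uv = (K @ pts_cam.T).T                                   # pinhole projection
    uv = uv[:, :2] / np.clip(depth[:, None], 1e-6, None)     # perspective divide
    valid = (depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < img_w) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    return uv, depth, valid
```

The `valid` mask selects the pixel-point pairs that would serve as image-plane correspondences for distillation.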
Recently, feature relation learning has drawn widespread attention in cross-spectral image patch matching. However, existing related research focuses on extracting diverse relations between image patch features and ignores sufficient intrinsic feature representations of individual image patches. Therefore, an innovative relational representation learning idea is proposed for the first time, which simultaneously focuses on sufficiently mining the intrinsic features of individual image patches and the relations between image patch features. Based on this, we construct a lightweight Relational Representation Learning Network (RRL-Net). Specifically, we innovatively construct an autoencoder to fully characterize the individual intrinsic features, and introduce a Feature Interaction Learning (FIL) module to extract deep-level feature relations. To further fully mine individual intrinsic features, a lightweight Multi-dimensional Global-to-Local Attention (MGLA) module is constructed to enhance the global feature extraction of individual image patches and capture local dependencies within global features. By combining the MGLA module, we further explore the feature extraction network and construct an Attention-based Lightweight Feature Extraction (ALFE) network. In addition, we propose a Multi-Loss Post-Pruning (MLPP) optimization strategy, which greatly promotes network optimization while avoiding increases in parameters and inference time. Extensive experiments demonstrate that our RRL-Net achieves state-of-the-art (SOTA) performance on multiple public datasets. Our code will be made public later.
https://arxiv.org/abs/2403.11751
Self-supervised learning (SSL) is potentially useful in reducing the need for manual annotation and making deep learning models accessible for medical image analysis tasks. By leveraging the representations learned from unlabeled data, self-supervised models perform well on tasks that require little to no fine-tuning. However, for medical images, like chest X-rays, which are characterized by complex anatomical structures and diverse clinical conditions, there arises a need for representation learning techniques that can encode fine-grained details while preserving the broader contextual information. In this context, we introduce MLVICX (Multi-Level Variance-Covariance Exploration for Chest X-ray Self-Supervised Representation Learning), an approach to capture rich representations in the form of embeddings from chest X-ray images. Central to our approach is a novel multi-level variance and covariance exploration strategy that empowers the model to detect diagnostically meaningful patterns while reducing redundancy effectively. By enhancing the variance and covariance of the learned embeddings, MLVICX promotes the retention of critical medical insights by adapting both global and local contextual details. We demonstrate the performance of MLVICX in advancing self-supervised chest X-ray representation learning through comprehensive experiments. The performance enhancements we observe across various downstream tasks highlight the significance of the proposed approach in enhancing the utility of chest X-ray embeddings for precision medical diagnosis and comprehensive image analysis. For pretraining, we used the NIH-Chest X-ray dataset, while for downstream tasks, we utilized the NIH-Chest X-ray, Vinbig-CXR, RSNA pneumonia, and SIIM-ACR Pneumothorax datasets. Overall, we observe more than 3% performance gains over SOTA SSL approaches in various downstream tasks.
https://arxiv.org/abs/2403.11504
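Variance-covariance exploration of this kind builds on VICReg-style regularizers over a batch of embeddings. A minimal single-level sketch (the names and the hinge target `gamma` are assumptions; the paper's multi-level formulation is not spelled out in the abstract):

```python
import numpy as np

def variance_covariance_terms(z, gamma=1.0, eps=1e-4):
    """VICReg-style regularizers on embeddings z of shape (batch, dim):
    a hinge that keeps each dimension's std above gamma, and a penalty on
    off-diagonal covariance entries to reduce redundancy across dimensions."""
    z = z - z.mean(axis=0)                       # center each dimension
    std = np.sqrt(z.var(axis=0) + eps)
    var_loss = np.mean(np.maximum(0.0, gamma - std))
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)
    off_diag = cov - np.diag(np.diag(cov))       # zero out the diagonal
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss, cov_loss
```

The variance hinge prevents embedding collapse; the covariance penalty decorrelates dimensions so they carry complementary information.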
In this study, we introduce a novel framework called Toast for learning general-purpose representations of road networks, along with its advanced counterpart DyToast, designed to enhance the integration of temporal dynamics to boost the performance of various time-sensitive downstream tasks. Specifically, we propose to encode two pivotal semantic characteristics intrinsic to road networks: traffic patterns and traveling semantics. To achieve this, we refine the skip-gram module by incorporating auxiliary objectives aimed at predicting the traffic context associated with a target road segment. Moreover, we leverage trajectory data and design pre-training strategies based on Transformer to distill traveling semantics on road networks. DyToast further augments this framework by employing unified trigonometric functions characterized by their beneficial properties, enabling the capture of temporal evolution and dynamic nature of road networks more effectively. With these proposed techniques, we can obtain representations that encode multi-faceted aspects of knowledge within road networks, applicable across both road segment-based applications and trajectory-based applications. Extensive experiments on two real-world datasets across three tasks demonstrate that our proposed framework consistently outperforms the state-of-the-art baselines by a significant margin.
https://arxiv.org/abs/2403.11495
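Refining the skip-gram module means adding auxiliary objectives on top of the usual skip-gram-with-negative-sampling loss over road segments. As a hedged sketch of that base loss only (plain vectors stand in for segment embeddings; the auxiliary traffic-context objective is omitted and all names are assumptions):

```python
import numpy as np

def skipgram_neg_sampling_loss(v_target, v_context, v_negatives):
    """Skip-gram with negative sampling for road-segment embeddings: pull the
    target segment toward a co-occurring segment, push it away from sampled
    negative segments."""
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
    pos = -np.log(sigmoid(v_target @ v_context))          # attract the context
    neg = -np.log(sigmoid(-(v_negatives @ v_target))).sum()  # repel negatives
    return pos + neg
```

A target embedding aligned with its context and anti-aligned with negatives yields a near-zero loss, which is the configuration gradient descent would drive toward.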
An effective technique for obtaining high-quality representations is adding a projection head on top of the encoder during training, then discarding it and using the pre-projection representations. Despite its proven practical effectiveness, the reason behind the success of this technique is poorly understood. The pre-projection representations are not directly optimized by the loss function, raising the question: what makes them better? In this work, we provide a rigorous theoretical answer to this question. We start by examining linear models trained with self-supervised contrastive loss. We reveal that the implicit bias of training algorithms leads to layer-wise progressive feature weighting, where features become increasingly unequal as we go deeper into the layers. Consequently, lower layers tend to have more normalized and less specialized representations. We theoretically characterize scenarios where such representations are more beneficial, highlighting the intricate interplay between data augmentation and input features. Additionally, we demonstrate that introducing non-linearity into the network allows lower layers to learn features that are completely absent in higher layers. Finally, we show how this mechanism improves the robustness in supervised contrastive learning and supervised learning. We empirically validate our results through various experiments on CIFAR-10/100, UrbanCars and shifted versions of ImageNet. We also introduce a potential alternative to projection head, which offers a more interpretable and controllable design.
https://arxiv.org/abs/2403.11391
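The pattern under study is easy to sketch: train with a projection head on top of the encoder, then discard the head and use the pre-projection output downstream. A toy sketch (the two-layer linear/ReLU structure and dimensions are illustrative assumptions, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)

class TwoLayerModel:
    """Encoder followed by a projection head. A contrastive loss would be
    applied to project(x) during training; downstream tasks use encode(x),
    i.e. the pre-projection representation."""
    def __init__(self, d_in, d_rep, d_proj):
        self.W_enc = rng.normal(0.0, 0.1, (d_in, d_rep))
        self.W_proj = rng.normal(0.0, 0.1, (d_rep, d_proj))

    def encode(self, x):
        return np.maximum(0.0, x @ self.W_enc)   # pre-projection (ReLU layer)

    def project(self, x):
        return self.encode(x) @ self.W_proj      # representation fed to the loss

model = TwoLayerModel(8, 16, 4)
x = rng.normal(size=(2, 8))
rep = model.encode(x)    # kept for downstream tasks after the head is discarded
z = model.project(x)     # optimized by the contrastive loss, then thrown away
```

The paper's layer-wise weighting result concerns exactly this gap between `encode` and `project`: the lower layer ends up with more normalized, less specialized features.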
Current LiDAR-based Vehicle-to-Everything (V2X) multi-agent perception systems have shown significant success on 3D object detection. While these models perform well in the clean weather they were trained on, they struggle in unseen adverse weather conditions due to the real-world domain gap. In this paper, we propose a domain generalization approach, named V2X-DGW, for LiDAR-based 3D object detection on multi-agent perception systems under adverse weather conditions. Our research aims to ensure favorable multi-agent performance not only in clean weather but also in unseen adverse weather conditions, by learning only from clean-weather data. To advance research in this area, we have simulated the impact of three prevalent adverse weather conditions on two widely-used multi-agent datasets, resulting in the creation of two novel benchmark datasets: OPV2V-w and V2XSet-w. To this end, we first introduce the Adaptive Weather Augmentation (AWA) to mimic the unseen adverse weather conditions, and then propose two alignments for generalizable representation learning: Trust-region Weather-invariant Alignment (TWA) and Agent-aware Contrastive Alignment (ACA). Extensive experimental results demonstrate that our V2X-DGW achieves improvements in the unseen adverse weather conditions.
https://arxiv.org/abs/2403.11371
We investigate the entity alignment problem with unlabeled dangling cases, meaning that there are entities in the source or target graph having no counterparts in the other, and those entities remain unlabeled. The problem arises when the source and target graphs are of different scales, and it is much cheaper to label the matchable pairs than the dangling entities. To solve the issue, we propose a novel GNN-based dangling detection and entity alignment framework. While the two tasks share the same GNN and are trained together, the detected dangling entities are removed in the alignment. Our framework features a designed entity and relation attention mechanism for selective neighborhood aggregation in representation learning, as well as a positive-unlabeled learning loss for an unbiased estimation of dangling entities. Experimental results have shown that each component of our design contributes to the overall alignment performance, which is comparable or superior to baselines, even if the baselines additionally have 30\% of the dangling entities labeled as training data.
https://arxiv.org/abs/2403.10978
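A positive-unlabeled loss of the kind mentioned above is usually grounded in the non-negative PU risk estimator (Kiryo et al.): labeled dangling entities are the positives, and the unlabeled pool mixes dangling and matchable entities. The sketch below uses a sigmoid surrogate loss and is an assumption about the general form, not the paper's exact estimator:

```python
import numpy as np

def nn_pu_risk(scores_pos, scores_unl, prior):
    """Non-negative PU risk: `scores_*` are classifier scores (dangling = +1),
    `prior` is the assumed fraction of positives among the unlabeled pool."""
    sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))
    loss_pos = 1.0 - sigmoid(scores_pos)       # loss of positives as class +1
    loss_pos_as_neg = sigmoid(scores_pos)      # loss of positives as class -1
    loss_unl_as_neg = sigmoid(scores_unl)      # loss of unlabeled as class -1
    risk_pos = prior * loss_pos.mean()
    # Unbiased negative risk: unlabeled risk minus the positive contamination.
    risk_neg = loss_unl_as_neg.mean() - prior * loss_pos_as_neg.mean()
    return risk_pos + max(0.0, risk_neg)       # clamp keeps the estimate valid
```

The `max(0, ...)` clamp is what makes the estimator non-negative and prevents the model from overfitting the negative-risk correction term.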
Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain, highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To this end, we propose an innovative framework for multi-view representation learning, which incorporates a technique we term 'distilled disentangling'. Our method introduces the concept of masked cross-view prediction, enabling the extraction of compact, high-quality view-consistent representations from various sources without incurring extra computational overhead. Additionally, we develop a distilled disentangling module that efficiently filters out consistency-related information from multi-view representations, resulting in purer view-specific representations. This approach significantly reduces redundancy between view-consistent and view-specific representations, enhancing the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover, we find that reducing the dimensionality of view-consistent representations relative to that of view-specific representations further refines the quality of the combined representations. Our code is accessible at: this https URL.
https://arxiv.org/abs/2403.10897
In social network service platforms, crime suspects are likely to use cybercrime coded words for communication by adding criminal meanings to existing words or replacing them with similar words. For instance, the word 'ice' is often used to mean methamphetamine in drug crimes. To analyze the nature of cybercrime and the behavior of criminals, quickly detecting such words and further understanding their meaning are critical. In the automated cybercrime coded word detection problem, it is difficult to collect a sufficient amount of training data for supervised learning and to directly apply language models that utilize context information to better understand natural language. To overcome these limitations, we propose a new two-step approach, in which a mean latent vector is constructed for each cybercrime through one of five different AutoEncoder models in the first step, and cybercrime coded words are detected based on multi-level latent representations in the second step. Moreover, to deeply understand cybercrime coded words detected through the two-step approach, we propose three novel methods: (1) detection of newly coined words, (2) detection of words frequently appearing in both drug and sex crimes, and (3) automatic generation of a word taxonomy. According to our experimental results, among the various AutoEncoder models, the stacked AutoEncoder model shows the best performance. Additionally, the F1-score of the two-step approach is 0.991, higher than the 0.987 and 0.903 of the existing dark-GloVe and dark-BERT models. By analyzing the experimental results of the three proposed methods, we can gain a deeper understanding of drug and sex crimes.
https://arxiv.org/abs/2403.10838
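The detection step can be sketched as nearest-mean matching in latent space: build a mean latent vector from known coded words of a category, then flag words whose latents fall close to it. The cosine threshold, variable names, and tiny 2-D latents below are illustrative assumptions, not the paper's actual multi-level procedure:

```python
import numpy as np

def detect_coded_words(word_latents, class_mean, threshold=0.8):
    """Flag words whose latent representation has cosine similarity at least
    `threshold` with the mean latent vector of a cybercrime category."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    sims = unit(word_latents) @ unit(class_mean)
    return sims >= threshold

# Toy latents: first two vectors are known drug-coded words, third is unrelated.
drug_latents = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]])
class_mean = drug_latents[:2].mean(axis=0)   # mean latent of known drug words
flags = detect_coded_words(drug_latents, class_mean)
```

In the paper's setting the latents would come from the trained (stacked) AutoEncoder rather than being given directly.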
Finding effective representations for time series data is a useful but challenging task. Several works utilize self-supervised or unsupervised learning methods to address this. However, there still remains the open question of how to leverage available label information for better representations. To answer this question, we exploit pre-existing techniques in time series and representation learning domains and develop a simple, yet novel fusion model, called: \textbf{S}upervised \textbf{CO}ntrastive \textbf{T}emporal \textbf{T}ransformer (SCOTT). We first investigate suitable augmentation methods for various types of time series data to assist with learning change-invariant representations. Secondly, we combine Transformer and Temporal Convolutional Networks in a simple way to efficiently learn both global and local features. Finally, we simplify Supervised Contrastive Loss for representation learning of labelled time series data. We preliminarily evaluate SCOTT on a downstream task, Time Series Classification, using 45 datasets from the UCR archive. The results show that with the representations learnt by SCOTT, even a weak classifier can perform similarly to or better than existing state-of-the-art models (best performance on 23/45 datasets and highest rank against 9 baseline models). Afterwards, we investigate SCOTT's ability to address a real-world task, online Change Point Detection (CPD), on two datasets: a human activity dataset and a surgical patient dataset. We show that the model performs with high reliability and efficiency on the online CPD problem ($\sim$98\% and $\sim$97\% area under precision-recall curve respectively). Furthermore, we demonstrate the model's potential in tackling early detection and show it performs best compared to other candidates.
https://arxiv.org/abs/2403.10787
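The Supervised Contrastive Loss that SCOTT simplifies follows the Khosla et al. formulation: every same-label sample in the batch is a positive for the anchor. A plain numpy sketch of that base loss (loop form for clarity; the abstract does not specify SCOTT's exact simplification, so this is the standard version):

```python
import numpy as np

def supervised_contrastive_loss(z, labels, tau=0.5):
    """Supervised contrastive loss over a batch of embeddings z (n, dim):
    for each anchor, all other samples with the same label are positives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarities
    sim = z @ z.T / tau
    n = z.shape[0]
    mask_self = np.eye(n, dtype=bool)
    logits = np.where(mask_self, -np.inf, sim)         # exclude self-pairs
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    total = 0.0
    for i in range(n):
        pos = (labels == labels[i]) & ~mask_self[i]
        if pos.any():
            total += -log_prob[i, pos].mean()
    return total / n
```

When embeddings cluster by label the loss is low; mismatched labels drive it up, which is the signal that pulls same-class series together.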
Recently there has been a growing interest in industry and academia regarding the use of wireless chargers to prolong the operational longevity of unmanned aerial vehicles (commonly known as drones). In this paper, we consider a charger-assisted drone application: a drone is deployed to observe a set of points of interest, while a charger can move to recharge the drone's battery. We focus on the route and charging schedule of the drone and the mobile charger, to obtain high observation utility in the shortest possible time, while ensuring the drone remains operational during task execution. Essentially, this proposed drone-charger scheduling problem is a multi-stage decision-making process, in which the drone and the mobile charger act as two agents who cooperate to finish a task. The discrete-continuous hybrid action space of the two agents poses a significant challenge in our problem. To address this issue, we present a hybrid-action deep reinforcement learning framework, called HaDMC, which uses a standard policy learning algorithm to generate latent continuous actions. Motivated by representation learning, we specifically design and train an action decoder. It involves two pipelines to convert the latent continuous actions into original discrete and continuous actions, by which the drone and the charger can directly interact with the environment. We embed a mutual learning scheme in model training, emphasizing collaborative rather than individual actions. We conduct extensive numerical experiments to evaluate HaDMC and compare it with state-of-the-art deep reinforcement learning approaches. The experimental results show the effectiveness and efficiency of our solution.
https://arxiv.org/abs/2403.10761
Semi-supervised learning (SSL) seeks to enhance task performance by training on both labeled and unlabeled data. Mainstream SSL image classification methods mostly optimize a loss that additively combines a supervised classification objective with a regularization term derived solely from unlabeled data. This formulation neglects the potential for interaction between labeled and unlabeled images. In this paper, we introduce InterLUDE, a new approach to enhance SSL made of two parts that each benefit from labeled-unlabeled interaction. The first part, embedding fusion, interpolates between labeled and unlabeled embeddings to improve representation learning. The second part is a new loss, grounded in the principle of consistency regularization, that aims to minimize discrepancies in the model's predictions between labeled versus unlabeled inputs. Experiments on standard closed-set SSL benchmarks and a medical SSL task with an uncurated unlabeled set show clear benefits to our approach. On the STL-10 dataset with only 40 labels, InterLUDE achieves 3.2% error rate, while the best previous method reports 14.9%.
https://arxiv.org/abs/2403.10658
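Embedding fusion interpolates labeled and unlabeled embeddings; the abstract does not give InterLUDE's exact rule, so the sketch below is a generic mixup-style stand-in (the Beta-sampled coefficient and random pairing are assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)

def embedding_fusion(emb_labeled, emb_unlabeled, alpha=0.3):
    """Mixup-style interpolation between labeled and unlabeled embeddings.
    A single mixing coefficient is drawn from Beta(alpha, alpha) and the
    unlabeled batch is randomly permuted before pairing."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(emb_unlabeled.shape[0])
    k = min(emb_labeled.shape[0], emb_unlabeled.shape[0])
    fused = lam * emb_labeled[:k] + (1.0 - lam) * emb_unlabeled[perm][:k]
    return fused, lam
```

The fused embeddings sit between the two data populations, which is what creates the labeled-unlabeled interaction the additive SSL losses lack.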
Vision-language pre-training (VLP) models have shown significant advancements in the medical domain. Yet, most VLP models align raw reports to images at a very coarse level, without modeling fine-grained relationships between anatomical and pathological concepts outlined in reports and the corresponding semantic counterparts in images. To address this problem, we propose a Medical Dual-Stream Language-Image Pre-training (MeDSLIP) framework. Specifically, MeDSLIP establishes vision-language fine-grained alignments via disentangling visual and textual representations into anatomy-relevant and pathology-relevant streams. Moreover, a novel vision-language Prototypical Contrastive Learning (ProtoCL) method is adopted in MeDSLIP to enhance the alignment within the anatomical and pathological streams. MeDSLIP further employs cross-stream Intra-image Contrastive Learning (ICL) to ensure the consistent coexistence of paired anatomical and pathological concepts within the same image. Such a cross-stream regularization encourages the model to exploit the synchrony between two streams for a more comprehensive representation learning. MeDSLIP is evaluated under zero-shot and supervised fine-tuning settings on three public datasets: NIH CXR14, RSNA Pneumonia, and SIIM-ACR Pneumothorax. Under these settings, MeDSLIP outperforms six leading CNN-based models on classification, grounding, and segmentation tasks.
https://arxiv.org/abs/2403.10635
Deep learning (DL) models have been advancing automatic medical image analysis on various modalities, including echocardiography, by offering a comprehensive end-to-end training pipeline. This approach enables DL models to regress ejection fraction (EF) directly from 2D+time echocardiograms, resulting in superior performance. However, the end-to-end training pipeline makes the learned representations less explainable. The representations may also fail to capture the continuous relation among echocardiogram clips, indicating the existence of spurious correlations, which can negatively affect the generalization. To mitigate this issue, we propose CoReEcho, a novel training framework emphasizing continuous representations tailored for direct EF regression. Our extensive experiments demonstrate that CoReEcho: 1) outperforms the current state-of-the-art (SOTA) on the largest echocardiography dataset (EchoNet-Dynamic) with MAE of 3.90 & R2 of 82.44, and 2) provides robust and generalizable features that transfer more effectively in related downstream tasks. The code is publicly available at this https URL.
https://arxiv.org/abs/2403.10164
Trajectory prediction is a challenging problem that requires considering interactions among multiple actors and the surrounding environment. While data-driven approaches have been used to address this complex problem, they suffer from unreliable predictions under distribution shifts during test time. Accordingly, several online learning methods have been proposed using regression loss from the ground truth of observed data leveraging the auto-labeling nature of trajectory prediction task. We mainly tackle the following two issues. First, previous works underfit and overfit as they only optimize the last layer of the motion decoder. To this end, we employ the masked autoencoder (MAE) for representation learning to encourage complex interaction modeling in shifted test distribution for updating deeper layers. Second, utilizing the sequential nature of driving data, we propose an actor-specific token memory that enables the test-time learning of actor-wise motion characteristics. Our proposed method has been validated across various challenging cross-dataset distribution shift scenarios including nuScenes, Lyft, Waymo, and Interaction. Our method surpasses the performance of existing state-of-the-art online learning methods in terms of both prediction accuracy and computational efficiency. The code is available at this https URL.
https://arxiv.org/abs/2403.10052
Causal Representation Learning (CRL) aims at identifying high-level causal factors and their relationships from high-dimensional observations, e.g., images. While most CRL works focus on learning causal representations in a single environment, in this work we instead propose a first step towards learning causal representations from temporal sequences of images that can be adapted in a new environment, or composed across multiple related environments. In particular, we introduce DECAF, a framework that detects which causal factors can be reused and which need to be adapted from previously learned causal representations. Our approach is based on the availability of intervention targets, that indicate which variables are perturbed at each time step. Experiments on three benchmark datasets show that integrating our framework with four state-of-the-art CRL approaches leads to accurate representations in a new environment with only a few samples.
https://arxiv.org/abs/2403.09830
Self-supervised learning (SSL) has recently emerged as a powerful approach to learning representations from large-scale unlabeled data, showing promising results in time series analysis. Self-supervised representation learning can be categorized into two mainstreams: contrastive and generative. In this paper, we present a comprehensive comparative study between contrastive and generative methods in time series. We first introduce the basic frameworks for contrastive and generative SSL, respectively, and discuss how to obtain the supervision signal that guides the model optimization. We then implement classical algorithms (SimCLR vs. MAE) for each type and conduct a comparative analysis in fair settings. Our results provide insights into the strengths and weaknesses of each approach and offer practical recommendations for choosing suitable SSL methods. We also discuss the implications of our findings for the broader field of representation learning and propose future research directions. All the code and data are released at \url{this https URL}.
https://arxiv.org/abs/2403.09809
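The two mainstreams being compared reduce to two objectives: a SimCLR-style contrastive loss over augmented views, and an MAE-style reconstruction loss over masked time steps. Minimal numpy sketches of both (batch-level toy versions with assumed names, not the paper's implementations):

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Contrastive (SimCLR-style) objective: z1[i] and z2[i] are embeddings of
    two augmented views of the same series; all other rows act as negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # pull matched pairs together

def masked_reconstruction(x, x_hat, mask):
    """Generative (MAE-style) objective: MSE computed only on masked steps."""
    return ((x - x_hat) ** 2)[mask].mean()
```

The contrastive loss supervises through instance discrimination, the generative one through in-filling; the paper's comparison asks which signal yields better time-series representations.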
Self-supervised 3D representation learning aims to learn effective representations from large-scale unlabeled point clouds. Most existing approaches adopt point discrimination as the pretext task, which assigns matched points in two distinct views as positive pairs and unmatched points as negative pairs. However, this approach often results in semantically identical points having dissimilar representations, leading to a high number of false negatives and introducing a "semantic conflict" problem. To address this issue, we propose GroupContrast, a novel approach that combines segment grouping and semantic-aware contrastive learning. Segment grouping partitions points into semantically meaningful regions, which enhances semantic coherence and provides semantic guidance for the subsequent contrastive representation learning. Semantic-aware contrastive learning augments the semantic information extracted from segment grouping and helps to alleviate the issue of "semantic conflict". We conducted extensive experiments on multiple 3D scene understanding tasks. The results demonstrate that GroupContrast learns semantically meaningful representations and achieves promising transfer learning performance.
https://arxiv.org/abs/2403.09639
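The "semantic conflict" fix described above can be sketched as a segment-aware contrastive loss: points that share a segment id (produced by the unsupervised segment-grouping step, assumed given here) are treated as positives for each other rather than as false negatives. A minimal numpy sketch, not the paper's implementation:

```python
import numpy as np

def segment_info_nce(feats, seg_ids, temperature=0.07):
    """Segment-aware contrastive loss: for each point, all other points in
    the same segment are positives; points from other segments are negatives.
    seg_ids is assumed to come from an unsupervised grouping, not labels."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = np.exp(f @ f.T / temperature)           # (N, N) exp-similarities
    np.fill_diagonal(sim, 0.0)                    # exclude self-pairs
    same = seg_ids[:, None] == seg_ids[None, :]   # (N, N) same-segment mask
    np.fill_diagonal(same, False)
    pos = (sim * same).sum(axis=1)                # mass on same-segment points
    denom = sim.sum(axis=1)                       # mass on all other points
    valid = same.any(axis=1)                      # skip singleton segments
    return -np.mean(np.log(pos[valid] / denom[valid]))
```

When the grouping matches the feature geometry, same-segment points dominate the denominator mass and the loss is low; a grouping that cuts across clusters is penalized, which is how segment coherence guides the representation.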
Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the benefits of data augmentation demonstrated in many learning methods, audio-visual learning has struggled to fully harness them, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins by extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks.
https://arxiv.org/abs/2403.09502
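The equivariance idea above contrasts with plain invariance: instead of forcing the embedding of an augmented input to equal the clean embedding, a transformation predictor, conditioned on the augmentation parameters, predicts how the embedding should change. The sketch below uses a single linear map as the predictor (the paper uses a shared attention-based predictor); shapes and names are assumptions for illustration.

```python
import numpy as np

def equivariance_loss(z, z_aug, t_params, W):
    """Equivariance objective sketch: a predictor (here the linear map W)
    maps the clean embedding z together with the augmentation parameters
    t_params to a prediction of the augmented embedding z_aug. Matching the
    prediction, rather than forcing z == z_aug, keeps augmentation
    information in the representation instead of discarding it."""
    inp = np.concatenate([z, t_params], axis=1)  # condition on augmentation
    pred = inp @ W                               # predicted augmented embedding
    return np.mean((pred - z_aug) ** 2)
```

A zero loss means the predictor exactly accounts for the augmentation's effect in embedding space, which is the equivariance property the framework optimizes for.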
While the introduction of contrastive learning frameworks in sentence representation learning has significantly contributed to advancements in the field, it remains unclear whether state-of-the-art sentence embeddings can capture the fine-grained semantics of sentences, particularly when conditioned on specific perspectives. In this paper, we introduce Hyper-CL, an efficient methodology that integrates hypernetworks with contrastive learning to compute conditioned sentence representations. In our proposed approach, the hypernetwork is responsible for transforming pre-computed condition embeddings into corresponding projection layers. This enables the same sentence embeddings to be projected differently according to various conditions. Evaluation on two representative conditioning benchmarks, namely conditional semantic text similarity and knowledge graph completion, demonstrates that Hyper-CL is effective in flexibly conditioning sentence representations, showcasing its computational efficiency at the same time. We also provide a comprehensive analysis of the inner workings of our approach, leading to a better interpretation of its mechanisms.
https://arxiv.org/abs/2403.09490
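The hypernetwork mechanism described above can be sketched in a few lines: a condition embedding is mapped to the weights of a projection layer, so the same pre-computed sentence embedding is projected differently per condition. The single linear hypernetwork `H` and the flat weight layout below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def hyper_project(sent_emb, cond_emb, H):
    """Hypernetwork-style conditioned projection sketch: the condition
    embedding is transformed (here by one linear map H) into a flat weight
    vector, reshaped into a (d, d) projection matrix, and applied to the
    frozen sentence embedding. Different conditions yield different
    projections of the same sentence."""
    d = sent_emb.shape[-1]
    W = (cond_emb @ H).reshape(d, d)  # generated per-condition projection
    return sent_emb @ W
```

Because the sentence embedding itself is pre-computed and fixed, only the small hypernetwork runs per condition, which is the source of the computational efficiency the abstract claims.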