Recently, some researchers have begun exploring the use of ViTs for HSI classification and have achieved remarkable results. However, training ViT models requires a considerable number of samples, while hyperspectral data, due to its high annotation cost, typically offers relatively few training samples. This contradiction has not been effectively addressed. In this paper, aiming to solve this problem, we propose the single-direction tuning (SDT) strategy, which serves as a bridge that allows us to leverage existing labeled HSI datasets, and even RGB datasets, to enhance performance on new HSI datasets with limited samples. The proposed SDT inherits the idea of prompt tuning, aiming to reuse pre-trained models with minimal modifications for adaptation to new tasks. Unlike prompt tuning, however, SDT is custom-designed to accommodate the characteristics of HSIs. It utilizes a parallel architecture, an asynchronous cold-hot gradient update strategy, and unidirectional interaction, and aims to fully harness the potent representation learning capabilities derived from training on heterologous, and even cross-modal, datasets. In addition, we introduce a novel triplet-structured transformer (Tri-Former), in which spectral attention and spatial attention modules are merged in parallel to construct the token mixing component, reducing computation cost, and a 3D convolution-based channel mixer module is integrated to enhance stability and preserve structural information. Comparison experiments conducted on three representative HSI datasets captured by different sensors demonstrate that the proposed Tri-Former achieves better performance than several state-of-the-art methods. Homologous, heterologous, and cross-modal tuning experiments verify the effectiveness of the proposed SDT.
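To make the token-mixer/channel-mixer split concrete, here is a minimal PyTorch sketch of a Tri-Former-style block. The tensor layout, the merge-by-sum of the two attention branches, and the head count are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class TriFormerBlock(nn.Module):
    """Sketch of a Tri-Former-style block: spectral and spatial attention run
    in parallel as the token mixer, followed by a 3D-convolutional channel
    mixer. Dimensions and merge-by-sum are assumptions, not the paper's exact design."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Spectral attention treats bands as tokens; spatial attention treats
        # pixels as tokens. Both use standard multi-head self-attention.
        self.spec_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spat_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Channel mixer as a 3D convolution over (bands, height, width).
        self.channel_mixer = nn.Conv3d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):
        # x: (batch, bands, height, width, dim)
        b, s, h, w, d = x.shape
        y = self.norm1(x)
        # Spectral branch: attend across bands for each pixel.
        spec_in = y.permute(0, 2, 3, 1, 4).reshape(b * h * w, s, d)
        spec, _ = self.spec_attn(spec_in, spec_in, spec_in)
        spec = spec.reshape(b, h, w, s, d).permute(0, 3, 1, 2, 4)
        # Spatial branch: attend across pixels for each band.
        spat_in = y.reshape(b * s, h * w, d)
        spat, _ = self.spat_attn(spat_in, spat_in, spat_in)
        spat = spat.reshape(b, s, h, w, d)
        x = x + spec + spat                      # parallel merge (assumed: sum)
        # The 3D-conv channel mixer expects (batch, dim, bands, height, width).
        z = self.norm2(x).permute(0, 4, 1, 2, 3)
        z = self.channel_mixer(z).permute(0, 2, 3, 4, 1)
        return x + z
```

Running the two attention branches in parallel rather than stacking them is what keeps the token-mixing cost down, since each branch attends over only one axis of the hyperspectral cube.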
https://arxiv.org/abs/2309.12865
Self-supervised representation learning (SSRL) has improved performance on downstream phoneme recognition compared to supervised models. Training SSRL models requires a large amount of pre-training data, which poses a challenge for low-resource languages. A common approach is to transfer knowledge from other languages. Instead, we propose using audio augmentation to pre-train SSRL models under low-resource conditions and evaluate phoneme recognition as the downstream task. We performed a systematic comparison of augmentation techniques, namely pitch variation, noise addition, accented target-language speech, and other-language speech. We found that combined augmentation (noise/pitch) was the best strategy, outperforming accent and language knowledge transfer. We compared performance across various quantities and types of pre-training data, and examined the scaling factor of augmented data needed to achieve performance equivalent to models pre-trained with target-domain speech. Our findings suggest that for resource-constrained languages, in-domain synthetic augmentation can outperform knowledge transfer from accented or other-language speech.
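For concreteness, a minimal NumPy sketch of the two augmentations combined above: the SNR-based noise mixing rule is standard, while `pitch_shift_crude` is a deliberately simple resampling stand-in for a proper time-scale-preserving pitch shifter.

```python
import numpy as np

def add_noise(wave, noise, snr_db):
    """Mix a noise clip into `wave` at a target signal-to-noise ratio (dB)."""
    noise = np.resize(noise, wave.shape)           # loop/trim noise to length
    wave_power = np.mean(wave ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(wave_power / (noise_power * 10 ** (snr_db / 10)))
    return wave + scale * noise

def pitch_shift_crude(wave, semitones):
    """Crude pitch variation by resampling; changes duration as a side effect
    (a real pipeline would use a time-scale-preserving shifter)."""
    rate = 2 ** (semitones / 12)
    idx = np.arange(0, len(wave), rate)
    return np.interp(idx, np.arange(len(wave)), wave)

# Combined noise/pitch augmentation, the best-performing strategy above.
rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)                  # stand-in for 1 s of speech
noise = rng.standard_normal(16000)
augmented = add_noise(pitch_shift_crude(wave, semitones=2.0), noise, snr_db=10)
```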
https://arxiv.org/abs/2309.12763
This paper presents BELT, a novel model and learning framework for the pivotal topic of brain-to-language translation. Translating noninvasive brain signals into readable natural language has the potential to broaden the application scenarios of brain-computer interfaces (BCIs) and to advance the field as a whole. The critical problem in brain signal decoding, or brain-to-language translation, is acquiring semantically appropriate and discriminative EEG representations from a dataset of limited scale and quality. The proposed BELT method is a generic and efficient framework that bootstraps EEG representation learning using off-the-shelf large-scale pretrained language models (LMs). Leveraging a large LM's capacity for understanding semantic information and zero-shot generalization, BELT utilizes LMs trained on Internet-scale datasets to bring significant improvements to the understanding of EEG signals. In particular, the BELT model is composed of a deep conformer encoder and a vector quantization encoder. Semantic EEG representations are obtained through a contrastive learning step that provides natural language supervision. We achieve state-of-the-art results on two featured brain decoding tasks: brain-to-language translation and zero-shot sentiment classification. Specifically, our model surpasses the baseline model on the two tasks by 5.45% and over 10%, respectively, and achieves a 42.31% BLEU-1 score and 67.32% precision on the main evaluation metrics for translation and zero-shot sentiment classification.
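The "contrastive learning step that provides natural language supervision" can be sketched as a symmetric InfoNCE objective between EEG and text embeddings; BELT's exact loss may differ, so treat this as an illustration of the pattern.

```python
import torch
import torch.nn.functional as F

def contrastive_eeg_text_loss(eeg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE between paired EEG and text embeddings, one common
    way to realize natural language supervision.
    eeg_emb, text_emb: (batch, dim), paired row-wise."""
    eeg = F.normalize(eeg_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    logits = eeg @ txt.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Matched EEG/text pairs sit on the diagonal; everything else is negative.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```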
https://arxiv.org/abs/2309.12056
Self-supervised representation learning for human action recognition has developed rapidly in recent years. Most of the existing works are based on skeleton data and use a multi-modality setup. These works overlook the differences in performance among modalities, which leads to the propagation of erroneous knowledge between modalities; moreover, only three fundamental modalities, i.e., joints, bones, and motions, are used, and no additional modalities are explored. In this work, we first propose an Implicit Knowledge Exchange Module (IKEM) that alleviates the propagation of erroneous knowledge between low-performance modalities. We then propose three new modalities to enrich the complementary information between modalities. Finally, to maintain efficiency when introducing new modalities, we propose a novel teacher-student framework, named relational cross-modality knowledge distillation, which distills knowledge from the secondary modalities into the mandatory modalities while respecting the relationships constrained by anchors, positives, and negatives. The experimental results demonstrate the effectiveness of our approach, unlocking the efficient use of skeleton-based multi-modality data. Source code will be made publicly available at this https URL.
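A hedged sketch of what relational cross-modality knowledge distillation can look like: rather than matching features directly, the student matches the teacher's batch-level similarity structure, so the anchor/positive/negative relations are what gets transferred. The KL-over-similarities form below is one common realization, not necessarily the paper's.

```python
import torch
import torch.nn.functional as F

def relational_kd_loss(student, teacher, temperature=0.1):
    """Distill the teacher's (secondary modality) relational structure into
    the student (mandatory modality) by matching similarity distributions.
    student, teacher: (batch, dim) embeddings of the same clips."""
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    # Row i holds anchor i's relation to every other sample in the batch.
    mask = ~torch.eye(len(s), dtype=torch.bool, device=s.device)
    s_rel = (s @ s.t() / temperature)[mask].view(len(s), -1)
    t_rel = (t @ t.t() / temperature)[mask].view(len(t), -1)
    return F.kl_div(F.log_softmax(s_rel, dim=-1),
                    F.softmax(t_rel, dim=-1), reduction="batchmean")
```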
https://arxiv.org/abs/2309.12009
Self-supervised representation learning has seen remarkable progress in the last few years, with some of the recent methods being able to learn useful image representations without labels. These methods are trained using backpropagation, the de facto standard. Recently, Geoffrey Hinton proposed the forward-forward algorithm as an alternative training method. It utilizes two forward passes and a separate loss function for each layer to train the network without backpropagation. In this study, for the first time, we study the performance of forward-forward vs. backpropagation for self-supervised representation learning and provide insights into the learned representation spaces. Our benchmark employs four standard datasets, namely MNIST, F-MNIST, SVHN, and CIFAR-10, and three commonly used self-supervised representation learning techniques, namely rotation, flip, and jigsaw. Our main finding is that while the forward-forward algorithm performs comparably to backpropagation during (self-)supervised training, transfer performance lags significantly behind in all the studied settings. This may be caused by a combination of factors, including having a loss function for each layer and the way supervised training is realized in the forward-forward paradigm. Compared to backpropagation, the forward-forward algorithm focuses more on the boundaries and drops part of the information that is unnecessary for making decisions, which harms the representation learning goal. Further investigation and research are necessary to stabilize the forward-forward strategy for self-supervised learning and to make it work beyond the datasets and configurations demonstrated by Geoffrey Hinton.
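A minimal sketch of one forward-forward layer, following Hinton's formulation: each layer is trained locally with its own loss on a "goodness" measure (sum of squared activations), using one forward pass on positive data and one on negative data, with no backward pass across layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFLayer(nn.Module):
    """One forward-forward layer: trained to give high goodness on positive
    data and low goodness on negative data, without backpropagation
    between layers."""
    def __init__(self, d_in, d_out, threshold=2.0, lr=1e-3):
        super().__init__()
        self.fc = nn.Linear(d_in, d_out)
        self.threshold = threshold
        self.opt = torch.optim.Adam(self.fc.parameters(), lr=lr)

    def forward(self, x):
        # Normalize so only the direction, not the previous layer's
        # goodness (magnitude), is passed on.
        x = x / (x.norm(dim=1, keepdim=True) + 1e-8)
        return F.relu(self.fc(x))

    def train_step(self, x_pos, x_neg):
        g_pos = self.forward(x_pos).pow(2).sum(dim=1)   # goodness, positives
        g_neg = self.forward(x_neg).pow(2).sum(dim=1)   # goodness, negatives
        # Push positive goodness above the threshold, negative below it.
        loss = F.softplus(torch.cat([self.threshold - g_pos,
                                     g_neg - self.threshold])).mean()
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        # Detach so no gradient flows to earlier layers: the two forward
        # passes replace the global backward pass.
        return self.forward(x_pos).detach(), self.forward(x_neg).detach()
```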
https://arxiv.org/abs/2309.11955
To improve word representation learning, we propose a probabilistic prior that can be seamlessly integrated with word embedding models. Unlike previous methods, word embedding is treated as a probabilistic generative model, which enables us to impose a prior that regularizes word representation learning. The proposed prior not only enhances the representation of embedding vectors but also improves the model's robustness and stability. Its structure is simple and effective, and it can be easily implemented and flexibly plugged into most existing word embedding models. Extensive experiments show the proposed method improves word representations on various tasks.
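As an illustration of the plug-in pattern only: the sketch below adds an isotropic Gaussian prior over the embeddings (which reduces to an L2 penalty) to a skip-gram negative-sampling loss. The paper's prior is more structured than this; the example just shows where such a term attaches.

```python
import torch
import torch.nn.functional as F

def sgns_loss_with_prior(center, context, negatives, prior_weight=1e-4):
    """Skip-gram negative-sampling loss plus a simple Gaussian prior on the
    embedding vectors (assumed here; the paper's prior is more structured).
    center: (b, d); context: (b, d); negatives: (b, k, d)."""
    pos = F.logsigmoid((center * context).sum(-1))
    neg = F.logsigmoid(-(negatives @ center.unsqueeze(-1)).squeeze(-1)).sum(-1)
    nll = -(pos + neg).mean()
    # Negative log-density under an isotropic Gaussian prior, up to a constant.
    prior = prior_weight * (center.pow(2).sum() + context.pow(2).sum())
    return nll + prior
```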
https://arxiv.org/abs/2309.11824
As 3D human pose estimation can now be achieved with very high accuracy in the supervised learning scenario, tackling the case where 3D pose annotations are not available has received increasing attention. In particular, several methods have proposed to learn image representations in a self-supervised fashion so as to disentangle the appearance information from the pose information. The methods then need only a small amount of supervised data to train a pose regressor using the pose-related latent vector as input, as it should be free of appearance information. In this paper, we carry out an in-depth analysis to understand to what degree the state-of-the-art disentangled representation learning methods truly separate the appearance information from the pose information. First, we study disentanglement from the perspective of the self-supervised network, via diverse image synthesis experiments. Second, we investigate disentanglement with respect to the 3D pose regressor from an adversarial attack perspective. Specifically, we design an adversarial strategy focused on generating natural appearance changes of the subject, against which we could expect a disentangled network to be robust. Altogether, our analyses show that disentanglement in the three state-of-the-art disentangled representation learning frameworks is far from complete, and that their pose codes contain significant appearance information. We believe that our approach provides a valuable testbed for evaluating the degree of disentanglement of pose from appearance in self-supervised 3D human pose estimation.
https://arxiv.org/abs/2309.11667
The AI community has made significant strides in developing powerful foundation models, driven by large-scale multimodal datasets. However, in the audio representation learning community, present audio-language datasets suffer from limitations such as insufficient volume, simplistic content, and arduous collection procedures. To tackle these challenges, we present an innovative, automatic audio caption generation pipeline based on a series of public tools and APIs, and construct a large-scale, high-quality audio-language dataset, named Auto-ACD, comprising over 1.9M audio-text pairs. To demonstrate the effectiveness of the proposed dataset, we train popular models on it and show performance improvements on various downstream tasks, namely audio-language retrieval, audio captioning, and environment classification. In addition, we establish a novel test set and provide a benchmark for audio-text tasks. The proposed dataset will be released at this https URL.
https://arxiv.org/abs/2309.11500
Unsupervised multi-view representation learning has been extensively studied for mining multi-view data. However, some critical challenges remain. On the one hand, existing methods cannot explore multi-view data comprehensively, since they usually learn only a common representation between views, whereas multi-view data contains both common information between views and specific information within each view. On the other hand, to mine nonlinear relationships in the data, kernel or neural network methods are commonly used for multi-view representation learning; however, these methods lack interpretability. To this end, this paper proposes a new multi-view fuzzy representation learning method based on the interpretable Takagi-Sugeno-Kang (TSK) fuzzy system (MVRL_FS). The method realizes multi-view representation learning from two aspects. First, multi-view data are transformed into a high-dimensional fuzzy feature space, while the common information between views and the specific information of each view are explored simultaneously. Second, a new regularization method based on L_(2,1)-norm regression is proposed to mine the consistency information between views, while the geometric structure of the data is preserved through the Laplacian graph. Finally, extensive experiments on many benchmark multi-view datasets validate the superiority of the proposed method.
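The two regularizers are easy to state in code. Below is a small NumPy sketch of the L_(2,1) norm (row-wise L2 norms, summed, encouraging row sparsity) and the Laplacian-graph term tr(Z^T L Z), which is small when representations of graph-connected samples stay close; the affinity matrix here is a toy example.

```python
import numpy as np

def l21_norm(W):
    """L_(2,1) norm: L2 norm of each row, summed; encourages row sparsity."""
    return np.sqrt((W ** 2).sum(axis=1)).sum()

def laplacian_regularizer(Z, A):
    """tr(Z^T L Z) with L = D - A: small when representations Z of samples
    connected in the affinity graph A stay close, preserving geometry."""
    L = np.diag(A.sum(axis=1)) - A
    return np.trace(Z.T @ L @ Z)

# Toy check: representations for 4 samples with a chain-shaped affinity graph.
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
Z = np.random.default_rng(0).standard_normal((4, 2))
print(l21_norm(Z), laplacian_regularizer(Z, A))
```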
https://arxiv.org/abs/2309.11473
Few-shot point cloud semantic segmentation aims to train a model that quickly adapts to new unseen classes with only a handful of support-set samples. However, the noise-free assumption on the support set can easily be violated in many practical real-world settings. In this paper, we focus on improving the robustness of few-shot point cloud segmentation under the detrimental influence of noisy support sets at testing time. To this end, we first propose Component-level Clean Noise Separation (CCNS) representation learning to learn discriminative feature representations that separate the clean samples of the target classes from the noisy samples. Leveraging the well-separated clean and noisy support samples from our CCNS, we further propose a Multi-scale Degree-based Noise Suppression (MDNS) scheme to remove the noisy shots from the support set. We conduct extensive experiments under various noise settings on two benchmark datasets. Our results show that the combination of CCNS and MDNS significantly improves performance. Our code is available at this https URL.
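A speculative sketch of the degree-based idea: build a similarity graph over the support-shot features at several neighborhood scales and drop the shots whose average degree is low. The scales, threshold, and keep-count below are illustrative, not the paper's values.

```python
import numpy as np

def degree_based_noise_suppression(feats, keep, scales=(2, 3, 4), tau=0.7):
    """Filter support shots by multi-scale graph degree: shots whose
    neighbors are rarely 'close' are likely noisy and get dropped.
    feats: (n_shots, dim) features of one support class."""
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T
    degrees = []
    for k in scales:
        # Count how many of each shot's k nearest neighbors are "close";
        # column 0 after sorting is the shot itself (self-similarity = 1).
        nn_sim = np.sort(sim, axis=1)[:, ::-1][:, 1:k + 1]
        degrees.append((nn_sim > tau).sum(axis=1) / k)
    score = np.mean(degrees, axis=0)
    order = np.argsort(-score)
    return order[:keep]    # indices of the shots retained as clean

feats = np.random.default_rng(1).standard_normal((8, 16))
clean_idx = degree_based_noise_suppression(feats, keep=5)
```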
https://arxiv.org/abs/2309.11228
Unsupervised sentence representation learning aims to transform input sentences into fixed-length vectors enriched with intricate semantic information while obviating the reliance on labeled data. Recent progress in this field, propelled by contrastive learning and prompt engineering, has significantly bridged the gap between unsupervised and supervised strategies. Nonetheless, the potential of Chain-of-Thought remains largely untapped along this trajectory. To unlock latent capabilities within pre-trained models such as BERT, we propose a two-stage approach to sentence representation: comprehension and summarization. The output of the latter phase is then harnessed as the vectorized representation of the input sentence. For further performance gains, we meticulously refine both the contrastive learning loss function and the template denoising technique for prompt engineering. Rigorous experimentation shows that our method, CoT-BERT, transcends a suite of robust baselines without requiring other text representation models or external databases.
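A hedged sketch of the two-stage comprehension/summarization idea using Hugging Face Transformers: the sentence is wrapped in a prompt that first "comprehends" and then "summarizes" it, and the hidden state at the final [MASK] is taken as the sentence vector. The templates here are invented stand-ins, not CoT-BERT's actual templates.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def cot_style_embedding(sentence):
    # Stage 1 (comprehension) then stage 2 (summarization), in one template.
    prompt = (f'The sentence "{sentence}" means [MASK]. '
              f'In summary, the sentence can be represented as "[MASK]".')
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    # Take the hidden state at the final [MASK] (the summarization stage).
    mask_positions = (inputs["input_ids"][0] == tok.mask_token_id).nonzero()
    return hidden[mask_positions[-1].item()]

vec = cot_style_embedding("representation learning maps inputs to vectors")
```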
https://arxiv.org/abs/2309.11143
Federated Learning (FL) is currently one of the most popular technologies in the field of Artificial Intelligence (AI) due to its collaborative learning and its ability to preserve client privacy. However, it faces challenges such as non-IID (non-independently and identically distributed) data and imbalanced labels among local clients. To address these limitations, the research community has explored various approaches such as using local model parameters, federated generative adversarial learning, and federated representation learning. In our study, we propose a novel Clustered FedStack framework based on the previously published Stacked Federated Learning (FedStack) framework. The local clients send their model predictions and output-layer weights to a server, which then builds a robust global model. This global model clusters the local clients based on their output-layer weights using a clustering mechanism. We adopt three clustering mechanisms, namely K-Means, Agglomerative clustering, and Gaussian Mixture Models, and evaluate their performance within the framework. We use the Bayesian Information Criterion (BIC) with the maximum likelihood function to determine the number of clusters. The Clustered FedStack models outperform baseline models with clustering mechanisms. To estimate the convergence of our proposed framework, we use cyclical learning rates.
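The clustering step for the GMM variant can be sketched directly with scikit-learn: fit Gaussian Mixture Models of increasing size on the clients' flattened output-layer weights and keep the one with the lowest BIC. The diagonal covariance and cluster-count cap below are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_clients_by_output_weights(weights, max_k=6):
    """Cluster clients by their flattened output-layer weights, picking the
    number of clusters by the Bayesian Information Criterion.
    weights: (n_clients, n_params) array collected at the server."""
    best_k, best_bic, best_labels = None, np.inf, None
    for k in range(1, min(max_k, len(weights)) + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=0).fit(weights)
        bic = gmm.bic(weights)          # lower BIC = better fit vs. complexity
        if bic < best_bic:
            best_k, best_bic, best_labels = k, bic, gmm.predict(weights)
    return best_k, best_labels

# Toy run: 10 clients, 32 output-layer parameters each.
W = np.random.default_rng(0).standard_normal((10, 32))
k, labels = cluster_clients_by_output_weights(W)
```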
https://arxiv.org/abs/2309.11044
Audio-visual representation learning aims to develop systems with human-like perception by utilizing the correlation between auditory and visual information. However, current models often focus on a limited set of tasks, and the generalization abilities of learned representations are unclear. To this end, we propose the AV-SUPERB benchmark, which enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations on 7 datasets covering 5 audio-visual tasks in speech and audio processing. We evaluate 5 recent self-supervised models and show that none of them generalizes to all tasks, emphasizing the need for future study on improving universal model performance. In addition, we show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task. We release our benchmark with evaluation code and a model submission platform to encourage further research in audio-visual learning.
https://arxiv.org/abs/2309.10787
Current state-of-the-art methods for panoptic segmentation require an immense amount of annotated training data that is both arduous and expensive to obtain, posing a significant challenge to their widespread adoption. Concurrently, recent breakthroughs in visual representation learning have sparked a paradigm shift, leading to the advent of large foundation models that can be trained with completely unlabeled images. In this work, we propose to leverage such task-agnostic image features to enable few-shot panoptic segmentation, presenting Segmenting Panoptic Information with Nearly 0 labels (SPINO). In detail, our method combines a DINOv2 backbone with lightweight network heads for semantic segmentation and boundary estimation. We show that our approach, albeit trained with only ten annotated images, predicts high-quality pseudo-labels that can be used with any existing panoptic segmentation method. Notably, we demonstrate that SPINO achieves competitive results compared to fully supervised baselines while using less than 0.3% of the ground truth labels, paving the way for learning complex visual recognition tasks leveraging foundation models. To illustrate its general applicability, we further deploy SPINO on real-world robotic vision systems in both outdoor and indoor environments. To foster future research, we make the code and trained models publicly available at this http URL.
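A sketch of the overall wiring: a frozen, task-agnostic backbone (standing in for DINOv2) feeding two lightweight heads for semantic segmentation and boundary estimation. The head widths and the backbone's output shape are assumptions.

```python
import torch
import torch.nn as nn

class LightweightPanopticHeads(nn.Module):
    """SPINO-style setup sketch: a frozen feature backbone drives two small
    heads; only the heads are trained on the few annotated images."""
    def __init__(self, backbone, feat_dim, num_classes):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # only the heads are trained
        self.semantic_head = nn.Sequential(
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_classes, 1))
        self.boundary_head = nn.Sequential(
            nn.Conv2d(feat_dim, 256, 3, padding=1), nn.ReLU(),
            nn.Conv2d(256, 1, 1))

    def forward(self, images):
        with torch.no_grad():
            feats = self.backbone(images)    # (b, feat_dim, h', w') assumed
        sem = self.semantic_head(feats)
        boundary = torch.sigmoid(self.boundary_head(feats))
        # Upsample both to input resolution for pseudo-label generation.
        size = images.shape[-2:]
        return (nn.functional.interpolate(sem, size, mode="bilinear"),
                nn.functional.interpolate(boundary, size, mode="bilinear"))
```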
https://arxiv.org/abs/2309.10726
High-Dimensional and Incomplete (HDI) data are commonly encountered in big-data applications like social network service systems, which concern limited interactions among numerous nodes. Knowledge acquisition from HDI data is a vital issue in data science owing to the rich patterns, such as node behaviors, embedded in such data, and the fundamental task is HDI data representation learning. Nonnegative Latent Factor Analysis (NLFA) models have proven superior for this task, where a linear bias incorporation (LBI) scheme is important for preventing training overshooting and fluctuation, as well as for keeping the model from premature convergence. However, existing LBI schemes are all static, with fixed linear biases, which significantly restricts the scalability of the resultant NLFA model and costs it representation learning ability on HDI data. Motivated by these observations, this paper presents a dynamic linear bias incorporation (DLBI) scheme. It first extends the linear bias vectors into matrices, and then builds a binary weight matrix to switch the active/inactive states of the linear biases. Each entry of the weight matrix switches between the binary states dynamically, corresponding to the variation of the linear bias values, thereby establishing dynamic linear biases for an NLFA model. Empirical studies on three HDI datasets from real applications demonstrate that the proposed DLBI-based NLFA model obtains higher representation accuracy than state-of-the-art models, as well as highly competitive computational efficiency.
https://arxiv.org/abs/2309.10618
Music motif, as a conceptual building block of composition, is crucial for music structure analysis and automatic composition. While human listeners can identify motifs easily, existing computational models fall short in representing motifs and their developments. The reason is that the nature of motifs is implicit, and the diversity of motif variations extends beyond simple repetitions and modulations. In this study, we aim to learn the implicit relationship between motifs and their variations via representation learning, using the Siamese network architecture and a pretraining and fine-tuning pipeline. A regularization-based method, VICReg, is adopted for pretraining, while contrastive learning is used for fine-tuning. Experimental results on a retrieval-based task show that these two methods complement each other, yielding an improvement of 12.6% in the area under the precision-recall curve. Lastly, we visualize the acquired motif representations, offering an intuitive comprehension of the overall structure of a music piece. As far as we know, this work marks a noteworthy step forward in computational modeling of music motifs. We believe that this work lays the foundations for future applications of motifs in automatic music composition and music information retrieval.
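The VICReg pretraining objective on the two Siamese branches is standard and compact: an invariance term (MSE between the two views), a variance term (keep each embedding dimension's standard deviation above 1), and a covariance term (decorrelate dimensions). A sketch with the usual coefficient choices:

```python
import torch
import torch.nn.functional as F

def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """Standard VICReg objective, here applied to embeddings of two
    variations of the same motif from the Siamese branches.
    za, zb: (batch, dim)."""
    inv = F.mse_loss(za, zb)                        # invariance term
    def variance(z):
        std = torch.sqrt(z.var(dim=0) + 1e-4)
        return F.relu(1.0 - std).mean()             # hinge on per-dim std
    def covariance(z):
        z = z - z.mean(dim=0)
        n, d = z.shape
        cov = (z.t() @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d            # penalize off-diagonals
    return (sim_w * inv
            + var_w * (variance(za) + variance(zb))
            + cov_w * (covariance(za) + covariance(zb)))
```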
https://arxiv.org/abs/2309.10597
Graph Neural Networks (GNNs) have become popular in Graph Representation Learning (GRL). One fundamental application is few-shot node classification. Most existing methods follow the meta-learning paradigm, showing the ability to generalize quickly to few-shot tasks. However, recent works indicate that graph contrastive learning combined with fine-tuning can significantly outperform meta-learning methods. Despite the empirical success, there is limited understanding of the reasons behind it. In our study, we first identify two crucial advantages of contrastive learning over meta-learning: (1) the comprehensive utilization of graph nodes and (2) the power of graph augmentations. To integrate the strengths of both contrastive learning and meta-learning on few-shot node classification tasks, we introduce a new paradigm: Contrastive Few-Shot Node Classification (COLA). Specifically, COLA employs graph augmentations to identify semantically similar nodes, which enables the construction of meta-tasks without the need for label information. COLA can therefore utilize all nodes to construct meta-tasks, further reducing the risk of overfitting. Through extensive experiments, we validate the essentiality of each component in our design and demonstrate that COLA achieves new state-of-the-art results on all tasks.
https://arxiv.org/abs/2309.10376
Graph transformers have gained popularity in various graph-based tasks by addressing challenges faced by traditional Graph Neural Networks. However, the quadratic complexity of self-attention and the extensive layering of graph transformer architectures present challenges when applying them to graph-based prediction tasks. Fine-tuning, a common approach, is resource-intensive and requires storing multiple copies of large models. We propose a novel approach called deep graph prompt tuning as an alternative to fine-tuning for leveraging large graph transformer models in downstream graph-based prediction tasks. Our method introduces trainable feature nodes to the graph and prepends task-specific tokens to the graph transformer, enhancing the model's expressive power. By freezing the pre-trained parameters and only updating the added tokens, our approach reduces the number of free parameters and eliminates the need for multiple model copies, making it suitable for small datasets and scalable to large graphs. Through extensive experiments on datasets of various sizes, we demonstrate that deep graph prompt tuning achieves performance comparable or even superior to fine-tuning, despite using significantly fewer task-specific parameters. Our contributions include the introduction of prompt tuning for graph transformers, its application to both graph transformers and message-passing graph neural networks, improved efficiency and resource utilization, and compelling experimental results. This work brings attention to a promising approach for leveraging pre-trained models in graph-based prediction tasks and offers new opportunities for exploring and advancing graph representation learning.
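A minimal sketch of the prompt-tuning pattern described above: the pretrained encoder is frozen, trainable prompt tokens are prepended to the node token sequence, and only the prompts plus a small head are updated. The token count and the mean readout are assumptions.

```python
import torch
import torch.nn as nn

class GraphPromptTuner(nn.Module):
    """Graph-prompt-tuning sketch: frozen pretrained encoder, trainable
    task-specific prompt tokens prepended to the node token sequence."""
    def __init__(self, pretrained_encoder, dim, n_prompts=8, n_classes=2):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False                      # frozen backbone
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.head = nn.Linear(dim, n_classes)            # small trainable head

    def forward(self, node_tokens):
        # node_tokens: (batch, n_nodes, dim) featurized graph nodes.
        b = node_tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        tokens = torch.cat([prompts, node_tokens], dim=1)
        h = self.encoder(tokens)             # (b, n_prompts + n_nodes, dim)
        return self.head(h.mean(dim=1))      # mean readout (assumed)

# Only the prompt tokens and the head are stored per downstream task.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True), 2)
model = GraphPromptTuner(encoder, dim=64)
logits = model(torch.randn(3, 10, 64))
```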
https://arxiv.org/abs/2309.10131
When translating UI design prototypes to code in industry, automatically generating code from design prototypes can expedite application development and GUI iteration. However, in design prototypes without strict design specifications, UI components may be composed of fragmented elements. Grouping these fragmented elements can greatly improve the readability and maintainability of the generated code. Current methods employ a two-stage strategy that introduces hand-crafted rules to group fragmented elements. Unfortunately, the performance of these methods is not satisfactory due to visually overlapping and tiny UI elements. In this study, we propose EGFE, a novel method for automatically grouping fragmented elements end-to-end via UI sequence prediction. To facilitate UI understanding, we innovatively construct a Transformer encoder to model the relationships between UI elements with multi-modal representation learning. Evaluation on a dataset of 4606 UI prototypes collected from professional UI designers shows that our method outperforms the state-of-the-art baselines in precision (by 29.75%), recall (by 31.07%), and F1-score (by 30.39%) at an edit distance threshold of 4. In addition, we conduct an empirical study to assess the improvement in the generated front-end code. The results demonstrate the effectiveness of our method in a real software engineering application. Our end-to-end fragmented element grouping method creates opportunities for improving UI-related software engineering tasks.
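A hedged sketch of the sequence-prediction formulation: per-element visual and layout features are fused, a Transformer encoder models relations among UI elements, and a per-element tag head predicts grouping decisions. The feature sizes and the three-tag scheme are assumptions.

```python
import torch
import torch.nn as nn

class FragmentGrouper(nn.Module):
    """EGFE-style sketch: multi-modal fusion of visual and layout features,
    a Transformer encoder over the UI element sequence, and a per-element
    grouping tag (e.g., begin-group / inside-group / not-fragmented)."""
    def __init__(self, visual_dim=512, layout_dim=8, d_model=256, n_tags=3):
        super().__init__()
        self.fuse = nn.Linear(visual_dim + layout_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.tagger = nn.Linear(d_model, n_tags)

    def forward(self, visual_feats, layout_feats):
        # visual_feats: (b, n_elems, visual_dim);
        # layout_feats: (b, n_elems, layout_dim), e.g., boxes and sizes.
        x = self.fuse(torch.cat([visual_feats, layout_feats], dim=-1))
        return self.tagger(self.encoder(x))   # (b, n_elems, n_tags) logits

model = FragmentGrouper()
logits = model(torch.randn(2, 20, 512), torch.randn(2, 20, 8))
```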
https://arxiv.org/abs/2309.09867
Scene transfer for vision-based mobile robotics applications is a highly relevant and challenging problem. The utility of a robot greatly depends on its ability to perform a task in the real world, outside of a well-controlled lab environment. Existing scene transfer end-to-end policy learning approaches often suffer from poor sample efficiency or limited generalization capabilities, making them unsuitable for mobile robotics applications. This work proposes an adaptive multi-pair contrastive learning strategy for visual representation learning that enables zero-shot scene transfer and real-world deployment. Control policies relying on the embedding are able to operate in unseen environments without the need for finetuning in the deployment environment. We demonstrate the performance of our approach on the task of agile, vision-based quadrotor flight. Extensive simulation and real-world experiments demonstrate that our approach successfully generalizes beyond the training domain and outperforms all baselines.
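A multi-pair contrastive loss of this kind can be sketched as a multi-positive InfoNCE, where all renderings of the same scene are positives for each other so the embedding becomes invariant to appearance; the paper's adaptive pair weighting is omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def multi_pair_contrastive_loss(emb, scene_ids, temperature=0.1):
    """Multi-positive contrastive sketch: embeddings from the same scene
    under different appearances are positives, cross-scene pairs negatives.
    emb: (n, dim); scene_ids: (n,) integer scene labels."""
    z = F.normalize(emb, dim=-1)
    logits = z @ z.t() / temperature
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(self_mask, float("-inf"))
    pos = (scene_ids.unsqueeze(0) == scene_ids.unsqueeze(1)) & ~self_mask
    log_prob = F.log_softmax(logits, dim=1)
    # Average log-probability over all positives of each anchor.
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```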
https://arxiv.org/abs/2309.09865