Mixed initiative serves as one of the key factors in controlling conversation directions. For a speaker, responding passively or leading proactively results in rather different responses. However, most dialogue systems focus on training a holistic response generation model without distinguishing among different initiatives, which leads to a cross-contamination problem: the model confuses different initiatives and generates inappropriate responses. Moreover, obtaining plentiful human annotations for initiative labels can be expensive. To address this, we propose a general Mix-Initiative Dynamic Prefix Tuning framework (IDPT) that decouples different initiatives from the generation model and learns initiative-aware prefixes in both supervised and unsupervised settings. Specifically, IDPT decouples initiative factors into separate prefix parameters and uses an attention mechanism to dynamically adjust the selection of initiatives that guide generation. The prefix parameters can be tuned towards accurate initiative prediction as well as mix-initiative response generation. Extensive experiments on two public dialogue datasets show that the proposed IDPT outperforms previous baselines on both automatic metrics and human evaluations. It also manages to generate appropriate responses with manipulated initiatives.
https://arxiv.org/abs/2403.17636
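The attention-based mixing of initiative-specific prefixes described above can be illustrated with a minimal sketch. This is not the authors' implementation: the names (`mix_prefixes`, `prefix_keys`) and shapes are hypothetical, and a real model would attend inside a transformer rather than over plain vectors.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mix_prefixes(query, prefix_keys, prefixes):
    """Attend over initiative-specific prefixes and return a weighted mix.

    query:        (d,)     context representation of the dialogue
    prefix_keys:  (k, d)   one learnable key per initiative
    prefixes:     (k, p, d) k learnable prefix parameter sets
    """
    # Scaled dot-product attention over the k initiative keys.
    scores = prefix_keys @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)  # soft distribution over initiatives
    # Weighted combination of the k prefix parameter sets -> (p, d).
    mixed = np.einsum("k,kpd->pd", weights, prefixes)
    return weights, mixed
```

The attention weights double as a soft initiative prediction, so the same parameters can be tuned for both initiative classification and generation.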
In this study, we address a gap in existing unsupervised domain adaptation approaches to LiDAR-based 3D object detection, which have predominantly concentrated on adapting between established, high-density autonomous driving datasets. We focus on sparser point clouds, capturing scenarios from different perspectives: not just from vehicles on the road but also from mobile robots on sidewalks, which encounter significantly different environmental conditions and sensor configurations. We introduce Unsupervised Adversarial Domain Adaptation for 3D Object Detection (UADA3D). UADA3D does not depend on pre-trained source models or teacher-student architectures. Instead, it uses an adversarial approach to directly learn domain-invariant features. We demonstrate its efficacy in various adaptation scenarios, showing significant improvements in both self-driving car and mobile robot domains. Our code is open-source and will be available soon.
https://arxiv.org/abs/2403.17633
In this work, we revisit the problem of semi-supervised named entity recognition (NER) focusing on extremely light supervision, consisting of a lexicon containing only 10 examples per class. We introduce ELLEN, a simple, fully modular, neuro-symbolic method that blends fine-tuned language models with linguistic rules. These rules include insights such as ''One Sense Per Discourse'', using a Masked Language Model as an unsupervised NER, leveraging part-of-speech tags to identify and eliminate unlabeled entities as false negatives, and other intuitions about classifier confidence scores in local and global context. ELLEN achieves very strong performance on the CoNLL-2003 dataset when using the minimal supervision from the lexicon above. It also outperforms most existing (and considerably more complex) semi-supervised NER methods under the same supervision settings commonly used in the literature (i.e., 5% of the training data). Further, we evaluate our CoNLL-2003 model in a zero-shot scenario on WNUT-17 where we find that it outperforms GPT-3.5 and achieves comparable performance to GPT-4. In a zero-shot setting, ELLEN also achieves over 75% of the performance of a strong, fully supervised model trained on gold data. Our code is available at: this https URL.
https://arxiv.org/abs/2403.17385
Unsupervised Domain Adaptation (UDA) aims to adapt models from labeled source domains to unlabeled target domains. When adapting to adverse scenes, existing UDA methods fail to perform well due to the lack of instructions, leading their models to overlook discrepancies within all adverse scenes. To tackle this, we propose CoDA, which instructs models to distinguish, focus on, and learn from these discrepancies at the scene and image levels. Specifically, CoDA consists of a Chain-of-Domain (CoD) strategy and a Severity-Aware Visual Prompt Tuning (SAVPT) mechanism. CoD focuses on scene-level instructions to divide all adverse scenes into easy and hard scenes, guiding models to adapt from the source to easy domains with easy scene images, and then to hard domains with hard scene images, thereby laying a solid foundation for the whole adaptation. Building upon this foundation, we employ SAVPT to dive into more detailed image-level instructions to boost performance. SAVPT features a novel metric, Severity, that divides all adverse scene images into low-severity and high-severity images. Severity then directs visual prompts and adapters, instructing models to concentrate on unified severity features instead of scene-specific features, without adding complexity to the model architecture. CoDA achieves state-of-the-art performance on widely used benchmarks under all adverse scenes. Notably, CoDA outperforms existing methods by 4.6% and 10.3% mIoU on the Foggy Driving and Foggy Zurich benchmarks, respectively. Our code is available at this https URL
https://arxiv.org/abs/2403.17369
Large language models (LLMs) have been increasingly employed for (interactive) decision-making, via the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting when they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings in online learning and game theory, through the performance metric of regret. We first empirically study the no-regret behaviors of LLMs in canonical (non-stationary) online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To promote the no-regret behaviors, we propose a novel unsupervised training loss, the regret-loss, which, in contrast to the supervised pre-training loss, does not require the labels of (optimal) actions. We then establish the statistical guarantee of a generalization bound for regret-loss minimization, followed by the optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above "regrettable" cases.
https://arxiv.org/abs/2403.16843
Deep learning faces significant challenges during the training of neural networks, including internal covariate shift, label shift, vanishing/exploding gradients, overfitting, and computational complexity. While conventional normalization methods, such as Batch Normalization, aim to tackle some of these issues, they often depend on assumptions that constrain their adaptability. Mixture Normalization faces computational hurdles in its pursuit of handling multiple Gaussian distributions. This paper introduces Cluster-Based Normalization (CB-Norm) in two variants - Supervised Cluster-Based Normalization (SCB-Norm) and Unsupervised Cluster-Based Normalization (UCB-Norm) - proposing a groundbreaking one-step normalization approach. CB-Norm leverages a Gaussian mixture model to specifically address challenges related to gradient stability and learning acceleration. For SCB-Norm, a supervised variant, the novel mechanism involves introducing predefined data partitioning, termed clusters, to normalize activations based on the assigned cluster. This cluster-driven approach creates a space that conforms to a Gaussian mixture model. On the other hand, UCB-Norm, an unsupervised counterpart, dynamically clusters neuron activations during training, adapting to task-specific challenges without relying on predefined data partitions (clusters). This dual approach ensures flexibility in addressing diverse learning scenarios. CB-Norm innovatively uses a one-step normalization approach, where parameters of each mixture component (cluster in activation space) serve as weights for deep neural networks. This adaptive clustering process tackles both clustering and resolution of deep neural network tasks concurrently during training, signifying a notable advancement in the field.
https://arxiv.org/abs/2403.16798
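The core idea of normalizing activations by the statistics of their assigned cluster can be sketched as follows. This is a simplified NumPy illustration with hard cluster assignments, not the paper's CB-Norm implementation (which works with learnable Gaussian-mixture components):

```python
import numpy as np

def cluster_normalize(x, assignments, eps=1e-5):
    """Normalize each activation by its assigned cluster's mean and variance.

    x:           (n, d) activations
    assignments: (n,)   cluster index per sample
    """
    out = np.empty_like(x, dtype=float)
    for c in np.unique(assignments):
        idx = assignments == c
        mu = x[idx].mean(axis=0)
        var = x[idx].var(axis=0)
        # Standardize within the cluster, as Batch Norm would within a batch.
        out[idx] = (x[idx] - mu) / np.sqrt(var + eps)
    return out
```

Replacing the hard assignments with soft mixture responsibilities would recover a one-step, mixture-style normalization in the spirit the abstract describes.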
Event-based semantic segmentation has gained popularity due to its capability to deal with scenarios under high-speed motion and extreme lighting conditions, which cannot be addressed by conventional RGB cameras. Since it is hard to annotate event data, previous approaches rely on event-to-image reconstruction to obtain pseudo labels for training. However, this will inevitably introduce noise, and learning from noisy pseudo labels, especially when generated from a single source, may reinforce the errors. This drawback is also called confirmation bias in pseudo-labeling. In this paper, we propose a novel hybrid pseudo-labeling framework for unsupervised event-based semantic segmentation, HPL-ESS, to alleviate the influence of noisy pseudo labels. In particular, we first employ a plain unsupervised domain adaptation framework as our baseline, which can generate a set of pseudo labels through self-training. Then, we incorporate offline event-to-image reconstruction into the framework, and obtain another set of pseudo labels by predicting segmentation maps on the reconstructed images. A noisy label learning strategy is designed to mix the two sets of pseudo labels and enhance the quality. Moreover, we propose a soft prototypical alignment module to further improve the consistency of target domain features. Extensive experiments show that our proposed method outperforms existing state-of-the-art methods by a large margin on the DSEC-Semantic dataset (+5.88% accuracy, +10.32% mIoU), which even surpasses several supervised methods.
https://arxiv.org/abs/2403.16788
Unsupervised person re-identification aims to retrieve images of a specified person without identity labels. Many recent unsupervised Re-ID approaches adopt clustering-based methods to measure cross-camera feature similarity and roughly divide images into clusters. They ignore the feature distribution discrepancy induced by the camera domain gap, resulting in unavoidable performance degradation. Camera information is usually available, and the feature distribution within a single camera usually focuses more on the appearance of the individual and has less intra-identity variance. Inspired by this observation, we introduce a Camera-Aware Label Refinement (CALR) framework that reduces camera discrepancy by clustering intra-camera similarity. Specifically, we employ intra-camera training to obtain reliable local pseudo labels within each camera, then refine the global labels generated by inter-camera clustering and train the discriminative model using more reliable global pseudo labels in a self-paced manner. Meanwhile, we develop a camera-alignment module to align feature distributions under different cameras, which helps deal with the camera variance further. Extensive experiments validate the superiority of our proposed method over state-of-the-art approaches. The code is accessible at this https URL.
https://arxiv.org/abs/2403.16450
This paper introduces InstUPR, an unsupervised passage reranking method based on large language models (LLMs). Different from existing approaches that rely on extensive training with query-document pairs or retrieval-specific instructions, our method leverages the instruction-following capabilities of instruction-tuned LLMs for passage reranking without any additional fine-tuning. To achieve this, we introduce a soft score aggregation technique and employ pairwise reranking for unsupervised passage reranking. Experiments on the BEIR benchmark demonstrate that InstUPR outperforms unsupervised baselines as well as an instruction-tuned reranker, highlighting its effectiveness and superiority. Source code to reproduce all experiments is open-sourced at this https URL
https://arxiv.org/abs/2403.16435
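The pairwise reranking step can be sketched as a round-robin tournament over candidate passages. In InstUPR the pairwise preference would come from an instruction-tuned LLM (and scores would be aggregated softly); here that judgment is stubbed out as an arbitrary callable, so this is an assumed illustration rather than the paper's method:

```python
def pairwise_rerank(passages, prefer):
    """Rerank passages by counting wins in all pairwise comparisons.

    prefer(a, b) -> True if passage a is judged more relevant than b.
    """
    wins = {p: 0 for p in passages}
    for i, a in enumerate(passages):
        for b in passages[i + 1:]:
            # Each pair is compared once; the winner collects a point.
            if prefer(a, b):
                wins[a] += 1
            else:
                wins[b] += 1
    # Final order: most pairwise wins first.
    return sorted(passages, key=lambda p: wins[p], reverse=True)
```

The quadratic number of comparisons is the usual cost of pairwise reranking, which is why it is typically applied only to a short candidate list from a first-stage retriever.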
Unsupervised point cloud shape correspondence aims to establish point-wise correspondences between source and target point clouds. Existing methods obtain correspondences directly by computing point-wise feature similarity between point clouds. However, non-rigid objects possess strong deformability and unusual shapes, making it a longstanding challenge to directly establish correspondences between point clouds with unconventional shapes. To address this challenge, we propose an unsupervised Template-Assisted point cloud shape correspondence Network, termed TANet, including a template generation module and a template assistance module. The proposed TANet enjoys several merits. Firstly, the template generation module establishes a set of learnable templates with explicit structures. Secondly, we introduce a template assistance module that extensively leverages the generated templates to establish more accurate shape correspondences from multiple perspectives. Extensive experiments on four human and animal datasets demonstrate that TANet achieves favorable performance against state-of-the-art methods.
https://arxiv.org/abs/2403.16412
Federated learning achieves effective performance in modeling decentralized data. In practice, client data are not well labeled, which motivates federated unsupervised learning (FUSL) with non-IID data. However, the performance of existing FUSL methods suffers from insufficient representations, i.e., (1) representation collapse entanglement among local and global models, and (2) inconsistent representation spaces among local models. The former indicates that representation collapse in a local model will subsequently impact the global model and other local models. The latter means that clients model data representations with inconsistent parameters owing to the deficiency of supervision signals. In this work, we propose FedU2, which enhances the generation of uniform and unified representations in FUSL with non-IID data. Specifically, FedU2 consists of a flexible uniform regularizer (FUR) and an efficient unified aggregator (EUA). FUR in each client avoids representation collapse by dispersing samples uniformly, and EUA in the server promotes unified representations by constraining consistent client model updates. To extensively validate the performance of FedU2, we conduct both cross-device and cross-silo evaluation experiments on two benchmark datasets, i.e., CIFAR10 and CIFAR100.
https://arxiv.org/abs/2403.16398
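The abstract does not specify FUR's exact form; one standard way to "disperse samples uniformly" on the embedding hypersphere is a Wang-and-Isola-style uniformity loss, sketched here purely as an assumed illustration of the general idea:

```python
import numpy as np

def uniformity_loss(z, t=2.0):
    """Uniformity regularizer on L2-normalized embeddings.

    Lower values mean samples are spread more evenly on the unit hypersphere;
    a collapsed representation (all points identical) gives the maximum, 0.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    # Squared pairwise distances between all embeddings.
    sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    iu = np.triu_indices(len(z), k=1)  # each unordered pair once
    return np.log(np.exp(-t * sq[iu]).mean())
```

Adding such a term to each client's local objective penalizes representation collapse without requiring any labels, which matches the role the abstract assigns to FUR.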
Recently, progress in acquisition equipment such as LiDAR sensors has enabled sensing increasingly spacious outdoor 3D environments. Making sense of such 3D acquisitions requires fine-grained scene understanding, such as constructing instance-based 3D scene segmentations. Commonly, a neural network is trained for this task; however, this requires access to a large, densely annotated dataset, which is widely known to be challenging to obtain. To address this issue, in this work we propose to predict instance segmentations for 3D scenes in an unsupervised way, without relying on ground-truth annotations. To this end, we construct a learning framework consisting of two components: (1) a pseudo-annotation scheme for generating initial unsupervised pseudo-labels; and (2) a self-training algorithm for instance segmentation to fit robust, accurate instances from initial noisy proposals. To enable generating 3D instance mask proposals, we construct a weighted proxy-graph by connecting 3D points with edges integrating multi-modal image- and point-based self-supervised features, and perform graph-cuts to isolate individual pseudo-instances. We then build on a state-of-the-art point-based architecture and train a 3D instance segmentation model, resulting in significant refinement of initial proposals. To scale to arbitrary complexity 3D scenes, we design our algorithm to operate on local 3D point chunks and construct a merging step to generate scene-level instance segmentations. Experiments on the challenging SemanticKITTI benchmark demonstrate the potential of our approach, where it attains 13.3% higher Average Precision and 9.1% higher F1 score compared to the best-performing baseline. The code will be made publicly available at this https URL.
https://arxiv.org/abs/2403.16318
Topic modelling, as a well-established unsupervised technique, has found extensive use in automatically detecting significant topics within a corpus of documents. However, classic topic modelling approaches (e.g., LDA) have certain drawbacks, such as the lack of semantic understanding and the presence of overlapping topics. In this work, we investigate the untapped potential of large language models (LLMs) as an alternative for uncovering the underlying topics within extensive text corpora. To this end, we introduce a framework that prompts LLMs to generate topics from a given set of documents and establish evaluation protocols to assess the clustering efficacy of LLMs. Our findings indicate that LLMs with appropriate prompts can stand out as a viable alternative, capable of generating relevant topic titles and adhering to human guidelines to refine and merge topics. Through in-depth experiments and evaluation, we summarise the advantages and constraints of employing LLMs in topic extraction.
https://arxiv.org/abs/2403.16248
We study the problem of unsupervised domain adaptation for egocentric videos. We propose a transformer-based model to learn class-discriminative and domain-invariant feature representations. It consists of two novel designs. The first module is called Generative Adversarial Domain Alignment Network with the aim of learning domain-invariant representations. It simultaneously learns a mask generator and a domain-invariant encoder in an adversarial way. The domain-invariant encoder is trained to minimize the distance between the source and target domain. The masking generator, conversely, aims at producing challenging masks by maximizing the domain distance. The second is a Masked Consistency Learning module to learn class-discriminative representations. It enforces the prediction consistency between the masked target videos and their full forms. To better evaluate the effectiveness of domain adaptation methods, we construct a more challenging benchmark for egocentric videos, U-Ego4D. Our method achieves state-of-the-art performance on the Epic-Kitchen and the proposed U-Ego4D benchmark.
https://arxiv.org/abs/2403.16242
This paper presents an innovative framework designed to train an image deblurring algorithm tailored to a specific camera device. This algorithm works by transforming a blurry input image, which is challenging to deblur, into another blurry image that is more amenable to deblurring. The transformation process, from one blurry state to another, leverages unpaired data consisting of sharp and blurry images captured by the target camera device. Learning this blur-to-blur transformation is inherently simpler than direct blur-to-sharp conversion, as it primarily involves modifying blur patterns rather than the intricate task of reconstructing fine image details. The efficacy of the proposed approach has been demonstrated through comprehensive experiments on various benchmarks, where it significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Our code and data are available at this https URL
https://arxiv.org/abs/2403.16205
Unsupervised landmarks discovery (ULD) for an object category is a challenging computer vision problem. In pursuit of developing a robust ULD framework, we explore the potential of a recent paradigm of self-supervised learning algorithms, known as diffusion models. Some recent works have shown that these models implicitly contain important correspondence cues. Towards harnessing the potential of diffusion models for the ULD task, we make the following core contributions. First, we propose a ZeroShot ULD baseline based on simple clustering of random pixel locations with nearest neighbour matching. It delivers better results than existing ULD methods. Second, motivated by the ZeroShot performance, we develop a ULD algorithm based on diffusion features using self-training and clustering which also outperforms prior methods by notable margins. Third, we introduce a new proxy task based on generating latent pose codes and also propose a two-stage clustering mechanism to facilitate effective pseudo-labeling, resulting in a significant performance improvement. Overall, our approach consistently outperforms state-of-the-art methods on four challenging benchmarks AFLW, MAFL, CatHeads and LS3D by significant margins.
https://arxiv.org/abs/2403.16194
Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions. Current methods fall short of accurately identifying and mitigating these hallucinations. To address this issue, we introduce ESREAL, a novel unsupervised learning framework designed to suppress the generation of hallucinations through accurate localization and penalization of hallucinated tokens. Initially, ESREAL creates a reconstructed image based on the generated caption and aligns its corresponding regions with those of the original image. This semantic reconstruction aids in identifying both the presence and type of token-level hallucinations within the generated caption. Subsequently, ESREAL computes token-level hallucination scores by assessing the semantic similarity of aligned regions based on the type of hallucination. Finally, ESREAL employs a proximal policy optimization algorithm, where it selectively penalizes hallucinated tokens according to their token-level hallucination scores. Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46% on the CHAIR metric. This improvement is achieved solely through signals derived from the image itself, without the need for any image-text pairs.
https://arxiv.org/abs/2403.16167
Graph data, also known as complex network data, is omnipresent across various domains and applications. Prior graph neural network models primarily focused on extracting task-specific structural features through supervised learning objectives, but they fell short in capturing the inherent semantic and structural features of the entire graph. In this paper, we introduce the semantic-structural attention-enhanced graph convolutional network (SSA-GCN), which not only models the graph structure but also extracts generalized unsupervised features to enhance vertex classification performance. The SSA-GCN's key contributions lie in three aspects: firstly, it derives semantic information through unsupervised feature extraction from a knowledge graph perspective; secondly, it obtains structural information through unsupervised feature extraction from a complex network perspective; and finally, it integrates these features through a cross-attention mechanism. By leveraging these features, we augment the graph convolutional network, thereby enhancing the model's generalization capabilities. Our experiments on the Cora and CiteSeer datasets demonstrate the performance improvements achieved by our proposed method. Furthermore, our approach also exhibits excellent accuracy under privacy settings, making it a robust and effective solution for graph data analysis.
https://arxiv.org/abs/2403.16033
The rise of social media platforms has led to an increase in polarised online discussions, especially on political and socio-cultural topics such as elections and climate change. We propose a simple and novel unsupervised method to predict whether the authors of two posts agree or disagree, leveraging user stances about named entities obtained from their posts. We present STEntConv, a model which builds a graph of users and named entities weighted by stance and trains a Signed Graph Convolutional Network (SGCN) to detect disagreement between comment and reply posts. We run experiments and ablation studies and show that including this information improves disagreement detection performance on a dataset of Reddit posts for a range of controversial subreddit topics, without the need for platform-specific features or user history.
https://arxiv.org/abs/2403.15885
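The stance-weighted user-entity graph at the heart of STEntConv can be sketched with a toy edge builder. The stance-extraction step (which in the paper comes from the users' posts) is abstracted away, and the data layout here is hypothetical:

```python
def signed_edges(user_stances):
    """Build signed user-entity edges from per-user stance scores.

    user_stances: {user: {entity: stance in [-1, 1]}}
    Returns (user, entity, sign) triples: +1 for a positive stance,
    -1 for a negative one; neutral (zero) stances yield no edge.
    """
    edges = []
    for user, stances in user_stances.items():
        for entity, s in stances.items():
            if s != 0:
                edges.append((user, entity, 1 if s > 0 else -1))
    return edges
```

A Signed Graph Convolutional Network then propagates features separately along positive and negative edges, so two users who disagree about the same entities end up with dissimilar embeddings.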
The inductive bias of a convolutional neural network (CNN) can act as a strong prior for image restoration, which is known as the Deep Image Prior (DIP). In recent years, DIP has been utilized in unsupervised dynamic MRI reconstruction, which adopts a generative model from the latent space to the image space. However, existing methods usually utilize a single pyramid-shaped CNN architecture to parameterize the generator, which cannot effectively exploit the spatio-temporal correlations within the dynamic data. In this work, we propose a novel scheme to exploit the DIP prior for dynamic MRI reconstruction, named "Graph Image Prior" (GIP). The generative model is decomposed into two stages, image recovery and manifold discovery, bridged by a graph convolutional network to exploit the spatio-temporal correlations. In addition, we devise an ADMM algorithm to alternately optimize the images and the network parameters to further improve the reconstruction performance. Experimental results demonstrate that GIP outperforms compressed sensing methods and unsupervised methods over different sampling trajectories, and significantly reduces the performance gap with state-of-the-art supervised deep-learning methods. Moreover, GIP displays superior generalization ability when transferred to a different reconstruction setting, without the need for any additional data.
https://arxiv.org/abs/2403.15770