We propose RainyScape, an unsupervised framework for reconstructing clean scenes from a collection of multi-view rainy images. RainyScape consists of two main modules: a neural rendering module and a rain-prediction module that incorporates a predictor network and a learnable latent embedding that captures the rain characteristics of the scene. Specifically, based on the spectral bias property of neural networks, we first optimize the neural rendering pipeline to obtain a low-frequency scene representation. Subsequently, we jointly optimize the two modules, driven by the proposed adaptive direction-sensitive gradient-based reconstruction loss, which encourages the network to distinguish between scene details and rain streaks, facilitating the propagation of gradients to the relevant components. Extensive experiments on both the classic neural radiance field and the recently proposed 3D Gaussian splatting demonstrate the superiority of our method in effectively eliminating rain streaks and rendering clean images, achieving state-of-the-art performance. The constructed high-quality dataset and source code will be publicly available.
https://arxiv.org/abs/2404.11401
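The abstract above does not give the loss formula; as a rough illustration only, a direction-sensitive gradient reconstruction loss can weight horizontal and vertical image gradients differently, since rain streaks are predominantly oriented and one direction can be down-weighted. The plain L1 form and the fixed weights `w_x`/`w_y` below are assumptions, not the paper's adaptive formulation:

```python
import numpy as np

def gradient_loss(pred, target, w_x=1.0, w_y=0.5):
    """Direction-weighted gradient reconstruction loss (illustrative sketch).

    pred, target: 2-D arrays (grayscale images). Rain streaks are mostly
    near-vertical, so the vertical-gradient term is down-weighted (w_y < w_x)
    to avoid over-penalizing streak-like structure; the paper's actual loss
    is adaptive and more involved.
    """
    dx_p, dy_p = np.diff(pred, axis=1), np.diff(pred, axis=0)
    dx_t, dy_t = np.diff(target, axis=1), np.diff(target, axis=0)
    return w_x * np.abs(dx_p - dx_t).mean() + w_y * np.abs(dy_p - dy_t).mean()

# identical images give zero loss; a constant brightness offset also does,
# because it has no gradient -- the loss only compares image structure
img = np.random.rand(8, 8)
print(gradient_loss(img, img))  # 0.0
```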
Time series anomaly detection (TAD) faces a significant challenge due to the scarcity of labelled data, which hinders the development of accurate detection models. Unsupervised domain adaptation (UDA) addresses this challenge by leveraging a labelled dataset from a related domain to detect anomalies in a target dataset. Existing domain adaptation techniques assume that the number of anomalous classes does not change between the source and target domains. In this paper, we propose a novel Domain Adaptation Contrastive learning for Anomaly Detection in multivariate time series (DACAD) model to address this issue by combining UDA and contrastive representation learning. DACAD's approach includes an anomaly injection mechanism that introduces various types of synthetic anomalies, enhancing the model's ability to generalise across unseen anomalous classes in different domains. This method significantly broadens the model's adaptability and robustness. Additionally, we propose a supervised contrastive loss for the source domain and a self-supervised contrastive triplet loss for the target domain, improving comprehensive feature representation learning and extraction of domain-invariant features. Finally, an effective Centre-based Entropy Classifier (CEC) is proposed specifically for anomaly detection, facilitating accurate learning of normal boundaries in the source domain. Our extensive evaluation across multiple real-world datasets against leading models in time series anomaly detection and UDA underscores DACAD's effectiveness. The results validate DACAD's superiority in transferring knowledge across domains and its potential to mitigate the challenge of limited labelled data in time series anomaly detection.
https://arxiv.org/abs/2404.11269
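As a minimal sketch of the target-domain objective mentioned above, the standard contrastive triplet loss (with squared Euclidean distance and a margin, both assumed choices here) looks like:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard contrastive triplet loss (illustrative sketch).

    Pulls the anchor toward the positive and pushes it away from the
    negative by at least `margin`, using squared Euclidean distance.
    """
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # close to the anchor
n = np.array([3.0, 4.0])   # far from the anchor
print(triplet_loss(a, p, n))  # 0.0: the margin is already satisfied
```

In DACAD the positives and negatives would come from injected synthetic anomalies rather than class labels, which is what lets the loss run self-supervised on the unlabelled target domain.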
Federated learning aims to tackle the "isolated data island" problem, where it trains a collective model from physically isolated clients while safeguarding the privacy of users' data. However, supervised federated learning necessitates that each client labels their data for training, which can be both time-consuming and resource-intensive, and may even be impractical for edge devices. Moreover, the training and transmission of deep models present challenges to the computation and communication capabilities of the clients. To address these two inherent challenges in supervised federated learning, we propose a novel lightweight unsupervised federated learning approach that leverages unlabeled data on each client to perform lightweight model training and communication by harnessing pre-trained vision-language models, such as CLIP. By capitalizing on the zero-shot prediction capability and the well-trained image encoder of the pre-trained CLIP model, we have carefully crafted an efficient and resilient self-training approach. This method refines the initial zero-shot predicted pseudo-labels of unlabeled instances through the sole training of a linear classifier on top of the fixed image encoder. Additionally, to address data heterogeneity within each client, we propose a class-balanced text feature sampling strategy for generating synthetic instances in the feature space to support local training. Experiments are conducted on multiple benchmark datasets. The experimental results demonstrate that our proposed method greatly enhances model performance in comparison to CLIP's zero-shot predictions and even outperforms supervised federated learning benchmark methods given limited computational and communication overhead.
https://arxiv.org/abs/2404.11046
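The class-balanced text feature sampling described above could, under simple assumptions, be sketched as perturbing each class's text embedding with Gaussian noise; the `noise_std` hyperparameter and the plain-Gaussian form are our assumptions, not the paper's exact strategy:

```python
import numpy as np

def sample_balanced_features(text_feats, n_per_class, noise_std=0.1, seed=0):
    """Class-balanced synthetic feature sampling (illustrative sketch).

    text_feats: (C, D) array of per-class text embeddings (e.g. from a CLIP
    text encoder). Each class contributes the same number of synthetic
    instances, obtained by adding Gaussian noise to its text feature, so
    local training sees a balanced class distribution even when the
    client's real data is skewed.
    """
    rng = np.random.default_rng(seed)
    n_classes = text_feats.shape[0]
    feats = np.repeat(text_feats, n_per_class, axis=0)
    feats = feats + rng.normal(0.0, noise_std, feats.shape)
    labels = np.repeat(np.arange(n_classes), n_per_class)
    return feats, labels

feats, labels = sample_balanced_features(np.eye(3), n_per_class=4)
print(feats.shape, np.bincount(labels))  # (12, 3) [4 4 4]
```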
Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases and can learn high-dimensional functions with numerical inputs. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
https://arxiv.org/abs/2404.11018
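The three prompting regimes above differ only in what each in-context example contributes. A minimal, hypothetical prompt builder makes the distinction concrete (the formatting and field names are ours, not the paper's):

```python
def build_prompt(examples, query, mode="unsupervised"):
    """Sketch of prompt construction for the ICL settings (illustrative).

    examples: list of (question, rationale, answer) triples. In standard
    many-shot ICL the rationales and answers are human-written; in
    Reinforced ICL they are model-generated chain-of-thought; Unsupervised
    ICL drops them entirely and shows only the domain-specific questions.
    """
    parts = []
    for question, rationale, answer in examples:
        if mode == "unsupervised":
            parts.append(f"Q: {question}")
        else:  # "many-shot" (human) or "reinforced" (model-generated)
            parts.append(f"Q: {question}\nA: {rationale} The answer is {answer}.")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

demo = [("2+2?", "Add two and two.", "4")]
print(build_prompt(demo, "3+5?", mode="unsupervised"))
```

With hundreds or thousands of entries in `examples`, the same builder produces the many-shot prompts the paper studies.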
Lung-infected area segmentation is crucial for assessing the severity of lung diseases. However, existing image-text multi-modal methods typically rely on labour-intensive annotations for model training, posing challenges regarding time and expertise. To address this issue, we propose a novel attribute knowledge-guided framework for unsupervised lung-infected area segmentation (AKGNet), which achieves segmentation solely based on image-text data without any mask annotation. AKGNet facilitates text attribute knowledge learning, attribute-image cross-attention fusion, and high-confidence-based pseudo-label exploration simultaneously. It can learn statistical information and capture spatial correlations between image and text attributes in the embedding space, iteratively refining the mask to enhance segmentation. Specifically, we introduce a text attribute knowledge learning module by extracting attribute knowledge and incorporating it into feature representations, enabling the model to learn statistical information and adapt to different attributes. Moreover, we devise an attribute-image cross-attention module by calculating the correlation between attributes and images in the embedding space to capture spatial dependency information, thus selectively focusing on relevant regions while filtering irrelevant areas. Finally, a self-training mask improvement process is employed by generating pseudo-labels using high-confidence predictions to iteratively enhance the mask and segmentation. Experimental results on a benchmark medical image dataset demonstrate the superior performance of our method compared to state-of-the-art segmentation techniques in unsupervised scenarios.
https://arxiv.org/abs/2404.11008
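The high-confidence pseudo-label step above can be sketched with a simple confidence threshold; the 0.9 cutoff is an assumed value, not the paper's:

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """High-confidence pseudo-label selection (illustrative sketch).

    probs: (N, K) per-pixel (or per-sample) class probabilities. Only
    predictions whose maximum probability exceeds `threshold` receive a
    pseudo-label; the rest are marked -1 (ignored), so each self-training
    round refines the mask only on confident regions.
    """
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    labels[confidence < threshold] = -1
    return labels

probs = np.array([[0.95, 0.05],
                  [0.60, 0.40]])
print(select_pseudo_labels(probs))  # [ 0 -1]
```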
Traditional search systems focus on query formulation for effective results but face challenges in scenarios such as product searches where crucial product details (e.g., size, color) remain concealed until users visit specific product pages. This highlights the need for intelligent web navigation agents capable of formulating queries and navigating web pages according to users' high-level intents. In response to this need, this work introduces a Grounded Language Agent for Intelligent Web Interactions, called GLAINTEL. Drawing upon advancements in language modeling and reinforcement learning, GLAINTEL investigates the efficacy of transformer-based models in enhancing the search capabilities of interactive web environments. Given the dynamic action space for each state in web navigation, GLAINTEL employs the Flan-T5 architecture and incorporates language modeling and value estimation heads. This work focuses on training smaller language models as agents across various scenarios, systematically evaluating the impact of human demonstrations on the training process. Specifically, we investigate scenarios where no human demonstrations are available and subsequently assess the effective utilization of such demonstrations. We also explore unsupervised domain adaptation for situations where demonstrations are confined to a specific domain. Experimental evaluations across diverse setups demonstrate the effectiveness of training agents in unsupervised settings, outperforming in-context learning-based approaches that employ larger models with up to 540 billion parameters. Surprisingly, behavioral cloning-based methods that straightforwardly use human demonstrations do not outperform unsupervised learning-based methods. Additionally, combining human demonstrations with Reinforcement Learning-based training yields results comparable to models utilizing GPT-4.
https://arxiv.org/abs/2404.10887
This paper presents a computationally efficient and distributed speaker diarization framework for networked IoT-style audio devices. The work proposes a Federated Learning model which can identify the participants in a conversation without the requirement of a large audio database for training. An unsupervised online update mechanism is proposed for the Federated Learning model which depends on cosine similarity of speaker embeddings. Moreover, the proposed diarization system solves the problem of speaker change detection via unsupervised segmentation techniques using Hotelling's t-squared statistic and the Bayesian Information Criterion. In this new approach, speaker change detection is biased around detected quasi-silences, which reduces the severity of the trade-off between the missed detection and false detection rates. Additionally, the computational overhead due to frame-by-frame identification of speakers is reduced via unsupervised clustering of speech segments. The results demonstrate the effectiveness of the proposed training method in the presence of non-IID speech data. It also shows a considerable improvement in the reduction of false and missed detection at the segmentation stage, while reducing the computational overhead. Improved accuracy and reduced computational cost makes the mechanism suitable for real-time speaker diarization across a distributed IoT audio network.
https://arxiv.org/abs/2404.10842
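The cosine-similarity online update above might look like the following sketch, where the similarity `threshold` and the running-average rate `alpha` are assumed values, not the paper's:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def assign_or_enroll(embedding, centroids, threshold=0.7, alpha=0.1):
    """Unsupervised online speaker update via cosine similarity (sketch).

    If the new speaker embedding is similar enough to an existing centroid,
    that centroid is nudged toward it with a running average; otherwise a
    new speaker is enrolled. Returns the assigned speaker index.
    """
    if centroids:
        sims = [cosine(embedding, c) for c in centroids]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            centroids[best] = (1 - alpha) * centroids[best] + alpha * embedding
            return best
    centroids.append(embedding.copy())
    return len(centroids) - 1

centroids = []
print(assign_or_enroll(np.array([1.0, 0.0]), centroids))  # 0 (new speaker)
print(assign_or_enroll(np.array([0.9, 0.1]), centroids))  # 0 (same speaker)
print(assign_or_enroll(np.array([0.0, 1.0]), centroids))  # 1 (new speaker)
```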
Anomaly detection (AD) is often focused on detecting anomaly areas for industrial quality inspection and medical lesion examination. However, due to the specific scenario targets, the data scale for AD is relatively small, and evaluation metrics are still deficient compared to classic vision tasks, such as object detection and semantic segmentation. To fill these gaps, this work first constructs a large-scale and general-purpose COCO-AD dataset by extending COCO to the AD field. This enables fair evaluation and sustainable development for different methods on this challenging benchmark. Moreover, current metrics such as AU-ROC have nearly reached saturation on simple datasets, which prevents a comprehensive evaluation of different methods. Inspired by the metrics in the segmentation field, we further propose several more practical threshold-dependent AD-specific metrics, i.e., m$F_1^{.2}_{.8}$, mAcc$^{.2}_{.8}$, mIoU$^{.2}_{.8}$, and mIoU-max. Motivated by GAN inversion's high-quality reconstruction capability, we propose a simple but more powerful InvAD framework to achieve high-quality feature reconstruction. Our method improves the effectiveness of reconstruction-based methods on popular MVTec AD, VisA, and our newly proposed COCO-AD datasets under a multi-class unsupervised setting, where only a single detection model is trained to detect anomalies from different classes. Extensive ablation experiments have demonstrated the effectiveness of each component of our InvAD. Full codes and models are available at this https URL.
https://arxiv.org/abs/2404.10760
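The threshold-averaged metrics above can be illustrated for the mF1-style case: compute binary F1 at each anomaly-score threshold from 0.2 to 0.8 and average. The 0.1 step size is an assumption; the paper's exact threshold grid may differ:

```python
import numpy as np

def f1_at_threshold(scores, gt, t):
    """Binary F1 when anomaly scores are thresholded at t."""
    pred = scores >= t
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def mF1(scores, gt, lo=0.2, hi=0.8, step=0.1):
    """Average F1 over thresholds in [lo, hi] (illustrative sketch).

    Rewards methods that stay accurate across many operating points
    instead of at one cherry-picked threshold.
    """
    thresholds = np.arange(lo, hi + 1e-9, step)
    return float(np.mean([f1_at_threshold(scores, gt, t) for t in thresholds]))

scores = np.array([0.10, 0.90, 0.95, 0.15])
gt = np.array([False, True, True, False])
print(mF1(scores, gt))  # 1.0: perfectly separated at every threshold
```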
Modern smartphone camera quality heavily relies on the image signal processor (ISP) to enhance captured raw images, utilizing carefully designed modules to produce final output images encoded in a standard color space (e.g., sRGB). Neural-based end-to-end learnable ISPs offer promising advancements, potentially replacing traditional ISPs with their ability to adapt without requiring extensive tuning for each new camera model, as is often the case for nearly every module in traditional ISPs. However, the key challenge with the recent learning-based ISPs is the need to collect large paired datasets for each distinct camera model due to the influence of intrinsic camera characteristics on the formation of input raw images. This paper tackles this challenge by introducing a novel method for unpaired learning of raw-to-raw translation across diverse cameras. Specifically, we propose Rawformer, an unsupervised Transformer-based encoder-decoder method for raw-to-raw translation. It accurately maps raw images captured by a certain camera to the target camera, facilitating the generalization of learnable ISPs to new unseen cameras. Our method demonstrates superior performance on real camera datasets, achieving higher accuracy compared to previous state-of-the-art techniques, and preserving a more robust correlation between the original and translated raw images.
https://arxiv.org/abs/2404.10700
We explore simple methods for adapting a trained multi-task UNet which predicts canopy cover and height to a new geographic setting using remotely sensed data without the need to train a domain-adaptive classifier and extensive fine-tuning. Extending previous research, we followed a selective alignment process to identify similar images in the two geographical domains and then tested an array of data-based unsupervised domain adaptation approaches in a zero-shot setting as well as with a small amount of fine-tuning. We find that the selective aligned data-based image matching methods produce promising results in a zero-shot setting, and even more so with a small amount of fine-tuning. These methods outperform both an untransformed baseline and a popular data-based image-to-image translation model. The best performing methods were pixel distribution adaptation and Fourier domain adaptation on the canopy cover and height tasks respectively.
https://arxiv.org/abs/2404.10626
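Pixel distribution adaptation, in its simplest moment-matching form, shifts each channel of a source-domain image so its statistics match a target-domain reference. This is a sketch of that simplest variant; practical implementations may use fuller histogram or PCA matching:

```python
import numpy as np

def pixel_distribution_adapt(src, ref):
    """Per-channel moment matching (illustrative sketch).

    src, ref: (H, W, C) images. Each source channel is standardized and
    then rescaled to the reference channel's mean and standard deviation,
    so the adapted image's pixel distribution resembles the target domain.
    """
    out = np.empty_like(src, dtype=float)
    for c in range(src.shape[-1]):
        s, r = src[..., c], ref[..., c]
        out[..., c] = (s - s.mean()) / (s.std() + 1e-8) * r.std() + r.mean()
    return out

src = np.random.rand(16, 16, 3)          # source-domain image
ref = np.random.rand(16, 16, 3) * 2 + 1  # target-domain reference
adapted = pixel_distribution_adapt(src, ref)
print(np.allclose(adapted.mean(axis=(0, 1)), ref.mean(axis=(0, 1))))  # True
```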
Standard Unsupervised Domain Adaptation (UDA) aims to transfer knowledge from a labeled source domain to an unlabeled target but usually requires simultaneous access to both source and target data. Moreover, UDA approaches commonly assume that source and target domains share the same label space. Yet, these two assumptions are hardly satisfied in real-world scenarios. This paper considers the more challenging Source-Free Open-set Domain Adaptation (SF-OSDA) setting, where both assumptions are dropped. We propose a novel approach for SF-OSDA that exploits the granularity of target-private categories by segregating their samples into multiple unknown classes. Starting from an initial clustering-based assignment, our method progressively improves the segregation of target-private samples by refining their pseudo-labels with the guide of an uncertainty-based sample selection module. Additionally, we propose a novel contrastive loss, named NL-InfoNCELoss, that, integrating negative learning into self-supervised contrastive learning, enhances the model robustness to noisy pseudo-labels. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed method over existing approaches, establishing new state-of-the-art performance. Notably, additional analyses show that our method is able to learn the underlying semantics of novel classes, opening the possibility to perform novel class discovery.
https://arxiv.org/abs/2404.10574
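An uncertainty-based sample selection module like the one above can be sketched with predictive entropy: keep the low-entropy samples and refine pseudo-labels only on those. The median cutoff is an assumed choice, not the paper's rule:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of probability vectors along `axis`."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def select_confident(probs, quantile=0.5):
    """Uncertainty-based sample selection (illustrative sketch).

    probs: (N, K) predicted class probabilities. Samples whose predictive
    entropy falls below the batch quantile are kept for pseudo-label
    refinement; the rest are deferred as too uncertain.
    """
    h = entropy(probs)
    keep = h <= np.quantile(h, quantile)
    return keep, probs.argmax(axis=1)

probs = np.array([[0.98, 0.02],   # confident -> kept
                  [0.55, 0.45]])  # uncertain -> deferred
keep, pseudo = select_confident(probs)
print(keep, pseudo)  # [ True False] [0 0]
```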
This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses with regard to the original and augmented image pairs, and conduct preference alignment with direct preference optimization. It is based on a core idea: properly designed augmentation to the image input will induce VLM to generate false but hard negative responses, which helps the model to learn from and produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT4 or human involvement during alignment, and is highly efficient with few lines of code. With only 8k randomly sampled unsupervised data, it achieves 90\% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7\%/5.6\% score on complex multi-modal benchmark MM-Vet. Visualizations show its improved ability to align with user-intentions. A series of ablation studies is conducted to reveal the latent mechanism of the approach, which also indicates its potential towards further scaling. Code will be available.
https://arxiv.org/abs/2404.10501
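The direct preference optimization step above uses the standard DPO objective; in this paper the chosen/rejected responses would come from the original vs. augmented images rather than human labels. A per-pair sketch of the standard loss:

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are log-probabilities of the chosen/rejected responses under
    the policy and the frozen reference model. The loss decreases as the
    policy raises the chosen response's likelihood relative to the
    rejected one, measured against the reference.
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return float(-np.log(1.0 / (1.0 + np.exp(-margin))))  # -log(sigmoid(margin))

# at initialization the policy equals the reference, so the margin is 0
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))  # ln 2 ~ 0.6931
```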
Trustworthiness is a major prerequisite for the safe application of opaque deep learning models in high-stakes domains like medicine. Understanding the decision-making process not only contributes to fostering trust but might also reveal previously unknown decision criteria of complex models that could advance the state of medical research. The discovery of decision-relevant concepts from black box models is a particularly challenging task. This study proposes Concept Discovery through Latent Diffusion-based Counterfactual Trajectories (CDCT), a novel three-step framework for concept discovery leveraging the superior image synthesis capabilities of diffusion models. In the first step, CDCT uses a Latent Diffusion Model (LDM) to generate a counterfactual trajectory dataset. This dataset is used to derive a disentangled representation of classification-relevant concepts using a Variational Autoencoder (VAE). Finally, a search algorithm is applied to identify relevant concepts in the disentangled latent space. The application of CDCT to a classifier trained on the largest public skin lesion dataset revealed not only the presence of several biases but also meaningful biomarkers. Moreover, the counterfactuals generated within CDCT show better FID scores than those produced by a previously established state-of-the-art method, while being 12 times more resource-efficient. Unsupervised concept discovery holds great potential for the application of trustworthy AI and the further development of human knowledge in various domains. CDCT represents a further step in this direction.
https://arxiv.org/abs/2404.10356
The widespread use of social media has led to a surge in popularity for automated methods of analyzing public opinion. Supervised methods are adept at text categorization, yet the dynamic nature of social media discussions poses a continual challenge for these techniques due to the constant shifting of the focus. On the other hand, traditional unsupervised methods for extracting themes from public discourse, such as topic modeling, often reveal overarching patterns that might not capture specific nuances. Consequently, a significant portion of research into social media discourse still depends on labor-intensive manual coding techniques and a human-in-the-loop approach, which are both time-consuming and costly. In this work, we study the problem of discovering arguments associated with a specific theme. We propose a generic LLMs-in-the-Loop strategy that leverages the advanced capabilities of Large Language Models (LLMs) to extract latent arguments from social media messaging. To demonstrate our approach, we apply our framework to contentious topics. We use two publicly available datasets: (1) the climate campaigns dataset of 14k Facebook ads with 25 themes and (2) the COVID-19 vaccine campaigns dataset of 9k Facebook ads with 14 themes. Furthermore, we analyze demographic targeting and the adaptation of messaging based on real-world events.
https://arxiv.org/abs/2404.10259
There is a belief that learning to compress well will lead to intelligence. Recently, language modeling has been shown to be equivalent to compression, which offers a compelling rationale for the success of large language models (LLMs): the development of more advanced language models is essentially enhancing compression which facilitates intelligence. Despite such appealing discussions, little empirical evidence is present for the interplay between compression and intelligence. In this work, we examine their relationship in the context of LLMs, treating LLMs as data compressors. Given the abstract concept of "intelligence", we adopt the average downstream benchmark scores as a surrogate, specifically targeting intelligence related to knowledge and commonsense, coding, and mathematical reasoning. Across 12 benchmarks, our study brings together 30 public LLMs that originate from diverse organizations. Remarkably, we find that LLMs' intelligence -- reflected by average benchmark scores -- almost linearly correlates with their ability to compress external text corpora. These results provide concrete evidence supporting the belief that superior compression indicates greater intelligence. Furthermore, our findings suggest that compression efficiency, as an unsupervised metric derived from raw text corpora, serves as a reliable evaluation measure that is linearly associated with the model capabilities. We open-source our compression datasets as well as our data collection pipelines to facilitate future researchers to assess compression properly.
https://arxiv.org/abs/2404.09937
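The compression-efficiency metric above follows from the equivalence between language modeling and compression: under arithmetic coding, a model assigning per-token negative log-likelihoods could compress the text to about their sum in bits. A minimal bits-per-character sketch:

```python
import numpy as np

def bits_per_character(token_nll_nats, text):
    """Compression efficiency as bits-per-character (BPC).

    token_nll_nats: per-token negative log-likelihoods (in nats) a language
    model assigns to `text`. Dividing the implied code length in bits by
    the character count gives an unsupervised measure of how well the
    model compresses the corpus -- the quantity the paper finds to be
    linearly associated with benchmark scores.
    """
    total_bits = np.sum(token_nll_nats) / np.log(2.0)
    return float(total_bits / len(text))

# a hypothetical model assigning p = 0.5 to each of 8 tokens of a 16-char text
nll = np.full(8, np.log(2.0))
print(bits_per_character(nll, "x" * 16))  # 0.5 bits per character
```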
This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.
https://arxiv.org/abs/2404.09841
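The WERs reported above follow the standard word-level edit-distance definition: (substitutions + insertions + deletions) divided by the reference length. A self-contained sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate via Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

print(word_error_rate("the cat sat", "the bat sat"))  # 1 sub / 3 words ~ 0.333
```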
Although Reinforcement Learning (RL) algorithms acquire sequential behavioral patterns through interactions with the environment, their effectiveness in noisy and high-dimensional scenarios typically relies on specific structural priors. In this paper, we propose a novel and general Structural Information principles-based framework for effective Decision-Making, namely SIDM, approached from an information-theoretic perspective. This paper presents a specific unsupervised partitioning method that forms vertex communities in the state and action spaces based on their feature similarities. An aggregation function, which utilizes structural entropy as the vertex weight, is devised within each community to obtain its embedding, thereby facilitating hierarchical state and action abstractions. By extracting abstract elements from historical trajectories, a directed, weighted, homogeneous transition graph is constructed. The minimization of this graph's high-dimensional entropy leads to the generation of an optimal encoding tree. An innovative two-layer skill-based learning mechanism is introduced to compute the common path entropy of each state transition as its identified probability, thereby obviating the requirement for expert knowledge. Moreover, SIDM can be flexibly incorporated into various single-agent and multi-agent RL algorithms, enhancing their performance. Finally, extensive evaluations on challenging benchmarks demonstrate that, compared with SOTA baselines, our framework significantly and consistently improves the policy's quality, stability, and efficiency up to 32.70%, 88.26%, and 64.86%, respectively.
https://arxiv.org/abs/2404.09760
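The community aggregation step above can be sketched as follows. This is a minimal, hypothetical simplification (not the paper's code): each vertex is weighted by its normalized contribution `-(d_i/vol) * log2(d_i/vol)` to a one-level structural entropy, and the community embedding is the weighted sum of vertex features. The function names and the one-level entropy form are illustrative assumptions.

```python
import math

def structural_entropy_weights(degrees):
    """Weight each vertex by its contribution -(d_i/vol) * log2(d_i/vol)
    to the one-level structural entropy of the community, then normalise
    the weights to sum to 1 (a simplified, illustrative weighting)."""
    vol = sum(degrees)
    weights = [-(d / vol) * math.log2(d / vol) for d in degrees]
    total = sum(weights)
    if total == 0:  # degenerate single-vertex community
        return [1.0 / len(degrees)] * len(degrees)
    return [w / total for w in weights]

def community_embedding(features, degrees):
    """Aggregate vertex feature vectors into one community embedding
    using the entropy-based weights above."""
    w = structural_entropy_weights(degrees)
    dim = len(features[0])
    return [sum(w[i] * features[i][k] for i in range(len(features)))
            for k in range(dim)]
```

For a three-vertex community with degrees `[2, 1, 1]`, the entropy weights normalise to a uniform `1/3` each, so the embedding reduces to the feature mean; less uniform degree profiles yield non-uniform weightings.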
Today, there have been many achievements in learning the association between voice and face. However, most previous models rely on cosine similarity or L2 distance to evaluate the likeness of voices and faces after contrastive learning, subsequently applying the scores to retrieval and matching tasks. This approach treats the embeddings merely as high-dimensional vectors and exploits only a small portion of the available information. This paper introduces a novel unsupervised framework for learning voice-face associations. By employing a multimodal encoder after contrastive learning and casting the problem as binary classification, we can exploit the implicit information within the embeddings in a more effective and varied manner. Furthermore, by introducing an effective pair-selection method, we enhance the learning outcomes of both contrastive learning and the matching task. Empirical evidence demonstrates that our framework achieves state-of-the-art results in voice-face matching, verification, and retrieval tasks, improving verification by approximately 3%, matching by about 2.5%, and retrieval by around 1.3%.
https://arxiv.org/abs/2404.09509
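The contrast between the two scoring strategies can be sketched as follows: the baseline reduces a voice-face pair to a single cosine number, while the binary-classification view concatenates the two embeddings and lets a learned head output P(match). The `toy_classifier` below is a purely hypothetical stand-in with fixed weights, not the paper's trained multimodal encoder.

```python
import math

def cosine_similarity(u, v):
    """Baseline scoring: a single scalar from the two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1e-12
    nv = math.sqrt(sum(b * b for b in v)) or 1e-12
    return dot / (nu * nv)

def joint_match_score(voice_emb, face_emb, classifier):
    """Binary-classification view: concatenate the two embeddings and
    let a learned classifier output P(match), instead of collapsing the
    pair to one cosine/L2 number."""
    joint = voice_emb + face_emb  # list concatenation, not addition
    return classifier(joint)

def toy_classifier(joint):
    # Hypothetical stand-in for a trained classification head:
    # logistic regression with fixed weights, for illustration only.
    z = sum(0.5 * x for x in joint)
    return 1.0 / (1.0 + math.exp(-z))
```

The point of the reformulation is that the classifier can learn interactions between individual embedding dimensions, whereas cosine similarity discards everything except the angle between the vectors.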
Media Storms, dramatic outbursts of attention to a story, are central components of media dynamics and the attention landscape. Despite their significance, there has been little systematic and empirical research on this concept due to issues of measurement and operationalization. We introduce an iterative human-in-the-loop method to identify media storms in a large-scale corpus of news articles. The text is first transformed into signals of dispersion based on several textual characteristics. In each iteration, we apply unsupervised anomaly detection to these signals; each anomaly is then validated by an expert to confirm the presence of a storm, and those results are then used to tune the anomaly detection in the next iteration. We demonstrate the applicability of this method in two scenarios: first, supplementing an initial list of media storms within a specific time frame; and second, detecting media storms in new time periods. We make available a media storm dataset compiled using both scenarios. Both the method and dataset offer the basis for comprehensive empirical research into the concept of media storms, including characterizing them and predicting their outbursts and durations, in mainstream media or social media platforms.
https://arxiv.org/abs/2404.09299
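The iterative loop above can be sketched as follows, with strong simplifying assumptions: a z-score detector stands in for the paper's unsupervised anomaly detection, and a fixed step on the threshold stands in for its tuning step. The `oracle` callable represents the expert validating each candidate storm.

```python
import statistics

def detect_anomalies(signal, threshold):
    """Flag time points whose dispersion signal deviates from the mean
    by more than `threshold` standard deviations (a simple z-score
    detector standing in for the paper's unsupervised detector)."""
    mu = statistics.mean(signal)
    sigma = statistics.stdev(signal)
    return [i for i, x in enumerate(signal) if abs(x - mu) / sigma > threshold]

def human_in_the_loop(signal, oracle, threshold=3.0, rounds=3):
    """Each round: detect candidates, ask the expert (`oracle`) to
    confirm storms, then adjust the threshold from the confirmation
    rate before the next round (a crude illustrative tuning rule)."""
    confirmed = set()
    for _ in range(rounds):
        candidates = detect_anomalies(signal, threshold)
        labels = [oracle(i) for i in candidates]
        confirmed.update(i for i, ok in zip(candidates, labels) if ok)
        rate = sum(labels) / len(labels) if labels else 0.0
        threshold += 0.5 if rate < 0.5 else -0.5
    return sorted(confirmed)
```

With a flat signal containing one spike and an oracle that confirms it, the loop converges on the spike while the threshold relaxes to admit further candidates in later rounds.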
Over the past decade, sustained efforts have been devoted to developing highly expressive and controllable text-to-speech (TTS) systems. In general, a holistic TTS system comprises two interconnected components: a frontend module and a backend module. The frontend captures linguistic representations from the raw text input, while the backend converts those linguistic cues to speech. The research community has shown growing interest in the frontend component, recognizing its pivotal role in TTS systems, covering Text Normalization (TN), Prosody Boundary Prediction (PBP), and Polyphone Disambiguation (PD). Nonetheless, insufficient annotated textual data and the reliance on homogeneous text signals significantly undermine the effectiveness of supervised learning for these tasks. To overcome this obstacle, this paper proposes a novel two-stage TTS frontend prediction pipeline named TAP-FM. Specifically, in the first stage, we present a Multi-scale Contrastive Text-audio Pre-training protocol (MC-TAP), which acquires richer insights via multi-granularity contrastive pre-training in an unsupervised manner. Instead of mining homogeneous features as prior pre-training approaches do, our framework delves into both global and local text-audio semantic and acoustic representations. Furthermore, in the second stage, a parallelized TTS frontend model is devised to execute the TN, PD, and PBP prediction tasks. Finally, extensive experiments illustrate the superiority of our proposed method, which achieves state-of-the-art performance.
https://arxiv.org/abs/2404.09192
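The multi-granularity contrastive objective can be sketched generically as an InfoNCE-style loss applied at two scales: one global (utterance-level) text-audio term plus the mean of local (token/frame-level) terms. This is a generic sketch of the contrastive objective family, not MC-TAP's actual loss; the function names and the simple sum of the two scales are assumptions.

```python
import math

def info_nce(anchor, candidates, pos_index, temperature=0.1):
    """InfoNCE-style loss for one anchor: the candidate at `pos_index`
    is the paired embedding from the other modality; the rest of the
    batch serve as negatives."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u)) or 1e-12
        nv = math.sqrt(sum(b * b for b in v)) or 1e-12
        return dot / (nu * nv)
    logits = [cos(anchor, c) / temperature for c in candidates]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[pos_index] / sum(exps))

def multi_scale_loss(global_pair, local_pairs, temperature=0.1):
    """Combine a global (utterance-level) term with the mean of local
    (token/frame-level) terms, mirroring multi-granularity pre-training."""
    g = info_nce(*global_pair, temperature=temperature)
    l = sum(info_nce(*p, temperature=temperature)
            for p in local_pairs) / len(local_pairs)
    return g + l
```

When the anchor aligns with its positive and is orthogonal to the negatives, the loss is near zero; swapping the positive for a mismatched candidate drives it up, which is the gradient signal that pulls paired text-audio representations together.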