Transformer-based methods have demonstrated excellent performance on super-resolution visual tasks, surpassing conventional convolutional neural networks. However, existing work typically restricts self-attention computation to non-overlapping windows to save computational cost. This means that Transformer-based networks can only exploit input information from a limited spatial range. Therefore, a novel Hybrid Multi-Axis Aggregation network (HMA) is proposed in this paper to better exploit the potential information in features. HMA is constructed by stacking Residual Hybrid Transformer Blocks (RHTB) and Grid Attention Blocks (GAB). On the one hand, RHTB combines channel attention and self-attention to enhance non-local feature fusion and produce more visually appealing results. On the other hand, GAB is used for cross-domain information interaction to jointly model similar features and obtain a larger receptive field. In addition, a novel pre-training method is designed for the super-resolution task to further enhance the model's representation capabilities, and the proposed model's effectiveness is validated through extensive experiments. The experimental results show that HMA outperforms state-of-the-art methods on benchmark datasets. We provide code and models at this https URL.
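The limited-window issue and the grid remedy can both be illustrated with plain tensor reshapes: window partitioning confines attention to local neighborhoods, while a grid partition (in the spirit of GAB; the exact HMA design may differ) groups strided positions so that attention within a group spans the whole image. A minimal PyTorch sketch:

```python
import torch

def window_partition(x: torch.Tensor, w: int) -> torch.Tensor:
    """Split (B, H, W, C) into non-overlapping w x w local windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    # -> (num_windows*B, w*w, C): tokens attend only within one window
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

def grid_partition(x: torch.Tensor, g: int) -> torch.Tensor:
    """Split (B, H, W, C) into groups of g*g strided tokens.

    Tokens in a group are H//g (resp. W//g) pixels apart, so attention
    inside a group models long-range, cross-window interactions.
    """
    B, H, W, C = x.shape
    x = x.view(B, g, H // g, g, W // g, C)
    # group by the inner (strided) axes instead of the outer ones
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, g * g, C)
```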
https://arxiv.org/abs/2405.05001
Accurately estimating a Health Index (HI) from condition monitoring (CM) data is essential for reliable and interpretable prognostics and health management (PHM) in complex systems. In most scenarios, complex systems operate under varying operating conditions and can exhibit different fault modes, making unsupervised inference of an HI from CM data a significant challenge. Hybrid models combining prior knowledge about degradation with deep learning models have been proposed to overcome this challenge. However, previously suggested hybrid models for HI estimation usually rely heavily on system-specific information, limiting their transferability to other systems. In this work, we propose an unsupervised hybrid method for HI estimation that integrates general knowledge about degradation into the model architecture and learning algorithm of a convolutional autoencoder, enhancing its applicability across various systems. The effectiveness of the proposed method is demonstrated in two case studies from different domains: turbofan engines and lithium batteries. The results show that the proposed method outperforms other competitive alternatives, including residual-based methods, in terms of HI quality and utility for Remaining Useful Life (RUL) prediction. The case studies also highlight the comparable performance of our proposed method with a supervised model trained with HI labels.
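The abstract leaves the "general knowledge about degradation" unspecified; one common instantiation is a soft monotonicity prior on the latent HI trajectory. A minimal sketch under that assumption (all shapes, names, and the trade-off weight are hypothetical, not the authors' exact formulation):

```python
import torch
import torch.nn.functional as F

def hi_losses(x, x_hat, hi):
    """x, x_hat: (B, T, F) CM windows and reconstructions; hi: (B, T) latent HI.

    Reconstruction keeps the latent HI informative; the soft-monotonicity
    term encodes the generic prior that health does not recover on its own.
    """
    recon = F.mse_loss(x_hat, x)
    # penalize any increase of the HI along time, assuming the HI decreases
    # as the system degrades -- a generic, system-agnostic degradation prior
    mono = F.relu(hi[:, 1:] - hi[:, :-1]).mean()
    return recon + 0.1 * mono  # the trade-off weight is an assumption
```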
https://arxiv.org/abs/2405.04990
Diffusion probabilistic models (DPMs) have exhibited significant effectiveness in computer vision tasks, particularly in image generation. However, their notable performance heavily relies on labelled datasets, which limits their application in medical imaging due to the high cost of annotation. Current DPM-related methods for lesion detection in medical imaging, which can be categorized into two distinct approaches, primarily rely on image-level annotations. The first approach, based on anomaly detection, involves learning reference healthy brain representations and identifying anomalies based on the difference in inference results. In contrast, the second approach, resembling a segmentation task, employs only the original brain multi-modalities as prior information for generating pixel-level annotations. In this paper, our proposed model, discrepancy distribution medical diffusion (DDMD), introduces a novel framework for lesion detection in brain MRI by incorporating distinctive discrepancy features, deviating from the conventional direct reliance on image-level annotations or the original brain modalities. In our method, the inconsistency in image-level annotations is translated into distribution discrepancies among heterogeneous samples while preserving information within homogeneous samples. This property retains pixel-wise uncertainty and facilitates an implicit ensemble of segmentations, ultimately enhancing the overall detection performance. Thorough experiments conducted on the BRATS2020 benchmark dataset, which contains multimodal MRI scans for brain tumour detection, demonstrate the strong performance of our approach in comparison to state-of-the-art methods.
https://arxiv.org/abs/2405.04974
Long text understanding is important yet challenging for natural language processing. A long article or document usually contains many redundant words that are not pertinent to its gist and can sometimes be regarded as noise. With recent advances in abstractive summarization, we propose our \emph{Gist Detector} to leverage the gist detection ability of a summarization model and integrate the extracted gist into downstream models to enhance their long text understanding ability. Specifically, Gist Detector first learns the gist detection knowledge distilled from a summarization model, and then produces gist-aware representations to augment downstream models. We evaluate our method on three different tasks: long document classification, distantly supervised open-domain question answering, and non-parallel text style transfer. The experimental results show that our method can significantly improve the performance of baseline models on all tasks.
https://arxiv.org/abs/2405.04955
Automating visual inspection in industrial production lines is essential for increasing product quality across various industries. Anomaly detection (AD) methods serve as robust tools for this purpose. However, existing public datasets primarily consist of images without anomalies, limiting the practical application of AD methods in production settings. To address this challenge, we present (1) the Valeo Anomaly Dataset (VAD), a novel real-world industrial dataset comprising 5000 images, including 2000 instances of challenging real defects across more than 20 subclasses. Acknowledging that traditional AD methods struggle with this dataset, we introduce (2) Segmentation-based Anomaly Detector (SegAD). First, SegAD leverages anomaly maps as well as segmentation maps to compute local statistics. Next, SegAD uses these statistics and an optional supervised classifier score as input features for a Boosted Random Forest (BRF) classifier, yielding the final anomaly score. Our SegAD achieves state-of-the-art performance on both VAD (+2.1% AUROC) and the VisA dataset (+0.4% AUROC). The code and the models are publicly available.
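A rough sketch of the two SegAD stages as described: per-region statistics over the anomaly map, then a boosted tree classifier. sklearn's GradientBoostingClassifier stands in for the BRF, and the region set and feature choices are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for the BRF

REGIONS = [0, 1, 2]  # fixed component classes in seg_map (an assumption)

def local_stats(anomaly_map: np.ndarray, seg_map: np.ndarray) -> np.ndarray:
    """One fixed-length feature row: per-region statistics of the anomaly map."""
    feats = []
    for r in REGIONS:
        vals = anomaly_map[seg_map == r]
        vals = vals if vals.size else np.zeros(1)   # region absent in this image
        feats += [vals.mean(), vals.max(), np.quantile(vals, 0.95)]
    return np.asarray(feats)

# X = np.stack([local_stats(a, s) for a, s in zip(maps, segs)])
# optionally append a supervised classifier score as one more column, then:
# brf = GradientBoostingClassifier().fit(X, y)   # y: image-level labels
# final_score = brf.predict_proba(X_test)[:, 1]
```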
https://arxiv.org/abs/2405.04953
Facial feature tracking is essential in imaging ballistocardiography for accurate heart rate estimation, and skin feature tracking enables motor degradation quantification in Parkinson's disease. While deep convolutional neural networks have shown remarkable accuracy in tracking tasks, they typically require extensive labeled data for supervised training. Our proposed pipeline employs a convolutional stacked autoencoder to match image crops with a reference crop containing the target feature, learning deep feature encodings specific to the object category in an unsupervised manner and thus reducing data requirements. To overcome edge effects that make performance dependent on crop size, we introduce a Gaussian weight on the residual errors of the pixels when calculating the loss function. Training the autoencoder on facial images and validating its performance on manually labeled face and hand videos, our Deep Feature Encodings (DFE) method demonstrated superior tracking accuracy with a mean error ranging from 0.6 to 3.3 pixels, outperforming traditional methods such as SIFT, SURF, and Lucas-Kanade, as well as the latest transformers such as PIPs++ and CoTracker. Overall, our unsupervised learning approach excels in tracking various skin features under significant motion conditions, providing superior feature descriptors for tracking, matching, and image registration compared to both traditional and state-of-the-art supervised learning methods.
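The Gaussian weighting of pixel residuals can be written directly into the reconstruction loss, down-weighting errors near the crop border. A minimal sketch (tensor shapes and the width parameter are assumptions):

```python
import torch

def gaussian_weighted_mse(x, x_hat, sigma_frac=0.5):
    """x, x_hat: (B, C, H, W) crops. Down-weight residuals near the crop
    edges so the loss, and hence tracking, depends less on crop size."""
    B, C, H, W = x.shape
    ys = torch.linspace(-1, 1, H, device=x.device)
    xs = torch.linspace(-1, 1, W, device=x.device)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    w = torch.exp(-(yy**2 + xx**2) / (2 * sigma_frac**2))  # peaks at center
    return ((x - x_hat) ** 2 * w).mean()
```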
https://arxiv.org/abs/2405.04943
Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between each word in the description and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
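The word-level noise identification can be sketched as a similarity test between each word embedding and the image's patch embeddings: words whose best match is weak receive a higher masking probability in the next epoch. Shapes, the max-over-patches rule, and the probability mapping below are assumptions:

```python
import torch

def word_mask_probs(word_emb, patch_emb, base_p=0.15, extra_p=0.5):
    """word_emb: (L, D) text token embeddings; patch_emb: (N, D) image
    patch token embeddings, both assumed L2-normalized. Words whose
    best-matching patch is dissimilar get a higher masking probability."""
    sim = word_emb @ patch_emb.T            # (L, N) cosine similarities
    best = sim.max(dim=1).values            # each word's best patch match
    mismatch = 1.0 - (best - best.min()) / (best.max() - best.min() + 1e-6)
    return (base_p + extra_p * mismatch).clamp(max=1.0)  # per-word mask prob
```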
https://arxiv.org/abs/2405.04940
The trustworthiness of AI applications has been the subject of recent research and is also addressed in the EU's recently adopted AI Regulation. The currently emerging foundation models in the field of text, speech and image processing offer completely new possibilities for developing AI applications. This white paper shows how the trustworthiness of an AI application developed with foundation models can be evaluated and ensured. For this purpose, the application-specific, risk-based approach for testing and ensuring the trustworthiness of AI applications, as developed in the 'AI Assessment Catalog - Guideline for Trustworthy Artificial Intelligence' by Fraunhofer IAIS, is transferred to the context of foundation models. Special consideration is given to the fact that specific risks of foundation models can have an impact on the AI application and must also be taken into account when checking trustworthiness. Chapter 1 of the white paper explains the fundamental relationship between foundation models and AI applications based on them in terms of trustworthiness. Chapter 2 provides an introduction to the technical construction of foundation models, and Chapter 3 shows how AI applications can be developed based on them. Chapter 4 provides an overview of the resulting risks regarding trustworthiness. Chapter 5 shows which requirements for AI applications and foundation models are to be expected according to the draft of the European Union's AI Regulation, and Chapter 6 finally shows the system and procedure for meeting trustworthiness requirements.
https://arxiv.org/abs/2405.04937
Learning latent transition costs on graphs from trajectory demonstrations under various contextual features is challenging but useful for path planning. Yet, existing methods either oversimplify cost assumptions or scale poorly with the number of observed trajectories. This paper introduces DataSP, a differentiable all-to-all shortest path algorithm that facilitates learning latent costs from trajectories. It allows learning from a large number of trajectories in each learning step without additional computation. Complex latent cost functions of contextual features can be represented in the algorithm through a neural network approximation. We further propose a method to sample paths from DataSP in order to reconstruct/mimic the distribution of observed paths. We prove that the inferred distribution follows the maximum entropy principle. We show that DataSP outperforms state-of-the-art differentiable combinatorial solvers and classical machine learning approaches in predicting paths on graphs.
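An all-to-all shortest-path computation becomes differentiable once the hard min in Floyd-Warshall is replaced by a temperature-controlled softmin, which is one way to realize what DataSP describes (a sketch of the idea, not the authors' exact algorithm):

```python
import torch

def soft_floyd_warshall(costs: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """costs: (V, V) non-negative edge costs (inf where no edge). Returns a
    smoothed all-to-all shortest-path matrix; the softmin keeps gradients
    flowing back to (e.g., neural-network-predicted) latent edge costs."""
    D = costs.clone()
    for k in range(D.shape[0]):
        relax = D[:, k:k+1] + D[k:k+1, :]                  # paths through k
        D = -tau * torch.logsumexp(
            torch.stack([-D / tau, -relax / tau]), dim=0)  # soft elementwise min
    return D
```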
https://arxiv.org/abs/2405.04923
Few-shot class-incremental learning (FSCIL) aims to acquire knowledge from novel classes with limited samples while retaining information about base classes. Existing methods address catastrophic forgetting and overfitting by freezing the feature extractor during novel-class learning. However, these methods tend to cause confusion between base and novel classes, i.e., novel-class samples are classified into base classes. In this paper, we delve into this phenomenon to study its cause and solution. We first interpret the confusion as a collision between the novel-class and base-class regions in the feature space. Then, we find that the collision is caused by label-irrelevant redundancies within the base-class feature and pixel space. Through qualitative and quantitative experiments, we identify this redundancy as the shortcut in base-class training, which can be decoupled to alleviate the collision. Based on this analysis, to alleviate the collision between base and novel classes, we propose a method for FSCIL named Redundancy Decoupling and Integration (RDI). RDI first decouples redundancies from the base-class space to shrink the intra-base-class feature space. Then, it integrates the redundancies as a dummy class to enlarge the inter-base-class feature space. This process effectively compresses the base-class feature space, creating buffer space for novel classes and alleviating the model's confusion between base and novel classes. Extensive experiments on benchmark datasets, including CIFAR-100, miniImageNet, and CUB-200-2011, demonstrate that our method achieves state-of-the-art performance.
https://arxiv.org/abs/2405.04918
Weakly supervised semantic segmentation (WSSS) aims at learning a semantic segmentation model with only image-level tags. Despite intensive research on deep learning approaches over a decade, there is still a significant performance gap between WSSS and fully supervised semantic segmentation. Most current WSSS methods focus on limited single-image (pixel-wise) information while ignoring valuable inter-image (semantic-wise) information. From this perspective, a novel end-to-end WSSS framework called DSCNet is developed along with two innovations: i) pixel-wise group contrast and semantic-wise graph contrast are proposed and introduced into the WSSS framework; ii) a novel dual-stream contrastive learning (DSCL) mechanism is designed to jointly handle pixel-wise and semantic-wise context information for better WSSS performance. Specifically, the pixel-wise group contrast learning (PGCL) and semantic-wise graph contrast learning (SGCL) tasks form a more comprehensive solution. Extensive experiments on the PASCAL VOC and MS COCO benchmarks verify the superiority of DSCNet over SOTA approaches and baseline models.
https://arxiv.org/abs/2405.04913
Predicting the future trajectories of dynamic traffic actors is a cornerstone task in autonomous driving. Though existing notable efforts have resulted in impressive performance improvements, a gap persists in scene cognition and understanding of complex traffic semantics. This paper proposes Traj-LLM, the first work to investigate the potential of using Large Language Models (LLMs), without explicit prompt engineering, to generate future motion from agents' past/observed trajectories and scene semantics. Traj-LLM starts with sparse context joint coding to dissect agent and scene features into a form that LLMs understand. On this basis, we innovatively explore LLMs' powerful comprehension abilities to capture a spectrum of high-level scene knowledge and interactive information. To emulate the human-like lane-focus cognitive function and enhance Traj-LLM's scene comprehension, we introduce lane-aware probabilistic learning powered by the pioneering Mamba module. Finally, a multi-modal Laplace decoder is designed to achieve scene-compliant multi-modal predictions. Extensive experiments demonstrate that Traj-LLM, fortified by LLMs' strong prior knowledge and understanding prowess, together with lane-aware probabilistic learning, outstrips state-of-the-art methods across evaluation metrics. Moreover, a few-shot analysis further substantiates Traj-LLM's performance: with just 50% of the dataset, it outperforms the majority of benchmarks that rely on complete data utilization. This study explores equipping the trajectory prediction task with the advanced capabilities inherent in LLMs, furnishing a more universal and adaptable solution for forecasting agent motion in a new way.
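Multi-modal Laplace decoders in trajectory prediction are commonly trained with a winner-takes-all Laplace negative log-likelihood plus a mode-classification term. A minimal sketch under that common formulation (shapes assumed; not necessarily Traj-LLM's exact loss):

```python
import torch
import torch.nn.functional as F

def laplace_wta_loss(mu, b, pi_logits, y):
    """mu, b: (B, K, T, 2) per-mode Laplace location/scale over T future
    steps; pi_logits: (B, K) mode scores; y: (B, T, 2) ground truth.
    Winner-takes-all: regress only the best-matching mode and train the
    mode classifier to pick it."""
    nll = (torch.abs(y.unsqueeze(1) - mu) / b + torch.log(2 * b)).sum((-1, -2))
    best = nll.argmin(dim=1)                      # (B,) closest mode per sample
    reg = nll.gather(1, best[:, None]).mean()     # Laplace NLL of the best mode
    cls = F.cross_entropy(pi_logits, best)        # mode-selection term
    return reg + cls
```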
https://arxiv.org/abs/2405.04909
Emotion recognition is an important part of affective computing. Extracting emotional cues from human gaits yields benefits such as natural interaction, a nonintrusive nature, and remote detection. Recently, the introduction of self-supervised learning techniques offers a practical solution to the issues arising from the scarcity of labeled data in the field of gait-based emotion recognition. However, due to the limited diversity of gaits and the incompleteness of feature representations for skeletons, the existing contrastive learning methods are usually inefficient for the acquisition of gait emotions. In this paper, we propose a contrastive learning framework utilizing selective strong augmentation (SSA) for self-supervised gait-based emotion representation, which aims to derive effective representations from limited labeled gait data. First, we propose an SSA method for the gait emotion recognition task, which includes upper body jitter and random spatiotemporal mask. The goal of SSA is to generate more diverse and targeted positive samples and prompt the model to learn more distinctive and robust feature representations. Then, we design a complementary feature fusion network (CFFN) that facilitates the integration of cross-domain information to acquire topological structural and global adaptive features. Finally, we implement the distributional divergence minimization loss to supervise the representation learning of the generally and strongly augmented queries. Our approach is validated on the Emotion-Gait (E-Gait) and Emilya datasets and outperforms the state-of-the-art methods under different evaluation protocols.
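The two SSA operations named in the abstract, upper-body jitter and a random spatiotemporal mask, can be sketched on a (C, T, V) skeleton tensor; the joint indices and ratios below are assumptions that depend on the skeleton layout:

```python
import torch

UPPER_BODY = [2, 3, 4, 5, 6, 7, 8, 9]  # upper-body joint indices: an
                                        # assumption for this skeleton layout

def selective_strong_augment(x, jitter_std=0.05, mask_ratio=0.1):
    """x: (C, T, V) skeleton sequence (channels, frames, joints).
    Applies upper-body jitter plus a random spatiotemporal mask, mirroring
    the two SSA operations described in the abstract."""
    x = x.clone()
    x[:, :, UPPER_BODY] += jitter_std * torch.randn_like(x[:, :, UPPER_BODY])
    C, T, V = x.shape
    t0 = torch.randint(0, T, (1,)).item()
    t1 = min(T, t0 + int(mask_ratio * T))
    joints = torch.randperm(V)[: max(1, int(mask_ratio * V))]
    x[:, t0:t1, joints] = 0.0            # zero a random time-joint patch
    return x
```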
https://arxiv.org/abs/2405.04900
For autonomous robotics applications, it is crucial that robots are able to accurately measure their state and perceive their environment, including other agents within it (e.g., cobots interacting with humans). The redundancy of these measurements is important, as it allows for planning and execution of recovery protocols in the event of sensor failure or external disturbances. Visual estimation can provide this redundancy through the use of low-cost sensors and serve as a standalone source of proprioception when no encoder-based sensing is available. Therefore, we estimate the configuration of the robot jointly with its pose, which provides a complete spatial understanding of the observed robot. We present GISR, a method for deep configuration and robot-to-camera pose estimation that prioritizes real-time execution. GISR comprises two modules: (i) a geometric initialization module that efficiently computes an approximate robot pose and configuration, and (ii) an iterative silhouette-based refinement module that refines the initial solution in only a few iterations. We evaluate our method on a publicly available dataset and show that GISR performs competitively with existing state-of-the-art approaches, while being significantly faster than existing methods of the same class. Our code is available at this https URL.
https://arxiv.org/abs/2405.04890
Unified multi-modal representation spaces are the foundation of multimodal understanding and generation. However, billions of model parameters and catastrophic forgetting problems make it challenging to further enhance pre-trained unified spaces. In this work, we propose Molecule-Space, an idea that treats multimodal representation spaces as "molecules" and augments a pre-trained unified space by integrating knowledge from extra expert spaces via "molecule space reactions". Specifically, we introduce two kinds of basic space reactions: 1) Space Displacement Reaction and 2) Space Combination Reaction. Based on these basic reactions, we design Complex Sequential & Parallel Reactions to effectively integrate multiple spaces simultaneously. Benefiting from the modularization concept, we further propose a coarse-to-fine customized inference strategy to flexibly adjust the enhanced unified space for different purposes. Experimentally, we fuse the audio-image-text space of ImageBind with image-text and audio-text expert spaces. The resulting space outperforms ImageBind on 5 downstream tasks across 9 datasets. Moreover, via customized inference, it even surpasses the image-text and audio-text expert spaces used.
https://arxiv.org/abs/2405.04883
With the proliferation of Audio Language Model (ALM) based deepfake audio, there is an urgent need for effective detection methods. Unlike traditional deepfake audio generation, which often involves multi-step processes culminating in vocoder usage, ALMs directly utilize neural codec methods to decode discrete codes into audio. Moreover, driven by large-scale data, ALMs exhibit remarkable robustness and versatility, posing a significant challenge to current audio deepfake detection (ADD) models. To effectively detect ALM-based deepfake audio, we focus on the mechanism of the ALM-based audio generation method: the conversion from a neural codec to a waveform. We first construct the Codecfake dataset, an open-source large-scale dataset including two languages, millions of audio samples, and various test conditions, tailored for ALM-based audio detection. Additionally, to achieve universal detection of deepfake audio and tackle the domain ascent bias issue of the original sharpness-aware minimization (SAM), we propose the CSAM strategy to learn domain-balanced and generalized minima. Experimental results demonstrate that co-training on the Codecfake dataset and a vocoded dataset with the CSAM strategy yields the lowest average Equal Error Rate (EER) of 0.616% across all test conditions compared to baseline models.
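CSAM builds on sharpness-aware minimization; a generic SAM update step looks as follows (a sketch only; the co-training ingredient is that batches mix codec-decoded and vocoded data, and CSAM's domain-balancing details are not reproduced here):

```python
import torch

def sam_step(model, loss_fn, batch, opt, rho=0.05):
    """One sharpness-aware minimization (SAM) step: perturb the weights
    toward the local worst case, then descend using the gradient there."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss_fn(model, batch).backward()
    grads = [p.grad.detach().clone() if p.grad is not None
             else torch.zeros_like(p) for p in params]
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    with torch.no_grad():                      # ascend to a nearby worst case
        for p, g in zip(params, grads):
            p.add_(rho * g / (norm + 1e-12))
    opt.zero_grad()
    loss_fn(model, batch).backward()           # gradient at the perturbed point
    with torch.no_grad():                      # undo the perturbation
        for p, g in zip(params, grads):
            p.sub_(rho * g / (norm + 1e-12))
    opt.step()
    opt.zero_grad()
```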
https://arxiv.org/abs/2405.04880
Split Federated Learning (SFL) is a distributed machine learning framework which strategically divides the learning process between a server and clients and collaboratively trains a shared model by aggregating local models updated based on data from distributed clients. However, data heterogeneity and partial client participation result in label distribution skew, which severely degrades the learning performance. To address this issue, we propose SFL with Concatenated Activations and Logit Adjustments (SCALA). Specifically, the activations from the client-side models are concatenated as the input of the server-side model so as to centrally adjust label distribution across different clients, and logit adjustments of loss functions on both server-side and client-side models are performed to deal with the label distribution variation across different subsets of participating clients. Theoretical analysis and experimental results verify the superiority of the proposed SCALA on public datasets.
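Both SCALA ingredients admit a compact sketch: client-side activations are concatenated before the server-side forward pass, and the standard logit-adjustment correction (adding tau * log of the label prior to the logits) is applied in the loss. All names, and the exact adjustment form used on the client side, are assumptions:

```python
import torch
import torch.nn.functional as F

def scala_server_loss(server_model, activations, labels, label_counts, tau=1.0):
    """activations: list of (B_i, D) client-side outputs; labels: matching
    list of (B_i,) label tensors; label_counts: (K,) global label counts.
    Concatenation recentralizes the skewed label distribution; logit
    adjustment counters the remaining skew."""
    h = torch.cat(activations, dim=0)          # server-side model input
    y = torch.cat(labels, dim=0)
    logits = server_model(h)
    prior = label_counts.float() / label_counts.sum()
    adjusted = logits + tau * torch.log(prior + 1e-12)  # logit adjustment
    return F.cross_entropy(adjusted, y)
```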
https://arxiv.org/abs/2405.04875
Prompt-based methods have gained increasing attention in NLP and shown effectiveness on many downstream tasks. Many works have focused on mining these methods' potential for knowledge extraction, but few explore their ability to perform logical reasoning. In this work, we focus on the effectiveness of prompt-based methods for first-order logical reasoning and find that the bottleneck lies in logical negation. Based on our analysis, logical negation tends to result in spurious correlations with negative answers, while propositions without logical negation correlate with positive answers. To solve the problem, we propose a simple but effective method, Negation Augmenting and Negation Debiasing (NAND), which introduces negative propositions to prompt-based methods without updating parameters. Specifically, these negative propositions can counteract spurious correlations by providing "not" for all instances, so that models cannot make decisions solely based on whether an expression contains a logical negation. Experiments on three datasets show that NAND not only solves the problem of calibrating logical negation but also significantly enhances prompt-based logical reasoning without model retraining.
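The core NAND idea, supplying a negated counterpart for every proposition so the mere presence of "not" stops being predictive, can be sketched as a scoring wrapper (score_fn, the negation template, and the combination rule are all hypothetical illustrations, not the paper's exact procedure):

```python
def nand_score(score_fn, proposition: str) -> float:
    """score_fn(text) -> model's truth score for a prompt (assumed helper).
    Pairing every proposition with its negation means both positive and
    negative instances contain a 'not', removing the spurious cue."""
    pos = score_fn(proposition)
    neg = score_fn(f"it is not the case that {proposition}")
    return pos - neg   # debiased score: evidence beyond the negation cue
```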
https://arxiv.org/abs/2405.04872
Rooted in the scarcity of most attributes, realistic pedestrian attribute datasets exhibit unduly skewed data distributions, which give rise to two types of model failure: (1) label imbalance: model predictions lean heavily towards the majority labels; (2) semantics imbalance: the model easily overfits the under-represented attributes due to their insufficient semantic diversity. To achieve perfect label balancing, we propose a novel framework that successfully decouples label-balanced data re-sampling from the curse of attribute co-occurrence, i.e., we equalize the sampling prior of an attribute without biasing that of the co-occurring others. To diversify attribute semantics and mitigate feature noise, we propose a Bayesian feature augmentation method to introduce true in-distribution novelty. Handling both imbalances jointly, our work achieves the best accuracy on various popular benchmarks, and, importantly, with a minimal computational budget.
https://arxiv.org/abs/2405.04858
Traffic prediction plays a crucial role in intelligent transportation systems. The rapid development of IoT devices allows us to collect different kinds of data highly correlated with traffic, fostering the development of efficient multi-modal traffic prediction models. Until now, few studies have focused on utilizing the advantages of multi-modal data for traffic prediction. In this paper, we introduce a novel temporal attentive cross-modality transformer model for long-term traffic prediction, namely xMTrans, with the capability of exploring the temporal correlations between the data of two modalities: one target modality (for prediction, e.g., traffic congestion) and one support modality (e.g., people flow). We conducted extensive experiments to evaluate our proposed model on traffic congestion and taxi demand prediction using real-world datasets. The results show the superiority of xMTrans over recent state-of-the-art methods on long-term traffic prediction. In addition, we also conducted a comprehensive ablation study to further analyze the effectiveness of each module in xMTrans.
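The cross-modality core of such a model is cross-attention in which target-modality tokens query support-modality tokens. A minimal sketch (dimensions and the residual fusion are assumptions, not xMTrans's exact design):

```python
import torch
import torch.nn as nn

class CrossModalityBlock(nn.Module):
    """Target-modality tokens (e.g., congestion) attend over support-
    modality tokens (e.g., people flow) to borrow correlated temporal cues."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target, support):
        # target: (B, T, d) prediction-side tokens; support: (B, S, d)
        fused, _ = self.attn(query=target, key=support, value=support)
        return self.norm(target + fused)       # residual fusion
```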
https://arxiv.org/abs/2405.04841