Existing Transformers for monocular 3D human shape and pose estimation typically have a quadratic computation and memory complexity with respect to the feature length, which hinders the exploitation of fine-grained information in high-resolution features that is beneficial for accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients: a decoupled attention operation and an SMPL-based target representation, which allow effective utilization of high-resolution features in the Transformer. In addition, based on these two designs, we also introduce several novel modules including a multi-scale attention and a joint-aware attention to further boost the reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods both quantitatively and qualitatively. Notably, the proposed algorithm achieves an MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by more than 10% with fewer than one-third of the parameters. Code and pretrained models are available at this https URL.
Existing Transformers for monocular 3D human shape and pose estimation typically have computation and memory complexity that is quadratic in the feature length, which hinders the exploitation of the fine-grained information in high-resolution features that benefits accurate reconstruction. In this work, we propose an SMPL-based Transformer framework (SMPLer) to address this issue. SMPLer incorporates two key ingredients, a decoupled attention operation and an SMPL-based target representation, which allow high-resolution features to be used effectively in the Transformer. Building on these two designs, we further introduce several novel modules, including a multi-scale attention and a joint-aware attention, to boost reconstruction performance. Extensive experiments demonstrate the effectiveness of SMPLer against existing 3D human shape and pose estimation methods, both quantitatively and qualitatively. Notably, the proposed algorithm achieves an MPJPE of 45.2 mm on the Human3.6M dataset, improving upon Mesh Graphormer by more than 10% with fewer than one-third of the parameters. Code and pretrained models are available at this https URL.
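The abstract above attributes the efficiency gain to querying high-resolution features with a compact, SMPL-based target representation. The sketch below illustrates only that general idea, not SMPLer's actual decoupled attention: a small, fixed set of hypothetical SMPL parameter tokens (24 joint tokens plus one shape token, with assumed dimensions and output heads) cross-attends to a long sequence of image features, so the cost grows linearly rather than quadratically in the feature length.

```python
import torch
import torch.nn as nn

class SMPLQueryCrossAttention(nn.Module):
    """Hypothetical sketch: a small set of SMPL parameter tokens (24 joint tokens
    plus one shape token) attends to a long sequence of high-resolution image
    features, so cost grows linearly with the feature length."""
    def __init__(self, dim=256, num_tokens=25, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_pose = nn.Linear(dim, 6)    # per-joint 6D rotation (assumed head)
        self.to_shape = nn.Linear(dim, 10)  # SMPL shape (beta) coefficients

    def forward(self, feats):               # feats: (B, N, dim), N = H*W can be large
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        tokens, _ = self.attn(q, feats, feats)          # (B, 25, dim)
        pose = self.to_pose(tokens[:, :-1])             # (B, 24, 6)
        shape = self.to_shape(tokens[:, -1])            # (B, 10)
        return pose, shape

feats = torch.randn(2, 64 * 64, 256)        # flattened high-resolution feature map
pose, shape = SMPLQueryCrossAttention()(feats)
print(pose.shape, shape.shape)              # torch.Size([2, 24, 6]) torch.Size([2, 10])
```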
https://arxiv.org/abs/2404.15276
Medical Vision-Language Pretraining (Med-VLP) establishes a connection between visual content from medical images and the relevant textual descriptions. Existing Med-VLP methods primarily focus on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend the scope of Med-VLP to encompass 3D images, specifically targeting full-body scenarios, by using a multimodal dataset of CT images and reports. Compared with the 2D counterpart, 3D VLP is required to effectively capture essential semantics from the significantly sparser representations of 3D imaging. In this paper, we introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we developed an abnormality dictionary to augment contrastive learning with diverse negative samples. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates it can identify organs and abnormalities in a zero-shot manner using natural language. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show our model's superior performance over the standard CLIP framework across zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.
Medical vision-language pretraining (Med-VLP) connects the visual content of medical images with the corresponding textual descriptions. Existing Med-VLP methods mainly focus on 2D images of a single body part, most notably chest X-rays. In this paper, we extend Med-VLP to 3D images, specifically to full-body scenarios, using a multimodal dataset of CT images and reports. Compared with its 2D counterpart, 3D VLP must capture essential semantics from the much sparser representations of 3D imaging. We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. In addition, we develop an abnormality dictionary to augment contrastive learning with diverse negative samples. Trained on a multimodal CT dataset of 44,011 organ-level vision-text pairs from 17,702 patients covering 104 organs, our method can identify organs and abnormalities in a zero-shot manner using natural language. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. Experimental results show that our model outperforms the standard CLIP framework in both zero-shot and fine-tuning settings, with both CNN and ViT architectures.
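CT-GLIP's organ-level image-text pairs feed a multimodal contrastive objective. As a point of reference, here is a minimal, generic CLIP-style symmetric InfoNCE loss over matched organ/text embeddings; the function name, shapes, and temperature are assumptions, and the paper's actual loss and its abnormality-dictionary negatives may differ.

```python
import torch
import torch.nn.functional as F

def organ_level_clip_loss(organ_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over matched (organ crop embedding, report sentence
    embedding) pairs -- a generic CLIP-style objective, not CT-GLIP's exact loss."""
    v = F.normalize(organ_feats, dim=-1)          # (B, D) grounded organ features
    t = F.normalize(text_feats, dim=-1)           # (B, D) diagnostic text features
    logits = v @ t.T / temperature                # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Extra negatives (e.g., from an abnormality dictionary) could be concatenated
    # to `t` before computing logits; omitted here for brevity.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = organ_level_clip_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```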
https://arxiv.org/abs/2404.15272
Recent advancements in instruction-following models have made user interactions with models more user-friendly and efficient, broadening their applicability. In graphic design, non-professional users often struggle to create visually appealing layouts due to limited skills and resources. In this work, we introduce a novel multimodal instruction-following framework for layout planning, allowing users to easily arrange visual elements into tailored layouts by specifying canvas size and design purpose, such as for book covers, posters, brochures, or menus. We developed three layout reasoning tasks to train the model in understanding and executing layout instructions. Experiments on two benchmarks show that our method not only simplifies the design process for non-professionals but also surpasses the performance of few-shot GPT-4V models, with mIoU higher by 12% on Crello. This progress highlights the potential of multimodal instruction-following models to automate and simplify the design process, providing an approachable solution for a wide range of design tasks on visually-rich documents.
Recent advances in instruction-following models have made user interactions with models more user-friendly and efficient, broadening their applicability. In graphic design, non-professional users often struggle to create visually appealing layouts because of limited skills and resources. In this work, we introduce a novel multimodal instruction-following framework for layout planning that lets users easily arrange visual elements into tailored layouts by specifying the canvas size and design purpose, such as book covers, posters, brochures, or menus. We develop three layout reasoning tasks to train the model to understand and execute layout instructions. Experiments on two benchmarks show that our method not only simplifies the design process for non-professionals but also surpasses few-shot GPT-4V models, with mIoU higher by 12% on Crello. This progress highlights the potential of multimodal instruction-following models to automate and simplify the design process, offering an approachable solution for a wide range of design tasks on visually rich documents.
https://arxiv.org/abs/2404.15271
We study interactive learning of language agents based on user edits made to the agent's output. In a typical setting such as writing assistants, the user interacts with a language agent to generate a response given a context, and may optionally edit the agent response to personalize it based on their latent preference, in addition to improving the correctness. The edit feedback is naturally generated, making it a suitable candidate for improving the agent's alignment with the user's preference, and for reducing the cost of user edits over time. We propose a learning framework, PRELUDE, that infers a description of the user's latent preference based on historic edit data and uses it to define a prompt policy that drives future response generation. This avoids fine-tuning the agent, which is costly, challenging to scale with the number of users, and may even degrade its performance on other tasks. Furthermore, learning descriptive preference improves interpretability, allowing the user to view and modify the learned preference. However, user preference can be complex and vary based on context, making it challenging to learn. To address this, we propose a simple yet effective algorithm named CIPHER that leverages a large language model (LLM) to infer the user preference for a given context based on user edits. In the future, CIPHER retrieves inferred preferences from the k-closest contexts in the history, and forms an aggregate preference for response generation. We introduce two interactive environments -- summarization and email writing, for evaluation using a GPT-4 simulated user. We compare with algorithms that directly retrieve user edits but do not learn descriptive preference, and algorithms that learn context-agnostic preference. On both tasks, CIPHER achieves the lowest edit distance cost and learns preferences that show significant similarity to the ground truth preferences.
We study interactive learning of language agents from the edits users make to the agent's output. In a typical setting such as a writing assistant, the user interacts with the agent to generate a response for a given context and may optionally edit that response, both to improve its correctness and to personalize it according to a latent preference. Because this edit feedback arises naturally, it is a suitable signal for aligning the agent with the user's preference and for reducing the cost of user edits over time. We propose PRELUDE, a learning framework that infers a description of the user's latent preference from historical edit data and uses it to define a prompt policy that drives future response generation. This avoids fine-tuning the agent, which is costly, hard to scale with the number of users, and may even degrade performance on other tasks. Learning a descriptive preference also improves interpretability, allowing the user to view and modify what has been learned. However, user preferences can be complex and context-dependent, which makes them challenging to learn. To address this, we propose a simple yet effective algorithm, CIPHER, that uses a large language model (LLM) to infer the user preference for a given context from the user's edits. For future interactions, CIPHER retrieves the inferred preferences of the k closest contexts in the history and forms an aggregate preference for response generation. We introduce two interactive environments, summarization and email writing, evaluated with a GPT-4-simulated user. We compare against algorithms that directly retrieve user edits without learning a descriptive preference, and against algorithms that learn a context-agnostic preference. On both tasks, CIPHER achieves the lowest edit-distance cost and learns preferences that closely match the ground-truth preferences.
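The retrieval step described above (infer a preference per edited context, then aggregate the preferences of the k closest contexts) can be sketched in a few lines. Everything below is hypothetical scaffolding, with the LLM summarization step stubbed out, rather than the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class EditRecord:
    context_embedding: list   # embedding of the context the user edited in
    preference: str           # LLM-inferred preference description for that context

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb + 1e-8)

def aggregate_preference(history, context_embedding, k=3):
    """Sketch of CIPHER-style retrieval: pick the k most similar past contexts and
    hand their inferred preferences to an LLM for aggregation (stubbed out here)."""
    ranked = sorted(history, key=lambda r: cosine(r.context_embedding, context_embedding),
                    reverse=True)[:k]
    bullet_list = "\n".join(f"- {r.preference}" for r in ranked)
    # In the real system an LLM would summarize these into one preference and the
    # result would be appended to the generation prompt; we just return the list.
    return f"Apply the following user preferences:\n{bullet_list}"

history = [EditRecord([1.0, 0.0], "prefers concise bullet points"),
           EditRecord([0.0, 1.0], "prefers a formal tone in emails")]
print(aggregate_preference(history, [0.9, 0.1], k=1))
```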
https://arxiv.org/abs/2404.15269
Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is by adapting the computation level to the specific needs of the input image rather than the current one-size-fits-all approach. To this end, we introduce ECO-M2F or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incurs high resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability for providing a balance between performance and computational efficiency, we present a three step recipe. The first step is to train the parent architecture to enable early exiting from the encoder. The second step is to create a derived dataset of the ideal number of encoder layers required for each training example. The third step is to use the aforementioned derived dataset to train a gating network that predicts the number of encoder layers to be used, conditioned on the input image. Additionally, to change the computational-accuracy tradeoff, only steps two and three need to be repeated which significantly reduces retraining time. Experiments on the public datasets show that the proposed approach reduces expected encoder computational cost while maintaining performance, adapts to various user compute resources, is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
Vision-transformer-based models bring significant improvements to image segmentation tasks. Although these architectures are powerful regardless of the specific segmentation task, their computational cost can be taxing on deployed devices. One way to overcome this challenge is to adapt the amount of computation to the needs of each input image rather than using a one-size-fits-all approach. To this end, we introduce ECO-M2F, or EffiCient TransfOrmer Encoders for Mask2Former-style models. Observing that the encoder module of M2F-style models incurs high, resource-intensive computation, ECO-M2F provides a strategy for self-selecting the number of hidden encoder layers, conditioned on the input image. To enable this self-selection and balance performance against computational efficiency, we present a three-step recipe. First, the parent architecture is trained to allow early exiting from the encoder. Second, a derived dataset is created that records the ideal number of encoder layers for each training example. Third, this derived dataset is used to train a gating network that predicts, conditioned on the input image, how many encoder layers to use. Moreover, to change the computation-accuracy trade-off, only steps two and three need to be repeated, which significantly reduces retraining time. Experiments on public datasets show that the proposed approach reduces the expected encoder computational cost while maintaining performance, adapts to various user compute budgets, is flexible in architecture configuration, and extends beyond segmentation to object detection.
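A minimal sketch of step three as described above, under assumed sizes and a made-up gating architecture: a small network looks at pooled encoder input features, predicts how many encoder layers to run, and the encoder exits early at that depth.

```python
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    """Sketch of the ECO-M2F idea (names and sizes are assumptions): a small gating
    network looks at the input tokens and predicts how many encoder layers to run;
    the encoder then exits early at that depth."""
    def __init__(self, dim=256, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers))
        self.gate = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, num_layers))

    def forward(self, tokens):                       # tokens: (B, N, dim)
        depth = self.gate(tokens.mean(dim=1)).argmax(dim=-1).max().item() + 1
        for layer in self.layers[:depth]:            # early exit after `depth` layers
            tokens = layer(tokens)
        return tokens, depth

out, depth = GatedEncoder()(torch.randn(2, 100, 256))
print(out.shape, "layers used:", depth)
```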
https://arxiv.org/abs/2404.15244
Human decision-making often relies on visual information from multiple perspectives or views. In contrast, machine learning-based object recognition utilizes information from a single image of the object. However, the information conveyed by a single image may not be sufficient for accurate decision-making, particularly in complex recognition problems. The utilization of multi-view 3D representations for object recognition has thus far demonstrated the most promising results for achieving state-of-the-art performance. This review paper comprehensively covers recent progress in multi-view 3D object recognition methods for 3D classification and retrieval tasks. Specifically, we focus on deep learning-based and transformer-based techniques, as they are widely utilized and have achieved state-of-the-art performance. We provide detailed information about existing deep learning-based and transformer-based multi-view 3D object recognition models, including the most commonly used 3D datasets, camera configurations and number of views, view selection strategies, pre-trained CNN architectures, fusion strategies, and recognition performance on 3D classification and 3D retrieval tasks. Additionally, we examine various computer vision applications that use multi-view classification. Finally, we highlight key findings and future directions for developing multi-view 3D object recognition methods to provide readers with a comprehensive understanding of the field.
Human decision-making often relies on visual information seen from multiple perspectives or views, whereas machine-learning-based object recognition typically uses the information from a single image of the object. The information conveyed by a single image, however, may not suffice for accurate decisions, particularly in complex recognition problems. Using multi-view 3D representations for object recognition has so far shown the most promising results for reaching state-of-the-art performance. This review comprehensively covers recent progress in multi-view 3D object recognition methods for 3D classification and retrieval tasks. Specifically, we focus on deep-learning-based and transformer-based techniques, as they are widely used and achieve state-of-the-art results. We provide detailed information on existing deep-learning-based and transformer-based multi-view 3D object recognition models, including the most commonly used 3D datasets, camera configurations and numbers of views, view-selection strategies, pre-trained CNN architectures, fusion strategies, and recognition performance on 3D classification and 3D retrieval tasks. We also examine various computer-vision applications that use multi-view classification. Finally, we highlight key findings and future directions for developing multi-view 3D object recognition methods, giving readers a comprehensive understanding of the field.
https://arxiv.org/abs/2404.15224
Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs: e.g. a dialogue state and the system actions taken at each step. These annotations can be costly to produce, error-prone, and require both domain and annotation expertise. With advances in LLMs, we hypothesize unlabelled data and a schema definition are sufficient for building a working task-oriented dialogue system, completely unsupervised. Using only (1) a well-defined API schema (2) a set of unlabelled dialogues between a user and agent, we develop a novel approach for inferring turn-level annotations as latent variables using a noisy channel model. We iteratively improve these pseudo-labels with expectation-maximization (EM), and use the inferred labels to train an end-to-end dialogue agent. Evaluating our approach on the MultiWOZ benchmark, our method more than doubles the dialogue success rate of a strong GPT-3.5 baseline.
Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs, e.g., a dialogue state and the system actions taken at each step. Such annotations can be costly to produce, error-prone, and demand both domain and annotation expertise. With advances in LLMs, we hypothesize that unlabelled data and a schema definition are sufficient to build a working task-oriented dialogue system, completely unsupervised. Using only (1) a well-defined API schema and (2) a set of unlabelled dialogues between a user and an agent, we develop a novel approach that infers turn-level annotations as latent variables using a noisy channel model. We iteratively improve these pseudo-labels with expectation-maximization (EM) and use the inferred labels to train an end-to-end dialogue agent. Evaluated on the MultiWOZ benchmark, our method more than doubles the dialogue success rate of a strong GPT-3.5 baseline.
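A toy rendering of the EM-style pseudo-labeling loop described above, with the LLM proposal and noisy-channel scoring replaced by placeholder functions so the control flow is visible; the real method's scoring and retraining steps are of course model-based.

```python
import random

def propose_labels(dialogue, schema):
    """Placeholder for the proposal step: guess a slot annotation for each turn."""
    return [{"slot": random.choice(schema["slots"]), "value": turn} for turn in dialogue]

def channel_score(dialogue, labels):
    """Placeholder noisy-channel score p(dialogue | labels) * p(labels); a real system
    would score with a language model instead of this toy heuristic."""
    return -abs(len(dialogue) - len(labels)) + random.random()

def em_pseudo_label(dialogues, schema, iterations=3, candidates=4):
    """E-step: keep the best-scoring candidate annotation per dialogue.
    M-step (stubbed): retrain the agent on the current pseudo-labels."""
    pseudo = {}
    for _ in range(iterations):
        for i, d in enumerate(dialogues):                       # E-step
            cands = [propose_labels(d, schema) for _ in range(candidates)]
            pseudo[i] = max(cands, key=lambda c: channel_score(d, c))
        # M-step would fine-tune the end-to-end agent on `pseudo` here.
    return pseudo

dialogues = [["book a table for two", "sure, which day?"]]
print(em_pseudo_label(dialogues, {"slots": ["restaurant-people", "restaurant-day"]}))
```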
https://arxiv.org/abs/2404.15219
The rapid advancement of large-scale vision-language models has showcased remarkable capabilities across various tasks. However, the lack of extensive and high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Based on the constructed dataset, we developed MedDr, a generalist foundation model for healthcare capable of handling diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, during inference, we propose a simple but effective retrieval-augmented medical diagnosis strategy, which enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.
The rapid advancement of large-scale vision-language models has shown remarkable capability across a variety of tasks. However, the lack of extensive, high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Based on the constructed dataset, we develop MedDr, a generalist foundation model for healthcare that can handle diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, at inference time we propose a simple but effective retrieval-augmented medical diagnosis strategy that enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.
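The abstract does not spell out the retrieval-augmented diagnosis step, so the following is only a plausible sketch under assumptions: embed the query image, retrieve the k most similar cases from a labeled bank, and pass their diagnoses to the generalist model as extra context.

```python
import torch
import torch.nn.functional as F

def retrieval_augmented_diagnosis(query_emb, bank_embs, bank_labels, k=3):
    """Assumed sketch of a retrieval-augmented diagnosis step (the exact MedDr recipe
    is not specified in the abstract): find the k most similar cases in a labeled bank
    and return their diagnoses as extra context for the generalist model's prompt."""
    sims = F.cosine_similarity(query_emb.unsqueeze(0), bank_embs)   # (N,)
    topk = sims.topk(k).indices.tolist()
    return [bank_labels[i] for i in topk]

bank = torch.randn(100, 512)                       # embeddings of previously seen cases
labels = [f"case-{i}: finding" for i in range(100)]
print(retrieval_augmented_diagnosis(torch.randn(512), bank, labels))
```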
https://arxiv.org/abs/2404.15127
We present a novel character control framework that effectively utilizes motion diffusion probabilistic models to generate high-quality and diverse character animations, responding in real-time to a variety of dynamic user-supplied control signals. At the heart of our method lies a transformer-based Conditional Autoregressive Motion Diffusion Model (CAMDM), which takes as input the character's historical motion and can generate a range of diverse potential future motions conditioned on high-level, coarse user control. To meet the demands for diversity, controllability, and computational efficiency required by a real-time controller, we incorporate several key algorithmic designs. These include separate condition tokenization, classifier-free guidance on past motion, and heuristic future trajectory extension, all designed to address the challenges associated with taming motion diffusion probabilistic models for character control. As a result, our work represents the first model that enables real-time generation of high-quality, diverse character animations based on user interactive control, supporting animating the character in multiple styles with a single unified model. We evaluate our method on a diverse set of locomotion skills, demonstrating the merits of our method over existing character controllers. Project page and source codes: this https URL
We present a novel character-control framework that effectively uses motion diffusion probabilistic models to generate high-quality, diverse character animations, responding in real time to a variety of dynamic control signals supplied by the user. At the heart of our method is a transformer-based Conditional Autoregressive Motion Diffusion Model (CAMDM), which takes the character's historical motion as input and can generate a range of diverse potential future motions conditioned on high-level, coarse user control. To meet the demands for diversity, controllability, and computational efficiency required by a real-time controller, we incorporate several key algorithmic designs: separate condition tokenization, classifier-free guidance on past motion, and heuristic future-trajectory extension, all designed to address the challenges of taming motion diffusion probabilistic models for character control. As a result, our work is the first model that enables real-time generation of high-quality, diverse character animations from interactive user control, supporting animation of the character in multiple styles with a single unified model. We evaluate our method on a diverse set of locomotion skills, demonstrating its merits over existing character controllers. Project page and source code: this https URL
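Classifier-free guidance on past motion presumably follows the standard CFG recipe, with the historical motion as the condition that gets dropped. The snippet below shows that standard combination with a toy denoiser; the argument names and guidance scale are assumptions, not CAMDM's implementation.

```python
import torch

def cfg_on_past_motion(model, x_t, t, past_motion, control, guidance_scale=2.5):
    """Generic classifier-free guidance where the dropped condition is the character's
    historical motion: run the denoiser with and without the past-motion condition and
    extrapolate toward the conditional prediction."""
    cond = model(x_t, t, past=past_motion, ctrl=control)
    uncond = model(x_t, t, past=None, ctrl=control)        # past motion masked out
    return uncond + guidance_scale * (cond - uncond)

def toy_denoiser(x_t, t, past=None, ctrl=None):
    """Stand-in denoiser so the sketch runs end to end."""
    shift = 0.0 if past is None else past.mean()
    return x_t * 0.9 + shift

x = torch.randn(1, 60, 75)                                  # (batch, frames, pose dims)
out = cfg_on_past_motion(toy_denoiser, x, t=10,
                         past_motion=torch.randn(1, 10, 75), control=None)
print(out.shape)                                            # torch.Size([1, 60, 75])
```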
https://arxiv.org/abs/2404.15121
Detecting social bots has evolved into a pivotal yet intricate task, aimed at combating the dissemination of misinformation and preserving the authenticity of online interactions. While earlier graph-based approaches, which leverage topological structure of social networks, yielded notable outcomes, they overlooked the inherent dynamicity of social networks -- In reality, they largely depicted the social network as a static graph and solely relied on its most recent state. Due to the absence of dynamicity modeling, such approaches are vulnerable to evasion, particularly when advanced social bots interact with other users to camouflage identities and escape detection. To tackle these challenges, we propose BotDGT, a novel framework that not only considers the topological structure, but also effectively incorporates dynamic nature of social network. Specifically, we characterize a social network as a dynamic graph. A structural module is employed to acquire topological information from each historical snapshot. Additionally, a temporal module is proposed to integrate historical context and model the evolving behavior patterns exhibited by social bots and legitimate users. Experimental results demonstrate the superiority of BotDGT against the leading methods that neglected the dynamic nature of social networks in terms of accuracy, recall, and F1-score.
Detecting social bots has become a pivotal yet intricate task, aimed at combating the spread of misinformation and preserving the authenticity of online interactions. Earlier graph-based approaches, which exploit the topological structure of social networks, achieved notable results but overlooked the inherent dynamics of social networks: they largely depicted the social network as a static graph and relied solely on its most recent state. Without dynamicity modeling, such approaches are vulnerable to evasion, particularly when advanced social bots interact with other users to camouflage their identities and escape detection. To tackle these challenges, we propose BotDGT, a novel framework that considers not only the topological structure but also the dynamic nature of social networks. Specifically, we model a social network as a dynamic graph. A structural module acquires topological information from each historical snapshot, and a temporal module integrates the historical context and models the evolving behavior patterns of social bots and legitimate users. Experimental results show that BotDGT outperforms leading methods that neglect the dynamic nature of social networks in terms of accuracy, recall, and F1-score.
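The structural-plus-temporal decomposition described above can be illustrated with simple stand-ins: a mean-aggregation layer in place of the paper's structural module and a GRU in place of its temporal module. This is a schematic sketch, not BotDGT's architecture.

```python
import torch
import torch.nn as nn

class DynamicGraphBotDetector(nn.Module):
    """Sketch of the recipe in the abstract: a structural module (mean-aggregation GNN
    stand-in) encodes each historical snapshot, and a temporal module (GRU stand-in)
    integrates the sequence of snapshot embeddings per node."""
    def __init__(self, in_dim=32, hid=64):
        super().__init__()
        self.structural = nn.Linear(2 * in_dim, hid)   # node feature + neighbor mean
        self.temporal = nn.GRU(hid, hid, batch_first=True)
        self.classifier = nn.Linear(hid, 2)            # bot vs. legitimate user

    def forward(self, snapshots):
        # snapshots: list of (node_feats (N, in_dim), adj (N, N)) per time step
        per_step = []
        for x, adj in snapshots:
            neigh = adj @ x / adj.sum(dim=1, keepdim=True).clamp(min=1)
            per_step.append(torch.relu(self.structural(torch.cat([x, neigh], dim=-1))))
        seq = torch.stack(per_step, dim=1)              # (N, T, hid)
        out, _ = self.temporal(seq)
        return self.classifier(out[:, -1])              # per-node logits at final step

snaps = [(torch.randn(5, 32), torch.ones(5, 5)) for _ in range(3)]
print(DynamicGraphBotDetector()(snaps).shape)           # torch.Size([5, 2])
```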
https://arxiv.org/abs/2404.15070
Multimodal medical imaging plays a pivotal role in clinical diagnosis and research, as it combines information from various imaging modalities to provide a more comprehensive understanding of the underlying pathology. Recently, deep learning-based multimodal fusion techniques have emerged as powerful tools for improving medical image classification. This review offers a thorough analysis of the developments in deep learning-based multimodal fusion for medical classification tasks. We explore the complementary relationships among prevalent clinical modalities and outline three main fusion schemes for multimodal classification networks: input fusion, intermediate fusion (encompassing single-level fusion, hierarchical fusion, and attention-based fusion), and output fusion. By evaluating the performance of these fusion techniques, we provide insight into the suitability of different network architectures for various multimodal fusion scenarios and application domains. Furthermore, we delve into challenges related to network architecture selection, handling incomplete multimodal data management, and the potential limitations of multimodal fusion. Finally, we spotlight the promising future of Transformer-based multimodal fusion techniques and give recommendations for future research in this rapidly evolving field.
Multimodal medical imaging plays a pivotal role in clinical diagnosis and research, as it combines information from multiple imaging modalities to give a more comprehensive picture of the underlying pathology. Recently, deep-learning-based multimodal fusion techniques have emerged as powerful tools for improving medical image classification. This review offers a thorough analysis of the developments in deep-learning-based multimodal fusion for medical classification tasks. We explore the complementary relationships among prevalent clinical modalities and outline three main fusion schemes for multimodal classification networks: input fusion, intermediate fusion (encompassing single-level, hierarchical, and attention-based fusion), and output fusion. By evaluating the performance of these fusion techniques, we provide insight into which network architectures suit which multimodal fusion scenarios and application domains. We also discuss the challenges of network architecture selection, handling incomplete multimodal data, and the potential limitations of multimodal fusion. Finally, we highlight the promising future of Transformer-based multimodal fusion techniques and give recommendations for future research in this rapidly evolving field.
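For readers new to the taxonomy, the three fusion schemes named above differ mainly in where the modalities are combined. The toy code below makes that placement explicit with linear stand-ins for the encoders; real systems use full CNN or Transformer networks per imaging modality.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the three fusion schemes named in the review (illustration only).
enc_a, enc_b = nn.Linear(16, 32), nn.Linear(16, 32)   # modality-specific encoders
enc_joint = nn.Linear(32, 32)                         # encoder over concatenated inputs
head = nn.Linear(32, 4)                               # shared classification head

def input_fusion(xa, xb):          # fuse the raw inputs, then encode once
    return head(F.relu(enc_joint(torch.cat([xa, xb], dim=-1))))

def intermediate_fusion(xa, xb):   # fuse learned per-modality features (sum here)
    return head(F.relu(enc_a(xa)) + F.relu(enc_b(xb)))

def output_fusion(xa, xb):         # average the per-modality decisions
    return (head(F.relu(enc_a(xa))) + head(F.relu(enc_b(xb)))) / 2

xa, xb = torch.randn(2, 16), torch.randn(2, 16)
for f in (input_fusion, intermediate_fusion, output_fusion):
    print(f.__name__, f(xa, xb).shape)
```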
https://arxiv.org/abs/2404.15022
Salient object detection (SOD) aims at finding the most salient objects in images and outputs pixel-level binary masks. Transformer-based methods achieve promising performance due to their global semantic understanding, crucial for identifying salient objects. However, these models tend to be large and require numerous training parameters. To better harness the potential of transformers for SOD, we propose a novel parameter-efficient fine-tuning method aimed at reducing the number of training parameters while enhancing the salient object detection capability. Our model, termed EXternal Prompt features Enhanced adapteR Tuning (ExPert), features an encoder-decoder structure with adapters and injectors interspersed between the layers of a frozen transformer encoder. The adapter modules adapt the pre-trained backbone to SOD while the injector modules incorporate external prompt features to enhance the awareness of salient objects. Comprehensive experiments demonstrate the superiority of our method. Surpassing former state-of-the-art (SOTA) models across five SOD datasets, ExPert achieves 0.215 mean absolute error (MAE) in ECSSD dataset with 80.2M trained parameters, 21% better than transformer-based SOTA model and 47% better than CNN-based SOTA model.
Salient object detection (SOD) aims to find the most salient objects in an image and output pixel-level binary masks. Transformer-based methods achieve promising performance thanks to their global semantic understanding, which is crucial for identifying salient objects. However, these models tend to be large and require many training parameters. To better harness the potential of transformers for SOD, we propose a novel parameter-efficient fine-tuning method that reduces the number of trained parameters while enhancing salient object detection. Our model, termed EXternal Prompt features Enhanced adapteR Tuning (ExPert), has an encoder-decoder structure with adapters and injectors interspersed between the layers of a frozen transformer encoder. The adapter modules adapt the pre-trained backbone to SOD, while the injector modules incorporate external prompt features to strengthen the awareness of salient objects. Comprehensive experiments demonstrate the superiority of our method. Surpassing previous state-of-the-art (SOTA) models on five SOD datasets, ExPert achieves 0.215 mean absolute error (MAE) on the ECSSD dataset with 80.2M trained parameters, 21% better than the transformer-based SOTA model and 47% better than the CNN-based SOTA model.
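Parameter-efficient adapter tuning of a frozen transformer, the general mechanism ExPert builds on, can be sketched as follows; the bottleneck adapter here is generic, and ExPert's injector modules and external prompt features are not modeled.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Standard residual bottleneck adapter (a generic sketch; ExPert's adapter and
    injector design is more involved)."""
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class FrozenEncoderWithAdapters(nn.Module):
    def __init__(self, dim=768, num_layers=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers))
        for p in self.blocks.parameters():
            p.requires_grad = False                       # backbone stays frozen
        self.adapters = nn.ModuleList(Adapter(dim) for _ in range(num_layers))

    def forward(self, tokens):
        for block, adapter in zip(self.blocks, self.adapters):
            tokens = adapter(block(tokens))               # only adapters receive gradients
        return tokens

out = FrozenEncoderWithAdapters()(torch.randn(1, 196, 768))
print(out.shape)                                          # torch.Size([1, 196, 768])
```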
https://arxiv.org/abs/2404.15008
Explanations obtained from transformer-based architectures in the form of raw attention, can be seen as a class-agnostic saliency map. Additionally, attention-based pooling serves as a form of masking in the feature space. Motivated by this observation, we design an attention-based pooling mechanism intended to replace Global Average Pooling (GAP) at inference. This mechanism, called Cross-Attention Stream (CA-Stream), comprises a stream of cross attention blocks interacting with features at different network depths. CA-Stream enhances interpretability in models, while preserving recognition performance.
Explanations obtained from transformer-based architectures in the form of raw attention can be seen as class-agnostic saliency maps, and attention-based pooling serves as a form of masking in the feature space. Motivated by this observation, we design an attention-based pooling mechanism intended to replace Global Average Pooling (GAP) at inference. This mechanism, called Cross-Attention Stream (CA-Stream), consists of a stream of cross-attention blocks that interact with features at different network depths. CA-Stream improves the interpretability of models while preserving recognition performance.
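Replacing GAP with attention-based pooling amounts to letting a learned query attend over the spatial tokens. The block below shows a single such cross-attention pooling stage with assumed sizes; CA-Stream chains several of these across network depths.

```python
import torch
import torch.nn as nn

class CrossAttentionPooling(nn.Module):
    """A single cross-attention pooling block: a learned query attends over the
    spatial feature tokens instead of averaging them (GAP). This is one stage only,
    not the full multi-depth stream."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):                     # feats: (B, H*W, dim)
        q = self.cls.expand(feats.size(0), -1, -1)
        pooled, weights = self.attn(q, feats, feats)
        return pooled.squeeze(1), weights         # weights double as a saliency map

pooled, attn = CrossAttentionPooling()(torch.randn(2, 49, 512))
print(pooled.shape, attn.shape)                   # (2, 512) and (2, 1, 49)
```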
https://arxiv.org/abs/2404.14996
Plenty of existing work has analyzed the abilities of the transformer architecture by describing its representational capacity with formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
A great deal of existing work analyzes the abilities of the transformer architecture by describing its representational capacity with formal models of computation. So far, however, the focus has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem for the study of \emph{language models} (LMs), which are by definition \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving a concrete lower bound on their probabilistic representational capacity. This is a first step towards understanding the mechanisms transformer LMs can use to represent probability distributions over strings.
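For concreteness, the object being represented is the standard $n$-gram factorization below, in which each symbol depends only on the previous $n-1$ symbols (out-of-range indices are understood as padding symbols); the paper's result states that a transformer LM with hard or sparse attention can realize any such conditional distribution exactly.

\[
  p(y) \;=\; \prod_{t=1}^{|y|} p\!\left(y_t \,\middle|\, y_{t-n+1}, \dots, y_{t-1}\right)
\]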
https://arxiv.org/abs/2404.14994
In biological tasks, data is rarely plentiful as it is generated from hard-to-gather measurements. Therefore, pre-training foundation models on large quantities of available data and then transferring them to low-data downstream tasks is a promising direction. However, how to design effective foundation models for molecular learning remains an open question, with existing approaches typically focusing on models with large parameter capacities. In this work, we propose $\texttt{MiniMol}$, a foundational model for molecular learning with 10 million parameters. $\texttt{MiniMol}$ is pre-trained on a mix of roughly 3300 sparsely defined graph- and node-level tasks of both quantum and biological nature. The pre-training dataset includes approximately 6 million molecules and 500 million labels. To demonstrate the generalizability of $\texttt{MiniMol}$ across tasks, we evaluate it on downstream tasks from the Therapeutic Data Commons (TDC) ADMET group showing significant improvements over the prior state-of-the-art foundation model across 17 tasks. $\texttt{MiniMol}$ will be a public and open-sourced model for future research.
In biological tasks, data is rarely plentiful, as it comes from hard-to-gather measurements. Pre-training foundation models on large quantities of available data and then transferring them to low-data downstream tasks is therefore a promising direction. However, how to design effective foundation models for molecular learning remains an open question, and existing approaches typically focus on models with large parameter capacities. In this work, we propose $\texttt{MiniMol}$, a foundation model for molecular learning with 10 million parameters. $\texttt{MiniMol}$ is pre-trained on a mix of roughly 3300 sparsely defined graph- and node-level tasks of both quantum and biological nature. The pre-training dataset includes approximately 6 million molecules and 500 million labels. To demonstrate the generalizability of $\texttt{MiniMol}$ across tasks, we evaluate it on downstream tasks from the Therapeutic Data Commons (TDC) ADMET group, showing significant improvements over the prior state-of-the-art foundation model across 17 tasks. $\texttt{MiniMol}$ will be a public, open-sourced model for future research.
https://arxiv.org/abs/2404.14986
Object Re-Identification (Re-ID) aims to identify and retrieve specific objects from images captured at different places and times. Recently, object Re-ID has achieved great success with the advances of Vision Transformers (ViT). However, the effects of the global-local relation have not been fully explored in Transformers for object Re-ID. In this work, we first explore the influence of global and local features of ViT and then further propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID. We find that the features from last few layers of ViT already have a strong representational ability, and the global and local information can mutually enhance each other. Based on this fact, we propose a Global Aggregation Encoder (GAE) to utilize the class tokens of the last few Transformer layers and learn comprehensive global features effectively. Meanwhile, we propose the Local Multi-layer Fusion (LMF) which leverages both the global cues from GAE and multi-layer patch tokens to explore the discriminative local representations. Extensive experiments demonstrate that our proposed method achieves superior performance on four object Re-ID benchmarks.
Object re-identification (Re-ID) aims to identify and retrieve specific objects from images captured at different places and times. With the advances of Vision Transformers (ViT), object Re-ID has recently achieved great success. However, the effect of the global-local relationship has not been fully explored in Transformers for object Re-ID. In this work, we first explore the influence of the global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID. We find that the features from the last few layers of ViT already have strong representational ability, and that global and local information can mutually enhance each other. Based on this, we propose a Global Aggregation Encoder (GAE) that uses the class tokens of the last few Transformer layers to learn comprehensive global features effectively. Meanwhile, we propose Local Multi-layer Fusion (LMF), which leverages both the global cues from GAE and multi-layer patch tokens to learn discriminative local representations. Extensive experiments demonstrate that our method achieves superior performance on four object Re-ID benchmarks.
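The Global Aggregation Encoder described above uses the class tokens of the last few layers; a bare-bones version of that aggregation, with an assumed concatenate-and-project fusion, looks like this.

```python
import torch
import torch.nn as nn

def aggregate_class_tokens(layer_outputs, proj):
    """Sketch in the spirit of GAE (fusion details assumed): take the [CLS] token
    from each of the last few ViT layers and fuse them with a projection."""
    cls_tokens = [x[:, 0] for x in layer_outputs]        # (B, D) per selected layer
    return proj(torch.cat(cls_tokens, dim=-1))           # (B, D) fused global feature

layers = [torch.randn(2, 197, 768) for _ in range(3)]    # last three ViT layer outputs
proj = nn.Linear(3 * 768, 768)
print(aggregate_class_tokens(layers, proj).shape)        # torch.Size([2, 768])
```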
https://arxiv.org/abs/2404.14985
Panoramic distortion poses a significant challenge in 360 depth estimation, particularly pronounced at the north and south poles. Existing methods either adopt a bi-projection fusion strategy to remove distortions or model long-range dependencies to capture global structures, which can result in either unclear structure or insufficient local perception. In this paper, we propose a spherical geometry transformer, named SGFormer, to address the above issues, with an innovative step to integrate spherical geometric priors into vision transformers. To this end, we retarget the transformer decoder to a spherical prior decoder (termed SPDecoder), which endeavors to uphold the integrity of spherical structures during decoding. Concretely, we leverage bipolar re-projection, circular rotation, and curve local embedding to preserve the spherical characteristics of equidistortion, continuity, and surface distance, respectively. Furthermore, we present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions. It not only boosts the global perception of spatial position but also sharpens the depth structure across different patches. Finally, we conduct extensive experiments on popular benchmarks, demonstrating our superiority over state-of-the-art solutions.
Panoramic distortion poses a significant challenge for 360-degree depth estimation and is particularly pronounced at the north and south poles. Existing methods either adopt a bi-projection fusion strategy to remove distortion or model long-range dependencies to capture global structure, which can lead either to unclear structure or to insufficient local perception. In this paper, we propose a spherical geometry transformer, named SGFormer, to address these issues, taking an innovative step to integrate spherical geometric priors into vision transformers. To this end, we retarget the transformer decoder into a spherical prior decoder (termed SPDecoder), which strives to preserve the integrity of spherical structures during decoding. Concretely, we leverage bipolar re-projection, circular rotation, and curve local embedding to preserve the spherical characteristics of equidistortion, continuity, and surface distance, respectively. Furthermore, we present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions; it not only boosts the global perception of spatial position but also sharpens the depth structure across different patches. Finally, we conduct extensive experiments on popular benchmarks, demonstrating our superiority over state-of-the-art solutions.
https://arxiv.org/abs/2404.14979
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformer in multiple areas with only linear complexity. However, the straightforward adoption of Mamba does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning to enhance local feature extraction, achieving superior performance, high efficiency, and scalability potential. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. Additionally, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works in multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple SoTA, including an overall accuracy of 92.6% (train from scratch) on the ScanObjectNN and 95.1% (with single-modal pre-training) on the ModelNet40 classification task, with only linear complexity.
Existing Transformer-based models for point cloud analysis suffer from quadratic complexity, leading to compromised point cloud resolution and information loss. In contrast, the newly proposed Mamba model, based on state space models (SSM), outperforms Transformers in multiple areas with only linear complexity. However, adopting Mamba directly does not achieve satisfactory performance on point cloud tasks. In this work, we present Mamba3D, a state space model tailored for point cloud learning that enhances local feature extraction and achieves superior performance, high efficiency, and scalability. Specifically, we propose a simple yet effective Local Norm Pooling (LNP) block to extract local geometric features. In addition, to obtain better global features, we introduce a bidirectional SSM (bi-SSM) with both a token-forward SSM and a novel backward SSM that operates on the feature channel. Extensive experimental results show that Mamba3D surpasses Transformer-based counterparts and concurrent works on multiple tasks, with or without pre-training. Notably, Mamba3D achieves multiple state-of-the-art results, including an overall accuracy of 92.6% on ScanObjectNN (trained from scratch) and 95.1% on the ModelNet40 classification task (with single-modal pre-training), all with only linear complexity.
https://arxiv.org/abs/2404.14966
Hyperspectral image classification is a challenging task due to the high dimensionality and complex nature of hyperspectral data. In recent years, deep learning techniques have emerged as powerful tools for addressing these challenges. This survey provides a comprehensive overview of the current trends and future prospects in hyperspectral image classification, focusing on the advancements from deep learning models to the emerging use of transformers. We review the key concepts, methodologies, and state-of-the-art approaches in deep learning for hyperspectral image classification. Additionally, we discuss the potential of transformer-based models in this field and highlight the advantages and challenges associated with these approaches. Comprehensive experiments have been undertaken using three hyperspectral datasets to verify the efficacy of various conventional deep-learning models and Transformers. Finally, we outline future research directions and potential applications that can further enhance the accuracy and efficiency of hyperspectral image classification. The source code is available at this https URL.
Hyperspectral image classification is a challenging task because of the high dimensionality and complex nature of hyperspectral data. In recent years, deep learning techniques have emerged as powerful tools for addressing these challenges. This survey provides a comprehensive overview of current trends and future prospects in hyperspectral image classification, focusing on the progression from deep learning models to the emerging use of transformers. We review the key concepts, methodologies, and state-of-the-art approaches of deep learning for hyperspectral image classification. We also discuss the potential of transformer-based models in this field and highlight the advantages and challenges of these approaches. Comprehensive experiments on three hyperspectral datasets verify the efficacy of various conventional deep-learning models and transformers. Finally, we outline future research directions and potential applications that can further improve the accuracy and efficiency of hyperspectral image classification. The source code is available at this https URL.
https://arxiv.org/abs/2404.14955
Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting a gesture's beginning and end. Research on automatic gesture detection has primarily focused on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals to detect gestures that co-occur with speech. This work addresses this gap by focusing on co-speech gesture detection, emphasising the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and differences in sampling rate between modalities. We investigate extended speech time windows and employ separate backbone models for each modality to address the temporal misalignment and sampling rate differences. We utilize Transformer encoders in cross-modal and early fusion techniques to effectively align and integrate speech and skeletal sequences. The study results show that combining visual and speech information significantly enhances gesture detection performance. Our findings indicate that expanding the speech buffer beyond visual time segments improves performance and that multimodal integration using cross-modal and early fusion techniques outperforms baseline methods using unimodal and late fusion methods. Additionally, we find a correlation between the models' gesture prediction confidence and low-level speech frequency features potentially associated with gestures. Overall, the study provides a better understanding and detection methods for co-speech gestures, facilitating the analysis of multimodal communication.
Gestures are inherent to human interaction and often complement speech in face-to-face communication, forming a multimodal communication system. An important task in gesture analysis is detecting when a gesture begins and ends. Research on automatic gesture detection has mainly relied on visual and kinematic information to detect a limited set of isolated or silent gestures with low variability, neglecting the integration of speech and vision signals for detecting gestures that co-occur with speech. This work addresses that gap by focusing on co-speech gesture detection, emphasizing the synchrony between speech and co-speech hand gestures. We address three main challenges: the variability of gesture forms, the temporal misalignment between gesture and speech onsets, and the difference in sampling rate between modalities. We investigate extended speech time windows and employ separate backbone models for each modality to handle the temporal misalignment and sampling-rate differences. We use Transformer encoders in cross-modal and early fusion techniques to effectively align and integrate the speech and skeletal sequences. The results show that combining visual and speech information significantly improves gesture detection performance. Our findings indicate that extending the speech buffer beyond the visual time segments improves performance, and that multimodal integration using cross-modal and early fusion techniques outperforms baselines that use unimodal and late fusion methods. We also find a correlation between the models' gesture-prediction confidence and low-level speech-frequency features potentially associated with gestures. Overall, the study provides a better understanding of, and better detection methods for, co-speech gestures, facilitating the analysis of multimodal communication.
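One way to realize the cross-modal/early fusion described above is to project both streams to a shared width, tag tokens with a modality embedding, and run a single Transformer encoder over the concatenated sequence. The sketch below does exactly that with assumed feature sizes (75-D skeletal frames, 40-D speech frames) and is not the authors' model.

```python
import torch
import torch.nn as nn

class CrossModalGestureDetector(nn.Module):
    """Early-fusion sketch (sizes and backbones assumed): per-modality projections
    bring skeletal and speech features to a shared width, a modality embedding tags
    each token, and one Transformer encoder attends across both streams before
    frame-level tagging."""
    def __init__(self, skel_dim=75, speech_dim=40, dim=128):
        super().__init__()
        self.skel_proj = nn.Linear(skel_dim, dim)
        self.speech_proj = nn.Linear(speech_dim, dim)
        self.modality_emb = nn.Embedding(2, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
        self.head = nn.Linear(dim, 2)                    # gesture vs. no-gesture per frame

    def forward(self, skel, speech):                     # (B, Tv, 75), (B, Ta, 40)
        s = self.skel_proj(skel) + self.modality_emb.weight[0]
        a = self.speech_proj(speech) + self.modality_emb.weight[1]
        fused = self.encoder(torch.cat([s, a], dim=1))   # joint attention over both streams
        return self.head(fused[:, : skel.size(1)])       # predictions for the video frames

# Speech window deliberately longer than the visual window, as the abstract suggests.
logits = CrossModalGestureDetector()(torch.randn(2, 30, 75), torch.randn(2, 90, 40))
print(logits.shape)                                      # torch.Size([2, 30, 2])
```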
https://arxiv.org/abs/2404.14952