We propose Lodge, a network capable of generating extremely long dance sequences conditioned on given music. We design Lodge as a two-stage coarse-to-fine diffusion architecture, and propose characteristic dance primitives with significant expressiveness as intermediate representations between the two diffusion models. The first stage is global diffusion, which focuses on comprehending the coarse-level music-dance correlation and producing characteristic dance primitives. The second stage is local diffusion, which generates detailed motion sequences in parallel under the guidance of the dance primitives and choreographic rules. In addition, we propose a Foot Refine Block to optimize the contact between the feet and the ground, enhancing the physical realism of the motion. Our approach can generate extremely long dance sequences in parallel, striking a balance between global choreographic patterns and local motion quality and expressiveness. Extensive experiments validate the efficacy of our method.
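A minimal sketch of the two-stage idea, with assumed pose/music dimensions and toy MLP denoisers standing in for the authors' networks: a global model denoises a few sparse, expressive primitives from coarse music features, and a local model denoises each short segment in parallel, conditioned on the primitive that anchors it.

```python
import torch
import torch.nn as nn

class GlobalDiffusion(nn.Module):
    """Denoises K sparse primitive poses conditioned on coarse music features."""
    def __init__(self, music_dim=35, pose_dim=139, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + music_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, noisy_primitives, music, t):
        # noisy_primitives: (B, K, pose_dim); music: (B, K, music_dim); t: (B, K, 1)
        return self.net(torch.cat([noisy_primitives, music, t], dim=-1))

class LocalDiffusion(nn.Module):
    """Denoises one short motion segment guided by its anchoring primitive."""
    def __init__(self, music_dim=35, pose_dim=139, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim * 2 + music_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, noisy_segment, primitive, music, t):
        # noisy_segment: (B, T, pose_dim); the primitive is broadcast over T frames.
        cond = primitive.unsqueeze(1).expand(-1, noisy_segment.shape[1], -1)
        return self.net(torch.cat([noisy_segment, cond, music, t], dim=-1))

# Segments have no sequential dependency on each other, so all of them can be
# batched and denoised in parallel -- the key to very long sequences.
g, l = GlobalDiffusion(), LocalDiffusion()
prims = g(torch.randn(2, 8, 139), torch.randn(2, 8, 35), torch.rand(2, 8, 1))
segs = l(torch.randn(16, 30, 139), prims.reshape(16, 139),
         torch.randn(16, 30, 35), torch.rand(16, 30, 1))
```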
https://arxiv.org/abs/2403.10518
Integrating Large Language Models (LLMs) and Vision-Language Models (VLMs) with robotic systems enables robots to process and understand complex natural language instructions and visual information. However, a fundamental challenge remains: for robots to fully capitalize on these advancements, they must have a deep understanding of their physical embodiment. The gap between AI models' cognitive capabilities and their understanding of physical embodiment leads to the following question: Can a robot autonomously understand and adapt to its physical form and functionalities through interaction with its environment? This question underscores the transition towards developing self-modeling robots that do not rely on external sensing or pre-programmed knowledge about their structure. Here, we propose a meta-self-modeling approach that can deduce robot morphology through proprioception (the internal sense of position and movement). Our study introduces a 12-DoF reconfigurable legged robot, accompanied by a diverse dataset of 200k unique configurations, to systematically investigate the relationship between robotic motion and robot morphology. Utilizing a deep neural network model comprising a robot signature encoder and a configuration decoder, we demonstrate the capability of our system to accurately predict robot configurations from proprioceptive signals. This research contributes to the field of robotic self-modeling, aiming to enhance robots' understanding of their physical embodiment and their adaptability in real-world scenarios.
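A minimal sketch of the encoder/decoder idea under assumed shapes (the names SignatureEncoder/ConfigurationDecoder and the module/choice counts are illustrative, not the paper's): a recurrent encoder compresses a proprioceptive sequence into a "robot signature", and an MLP head decodes the signature into a discrete configuration.

```python
import torch
import torch.nn as nn

class SignatureEncoder(nn.Module):
    def __init__(self, num_joints=12, sig_dim=128):
        super().__init__()
        self.gru = nn.GRU(num_joints, sig_dim, batch_first=True)

    def forward(self, proprio):           # proprio: (B, T, 12) joint angles over time
        _, h = self.gru(proprio)
        return h[-1]                      # (B, sig_dim) robot signature

class ConfigurationDecoder(nn.Module):
    def __init__(self, sig_dim=128, num_modules=4, choices_per_module=6):
        super().__init__()
        self.head = nn.Linear(sig_dim, num_modules * choices_per_module)
        self.num_modules, self.choices = num_modules, choices_per_module

    def forward(self, sig):               # logits per reconfigurable module
        return self.head(sig).view(-1, self.num_modules, self.choices)

enc, dec = SignatureEncoder(), ConfigurationDecoder()
logits = dec(enc(torch.randn(8, 200, 12)))     # (8, 4, 6)
pred_config = logits.argmax(-1)                # predicted morphology per module
```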
https://arxiv.org/abs/2403.10496
Audiovisual emotion recognition (ER) in videos has immense potential compared to unimodal approaches, as it effectively leverages the inter- and intra-modal dependencies between visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. The framework aims to exploit the complementary nature of audio and visual cues in videos (facial expressions and vocal patterns), leading to superior performance compared to relying on a single modality. The proposed model uses separate backbones to capture intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods in ER tasks.
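A hedged sketch of the cross-attention fusion step, assuming the per-modality backbones have already produced 256-dimensional token sequences and that there are 7 emotion classes (both assumptions):

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """Audio queries attend to visual keys/values, and vice versa."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):          # (B, Ta, D), (B, Tv, D)
        a = audio + self.a2v(audio, visual, visual)[0]   # audio enriched by vision
        v = visual + self.v2a(visual, audio, audio)[0]   # vision enriched by audio
        return a, v

block = CrossModalBlock()
a, v = block(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
# Pool each fused stream over time and classify jointly (toy head):
emotion_logits = nn.Linear(512, 7)(torch.cat([a.mean(1), v.mean(1)], dim=-1))
```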
https://arxiv.org/abs/2403.10488
Machine Learning (ML) is becoming increasingly important for IoT-based applications. However, the dynamic and ad-hoc nature of many IoT ecosystems poses unique challenges to the efficacy of ML algorithms. One such challenge is data incompleteness, which is manifested as missing sensor readings. Many factors, including sensor failures and/or network disruption, can cause data incompleteness. Furthermore, most IoT systems are severely power-constrained. It is important that we build IoT-based ML systems that are robust against data incompleteness while simultaneously being energy efficient. This paper presents an empirical study of SECOE - a recent technique for alleviating data incompleteness in IoT - with respect to its energy bottlenecks. Towards addressing the energy bottlenecks of SECOE, we propose ENAMLE - a proactive, energy-aware technique for mitigating the impact of concurrent missing data. ENAMLE is unique in the sense that it builds an energy-aware ensemble of sub-models, each trained with a subset of sensors chosen carefully based on their correlations. Furthermore, at inference time, ENAMLE adaptively alters the number of models in the ensemble based on the missing-data rate and the energy-accuracy trade-off. ENAMLE's design includes several novel mechanisms for minimizing energy consumption while maintaining accuracy. We present extensive experimental studies on two distinct datasets that demonstrate the energy efficiency of ENAMLE and its ability to mitigate the impact of sensor failures.
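A hedged sketch of the adaptive-ensemble idea; the selection policy below (more missing data buys more sub-models, cheapest first, up to a budget) is our illustration of the energy-accuracy trade-off, not ENAMLE's published algorithm:

```python
def select_submodels(submodels, present_sensors, missing_rate, max_models=5):
    """Pick which sub-models to run given which sensors actually reported."""
    # Only sub-models whose entire sensor subset is present are usable.
    eligible = [m for m in submodels if m["sensors"] <= present_sensors]
    # Assumed policy: more missing data -> spend more energy (more models)
    # to recover accuracy, capped by the budget.
    budget = min(max_models, 1 + int(missing_rate * max_models))
    # Prefer cheaper sub-models first to minimize energy at equal coverage.
    return sorted(eligible, key=lambda m: m["energy_cost"])[:budget]

submodels = [
    {"name": "m1", "sensors": {"temp", "hum"}, "energy_cost": 1.0},
    {"name": "m2", "sensors": {"temp", "co2"}, "energy_cost": 1.5},
    {"name": "m3", "sensors": {"hum"}, "energy_cost": 0.5},
]
chosen = select_submodels(submodels, present_sensors={"temp", "hum"},
                          missing_rate=0.3)
print([m["name"] for m in chosen])   # ['m3', 'm1']: m2 is blocked by missing co2
```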
https://arxiv.org/abs/2403.10371
Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views, and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as the face, hands, or cloth wrinkles. They are also easily prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper, we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and capture spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation for points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point cloud, or RGB-D data as input. In addition, we introduce ANIM-Real, a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera captures, and our protocol to fine-tune ANIM, enabling high-quality reconstruction from real-world human capture.
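A minimal sketch, under assumed shapes and normalized-coordinate conventions, of combining pixel-aligned and voxel-aligned features for a 3D query point before it is decoded into a signed distance value:

```python
import torch
import torch.nn.functional as F

def query_features(pix_feats, vox_feats, pts, cam_xy):
    """pix_feats: (B, C, H, W) image features; vox_feats: (B, C, D, H, W) voxel
    features; pts: (B, N, 3) query points in [-1, 1]^3; cam_xy: (B, N, 2)
    projected image coordinates in [-1, 1]^2."""
    pix = F.grid_sample(pix_feats, cam_xy.unsqueeze(2), align_corners=True)
    pix = pix.squeeze(-1)                                  # (B, C, N)
    vox = F.grid_sample(vox_feats, pts.view(pts.shape[0], -1, 1, 1, 3),
                        align_corners=True).reshape(pix.shape)
    # Concatenated per-point features would then feed an SDF-regressing MLP.
    return torch.cat([pix, vox], dim=1)                    # (B, 2C, N)

feats = query_features(torch.randn(1, 16, 64, 64),
                       torch.randn(1, 16, 16, 16, 16),
                       torch.rand(1, 100, 3) * 2 - 1,
                       torch.rand(1, 100, 2) * 2 - 1)
print(feats.shape)    # torch.Size([1, 32, 100])
```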
https://arxiv.org/abs/2403.10357
Accelerating dynamic MRI is essential for enhancing clinical applications, such as adaptive radiotherapy, and improving patient comfort. Traditional deep learning (DL) approaches for accelerated dynamic MRI reconstruction typically rely on predefined or random subsampling patterns, applied uniformly across all temporal phases. This standard practice overlooks the potential benefits of leveraging temporal correlations and lacks the adaptability required for case-specific subsampling optimization, which holds the potential for maximizing reconstruction quality. Addressing this gap, we present a novel end-to-end framework for adaptive dynamic MRI subsampling and reconstruction. Our pipeline integrates a DL-based adaptive sampler, generating case-specific dynamic subsampling patterns, trained end-to-end with a state-of-the-art 2D dynamic reconstruction network, namely vSHARP, which effectively reconstructs the adaptive dynamic subsampled data into a moving image. Our method is assessed using dynamic cine cardiac MRI data, comparing its performance against vSHARP models that employ common subsampling trajectories, and pipelines trained to optimize dataset-specific sampling schemes alongside vSHARP reconstruction. Our results indicate superior reconstruction quality, particularly at high accelerations.
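A hedged sketch of the adaptive-sampling idea; the actual pipeline pairs a learned dynamic sampler with the vSHARP reconstructor, whereas the tiny scoring net and top-k relaxation below are our stand-ins. The sampler scores k-space lines per temporal phase from always-acquired calibration data, and the resulting case-specific mask drives the subsampling:

```python
import torch
import torch.nn as nn

class LineSampler(nn.Module):
    """Scores k-space lines per temporal phase; keeps the top-k per phase."""
    def __init__(self, num_lines=128, phases=12, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_lines, hidden), nn.ReLU(),
            nn.Linear(hidden, phases * num_lines))
        self.phases, self.num_lines = phases, num_lines

    def forward(self, autocalib, budget):
        # autocalib: (B, num_lines) summary of the acquired centre lines.
        scores = self.net(autocalib).view(-1, self.phases, self.num_lines)
        idx = scores.topk(budget, dim=-1).indices        # case-specific pattern
        mask = torch.zeros_like(scores).scatter_(-1, idx, 1.0)
        # Straight-through trick keeps the sampler trainable end-to-end with
        # the downstream reconstruction network.
        return mask + scores.softmax(-1) - scores.softmax(-1).detach()

mask = LineSampler()(torch.randn(2, 128), budget=16)  # (2, 12, 128), 16 lines/phase
```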
https://arxiv.org/abs/2403.10346
Transformers have demonstrated their effectiveness in image restoration tasks. Existing Transformer architectures typically comprise two essential components: multi-head self-attention and a feed-forward network (FFN). The former captures long-range pixel dependencies, while the latter enables the model to learn complex patterns and relationships in the data. Previous studies have demonstrated that FFNs act as key-value memories (Geva et al., 2020), which are vital in modern Transformer architectures. In this paper, we conduct an empirical study to explore the potential of attention mechanisms without FFNs and provide novel structures to demonstrate that removing the FFN is feasible for image restoration. Specifically, we propose Continuous Scaling Attention (CSAttn), a method that computes attention continuously in three stages without using an FFN. To achieve competitive performance, we propose a series of key components within the attention mechanism. Our designs provide a closer look at the attention mechanism and reveal that some simple operations can significantly affect model performance. We apply our CSAttn to several image restoration tasks and show that our model can outperform CNN-based and Transformer-based image restoration approaches.
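A minimal sketch of an FFN-free block, with assumed details: three attention stages applied back-to-back, each with its own learnable residual scale standing in for the "continuous scaling"; the real CSAttn adds further key components beyond this.

```python
import torch
import torch.nn as nn

class CSAttnBlock(nn.Module):
    """Three consecutive attention stages, no feed-forward network anywhere."""
    def __init__(self, dim=64, heads=4, stages=3):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(stages))
        self.attns = nn.ModuleList(
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(stages))
        self.scales = nn.ParameterList(
            nn.Parameter(torch.ones(dim) * 0.1) for _ in range(stages))

    def forward(self, x):                      # x: (B, N, D) flattened pixels
        for norm, attn, scale in zip(self.norms, self.attns, self.scales):
            h = norm(x)
            x = x + scale * attn(h, h, h)[0]   # learnable per-channel scaling
        return x

x = torch.randn(2, 16 * 16, 64)
print(CSAttnBlock()(x).shape)                  # torch.Size([2, 256, 64])
```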
https://arxiv.org/abs/2403.10336
We present a knowledge integration framework (called KIF) that uses Wikidata as a lingua franca to integrate heterogeneous knowledge bases. These can be triplestores, relational databases, CSV files, etc., which may or may not use the Wikidata dialect of RDF. KIF leverages Wikidata's data model and vocabulary plus user-defined mappings to expose a unified view of the integrated bases while keeping track of the context and provenance of their statements. The result is a virtual knowledge base which behaves like an "extended Wikidata" and which can be queried either through an efficient filter interface or using SPARQL. We present the design and implementation of KIF, discuss how we have used it to solve a real integration problem in the domain of chemistry (involving Wikidata, PubChem, and IBM CIRCA), and present experimental results on the performance and overhead of KIF.
https://arxiv.org/abs/2403.10304
Exploring and mining subtle yet distinctive features between sub-categories with similar appearances is crucial for fine-grained visual categorization (FGVC). However, less effort has been devoted to assessing the quality of extracted visual representations. Intuitively, the network may struggle to capture discriminative features from low-quality samples, which leads to a significant decline in FGVC performance. To tackle this challenge, we propose a weakly supervised Context-Semantic Quality Awareness Network (CSQA-Net) for FGVC. In this network, to model the spatial contextual relationship between rich part descriptors and global semantics and thereby capture more discriminative details within the object, we design a novel multi-part and multi-scale cross-attention (MPMSCA) module. Before features are fed to the MPMSCA module, a part navigator resolves scale confusion problems and accurately identifies local distinctive regions. Furthermore, we propose a generic multi-level semantic quality evaluation module (MLSQE) to progressively supervise and enhance hierarchical semantics from different levels of the backbone network. Finally, context-aware features from MPMSCA and semantically enhanced features from MLSQE are fed into the corresponding quality probing classifiers to evaluate their quality in real time, thus boosting the discriminability of feature representations. Comprehensive experiments on four popular and highly competitive FGVC datasets demonstrate the superiority of the proposed CSQA-Net in comparison with state-of-the-art methods.
https://arxiv.org/abs/2403.10298
Classical structure-based visual localization methods offer high accuracy but face trade-offs in terms of storage, speed, and privacy. A recent innovation, a keypoint scene coordinate regression (KSCR) method named D2S, addresses these issues by leveraging graph attention networks to enhance keypoint relationships and predict their 3D coordinates using a simple multilayer perceptron (MLP). Camera pose is then determined via PnP+RANSAC, using established 2D-3D correspondences. While KSCR achieves competitive results, rivaling state-of-the-art image-retrieval methods like HLoc across multiple benchmarks, its performance is hindered when data samples are limited, due to the deep learning model's reliance on extensive data. This paper proposes a solution to this challenge by introducing a pipeline for keypoint descriptor synthesis using Neural Radiance Fields (NeRF). By generating novel poses and feeding them into a trained NeRF model to create new views, our approach enhances KSCR's generalization capabilities in data-scarce environments. The proposed system can improve localization accuracy by up to 50\% and costs only a fraction of the time for data synthesis. Furthermore, its modular design allows for the integration of multiple NeRFs, offering a versatile and efficient solution for visual localization. The implementation is publicly available at: this https URL.
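A sketch of the descriptor-synthesis loop. The pose-perturbation scheme below is an assumption, and the rendering/detection steps are described only in comments because they depend on the trained NeRF and keypoint models:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def perturb_pose(pose, trans_std=0.05, rot_std=0.02, rng=None):
    """Sample a novel camera-to-world pose near a known one (4x4 matrix)."""
    rng = rng or np.random.default_rng()
    new_pose = pose.copy()
    new_pose[:3, 3] += rng.normal(0.0, trans_std, 3)     # jitter position
    d_rot = Rotation.from_euler("xyz", rng.normal(0.0, rot_std, 3)).as_matrix()
    new_pose[:3, :3] = d_rot @ new_pose[:3, :3]          # jitter orientation
    return new_pose

# Per novel pose: render a view with the trained NeRF, run the keypoint
# detector/descriptor on it, read each keypoint's 3D coordinate off the NeRF
# depth, and add the resulting (descriptor, 3D point) pairs to the KSCR
# training set -- cheap data for data-scarce scenes.
novel = perturb_pose(np.eye(4))
print(novel[:3, 3])   # a new viewpoint a few centimetres from the original
```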
https://arxiv.org/abs/2403.10297
In this study, we address the intricate challenge of multi-task dense prediction, encompassing tasks such as semantic segmentation, depth estimation, and surface normal estimation, particularly when dealing with partially annotated data (MTPSL). The complexity arises from the absence of complete task labels for each training image. Given the inter-related nature of these pixel-wise dense tasks, our focus is on mining and capturing cross-task relationships. Existing solutions typically rely on learning global image representations for global cross-task image matching, imposing constraints that, unfortunately, sacrifice the finer structures within the images. Attempting local matching as a remedy faces hurdles due to the lack of precise region supervision, making local alignment a challenging endeavor. The introduction of the Segment Anything Model (SAM) sheds light on local alignment by providing free, high-quality region detection. Leveraging SAM-detected regions, the subsequent challenge lies in aligning the representations within these regions. Diverging from conventional methods that directly learn a monolithic image representation, our proposal involves modeling region-wise representations using Gaussian distributions. Aligning these distributions between corresponding regions from different tasks imparts higher flexibility and capacity to capture intra-region structures, accommodating a broader range of tasks. This innovative approach significantly enhances our ability to effectively capture cross-task relationships, resulting in improved overall performance in partially supervised multi-task dense prediction scenarios. Extensive experiments conducted on two widely used benchmarks underscore the superior effectiveness of our proposed method, showcasing state-of-the-art performance even when compared to fully supervised methods.
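A minimal sketch, with assumed details, of region-wise Gaussian alignment: features inside a SAM-detected region are summarized by a diagonal Gaussian per task branch, and the two distributions are pulled together with a symmetric KL divergence (the paper's exact alignment objective may differ):

```python
import torch

def region_gaussian(feats, mask):
    """feats: (C, H, W); mask: (H, W) bool -> per-region mean/var over pixels."""
    region = feats[:, mask]                      # (C, N) pixels in the region
    return region.mean(dim=1), region.var(dim=1) + 1e-6

def sym_kl_diag(mu1, var1, mu2, var2):
    """Symmetric KL between diagonal Gaussians, summed over channels."""
    kl12 = 0.5 * (var1 / var2 + (mu2 - mu1) ** 2 / var2 - 1
                  + torch.log(var2 / var1))
    kl21 = 0.5 * (var2 / var1 + (mu1 - mu2) ** 2 / var1 - 1
                  + torch.log(var1 / var2))
    return (kl12 + kl21).sum()

f_seg, f_depth = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
mask = torch.zeros(32, 32, dtype=torch.bool)
mask[8:20, 8:20] = True                          # one SAM-detected region
loss = sym_kl_diag(*region_gaussian(f_seg, mask),
                   *region_gaussian(f_depth, mask))
```

Matching distributions rather than single vectors lets the loss respect the spread of features inside each region, not just its average.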
https://arxiv.org/abs/2403.10252
This paper explores the problem of continual learning (CL) of vision-language models (VLMs) in open domains, where the models need to perform continual updating and inference on a stream of datasets from diverse seen and unseen domains with novel classes. Such a capability is crucial for various applications in open environments, e.g., AI assistants, autonomous driving systems, and robotics. Current CL studies mostly focus on closed-set scenarios in a single domain with known classes. Large pre-trained VLMs like CLIP have demonstrated superior zero-shot recognition ability, and a number of recent studies leverage this ability to mitigate catastrophic forgetting in CL, but they focus on closed-set CL on a single-domain dataset. Open-domain CL of large VLMs is significantly more challenging due to 1) large class correlations and domain gaps across the datasets and 2) the forgetting of zero-shot knowledge in the pre-trained VLMs in addition to the knowledge learned from the newly adapted datasets. In this work we introduce a novel approach, termed CoLeCLIP, that learns an open-domain CL model based on CLIP. It addresses these challenges by jointly learning a set of task prompts and a cross-domain class vocabulary. Extensive experiments on 11 domain datasets show that CoLeCLIP outperforms state-of-the-art methods for open-domain CL under both task- and class-incremental learning settings.
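A minimal sketch (our simplification, not CoLeCLIP's full method) of the two ingredients named in the abstract: lightweight per-task prompts attached to a frozen CLIP, and a cross-domain class vocabulary that accumulates one text embedding per class name across all tasks seen so far:

```python
import torch
import torch.nn as nn

class PromptPool(nn.Module):
    """One small learnable prompt per task, applied as a residual adapter."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.embed_dim = embed_dim
        self.prompts = nn.ParameterDict()

    def add_task(self, task_id):
        self.prompts[task_id] = nn.Parameter(torch.zeros(self.embed_dim))

    def forward(self, image_feat, task_id):
        return image_feat + self.prompts[task_id]

vocab = {}                                      # class name -> text embedding
def update_vocabulary(class_names, encode_text):
    for name in class_names:                    # new classes extend the vocab;
        if name not in vocab:                   # seen ones keep their embedding,
            vocab[name] = encode_text(name)     # protecting zero-shot knowledge

# Toy text encoder so the sketch runs; in practice this is CLIP's text tower.
update_vocabulary(["cat", "dog"], lambda n: torch.randn(512))
update_vocabulary(["dog", "car"], lambda n: torch.randn(512))
print(sorted(vocab))                            # ['car', 'cat', 'dog']
```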
https://arxiv.org/abs/2403.10245
In recent years, there has been a gradual increase in the performance of Complementary Metal Oxide Semiconductor (CMOS) cameras. These cameras have gained popularity as a viable alternative to charge-coupled device (CCD) cameras in a wide range of applications. One particular application is the CMOS camera installed in small space telescopes. However, the limited power and spatial resources available on satellites present challenges in maintaining ideal observation conditions, including the temperature and radiation environment. Consequently, images captured by CMOS cameras are susceptible to issues such as dark current noise and defective pixels. In this paper, we introduce a data-driven framework for mitigating dark current noise and bad pixels in CMOS cameras. Our approach involves two key steps: pixel clustering and function fitting. During the pixel clustering step, we identify and group pixels exhibiting similar dark current noise properties. Subsequently, in the function fitting step, we formulate functions that capture the relationship between dark current and temperature, as dictated by the Arrhenius law. Our framework leverages ground-based test data to establish distinct temperature-dark current relations for pixels within different clusters. The clustering results can then be used to estimate the dark current noise level and to detect bad pixels in real observational data. To assess the effectiveness of our approach, we have conducted tests using real observation data obtained from the Yangwang-1 satellite, equipped with a near-ultraviolet telescope and an optical telescope. The results show a considerable improvement in the detection efficiency of space-based telescopes.
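A short sketch of the function-fitting step. The Arrhenius law models dark current as D(T) = D0 * exp(-Ea / (k_B * T)), so each pixel cluster's (D0, Ea) can be fitted as a straight line in (1/T, log D). The numbers below are synthetic, not Yangwang-1 data:

```python
import numpy as np

K_B = 8.617e-5                                    # Boltzmann constant, eV/K

def arrhenius(T, D0, Ea):
    """Dark current model D(T) = D0 * exp(-Ea / (k_B * T))."""
    return D0 * np.exp(-Ea / (K_B * T))

# Synthetic ground-test data for one pixel cluster (D0 and Ea are made up).
rng = np.random.default_rng(0)
T = np.linspace(253.0, 293.0, 9)                  # test temperatures, kelvin
dark = arrhenius(T, 1e7, 0.6) * (1 + 0.05 * rng.normal(size=T.size))

# log D is linear in 1/T, so the fit reduces to least squares on a line:
slope, intercept = np.polyfit(1.0 / T, np.log(dark), 1)
Ea_fit, D0_fit = -slope * K_B, np.exp(intercept)
print(f"Ea = {Ea_fit:.3f} eV, D0 = {D0_fit:.3g}")
# At an on-orbit temperature T_obs, arrhenius(T_obs, D0_fit, Ea_fit) predicts
# the cluster's dark level; pixels far above their cluster's prediction can be
# flagged as bad.
```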
https://arxiv.org/abs/2403.10206
If a person firmly believes in a non-factual statement, such as "The Earth is flat", and argues in its favor, there is no inherent intention to deceive. As the argumentation stems from genuine belief, it may be unlikely to exhibit the linguistic properties associated with deception or lying. This interplay of factuality, personal belief, and intent to deceive remains an understudied area. Disentangling the influence of these variables in argumentation is crucial to gain a better understanding of the linguistic properties attributed to each of them. To study the relation between deception and factuality, based on belief, we present the DeFaBel corpus, a crowd-sourced resource of belief-based deception. To create this corpus, we devise a study in which participants are instructed to write arguments supporting statements like "eating watermelon seeds can cause indigestion", regardless of the statement's factual accuracy or their personal beliefs about it. In addition to the generation task, we ask them to disclose their belief about the statement. The collected instances are labelled as deceptive if the arguments contradict the participants' personal beliefs. Each instance in the corpus is thus annotated (or implicitly labelled) with the personal beliefs of the author, the factuality of the statement, and the intended deceptiveness. The DeFaBel corpus contains 1031 texts in German, out of which 643 are deceptive and 388 are non-deceptive. It is the first publicly available corpus for studying deception in German. In our analysis, we find that people are more confident in the persuasiveness of their arguments when the statement is aligned with their belief, but surprisingly less confident when they are generating arguments in favor of facts. The DeFaBel corpus can be obtained from this https URL
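A tiny sketch of the corpus's labelling rule (field names are our assumptions): an argument counts as deceptive exactly when it contradicts the author's disclosed belief, independently of the statement's factuality:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    statement: str
    argument_supports: bool   # arguments are written in favour of the statement
    author_believes: bool     # the participant's disclosed personal belief
    statement_is_factual: bool

def is_deceptive(x: Instance) -> bool:
    # Arguing for something you do not believe (or against something you do)
    # is deception here -- even if the statement happens to be true.
    return x.argument_supports != x.author_believes

ex = Instance("eating watermelon seeds can cause indigestion",
              argument_supports=True, author_believes=False,
              statement_is_factual=False)
print(is_deceptive(ex))   # True: sincere belief, not factuality, decides the label
```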
https://arxiv.org/abs/2403.10185
Lifted inference exploits symmetries in probabilistic graphical models by using a representative for indistinguishable objects, thereby speeding up query answering while maintaining exact answers. Even though lifting is a well-established technique for the task of probabilistic inference in relational domains, it has not yet been applied to the task of causal inference. In this paper, we show how lifting can be applied to efficiently compute causal effects in relational domains. More specifically, we introduce parametric causal factor graphs as an extension of parametric factor graphs incorporating causal knowledge and give a formal semantics of interventions therein. We further present the lifted causal inference algorithm to compute causal effects on a lifted level, thereby drastically speeding up causal inference compared to propositional inference, e.g., in causal Bayesian networks. In our empirical evaluation, we demonstrate the effectiveness of our approach.
https://arxiv.org/abs/2403.10184
Event cameras offer high temporal resolution and dynamic range with minimal motion blur, making them promising for object detection tasks. While Spiking Neural Networks (SNNs) are a natural match for event-based sensory data and enable ultra-energy-efficient and low-latency inference on neuromorphic hardware, Artificial Neural Networks (ANNs) tend to display more stable training dynamics and faster convergence, resulting in greater task performance. Hybrid SNN-ANN approaches are a promising alternative, making it possible to leverage the strengths of both SNN and ANN architectures. In this work, we introduce the first Hybrid Attention-based SNN-ANN backbone for object detection using event cameras. We propose a novel attention-based SNN-ANN bridge module to capture sparse spatial and temporal relations from the SNN layer and convert them into dense feature maps for the ANN part of the backbone. Experimental results demonstrate that our proposed method surpasses baseline hybrid and SNN-based approaches by significant margins, with results comparable to existing ANN-based methods. Extensive ablation studies confirm the effectiveness of our proposed modules and architectural choices. These results pave the way toward a hybrid SNN-ANN architecture that achieves ANN-like performance at a drastically reduced parameter budget. We implemented the SNN blocks on digital neuromorphic hardware to investigate latency and power consumption and demonstrate the feasibility of our approach.
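A minimal sketch, with assumed shapes, of the bridge idea: sparse spike tensors from the SNN stage are embedded per time step, fused across time with attention at every spatial location, and emitted as a dense feature map for the ANN part (the paper's module is more elaborate):

```python
import torch
import torch.nn as nn

class SpikeBridge(nn.Module):
    """Turns sparse spike tensors into a dense feature map via temporal attention."""
    def __init__(self, channels=32, dim=64, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(channels, dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, spikes):                     # (B, T, C, H, W), values in {0, 1}
        B, T, C, H, W = spikes.shape
        x = self.embed(spikes.flatten(0, 1))       # per-time-step embedding
        x = x.view(B, T, -1, H * W).permute(0, 3, 1, 2).reshape(B * H * W, T, -1)
        fused = self.attn(x, x, x)[0].mean(dim=1)  # attend across time, then pool
        return fused.reshape(B, H * W, -1).permute(0, 2, 1).reshape(B, -1, H, W)

spikes = (torch.rand(2, 5, 32, 16, 16) > 0.9).float()   # sparse event spikes
print(SpikeBridge()(spikes).shape)                       # torch.Size([2, 64, 16, 16])
```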
https://arxiv.org/abs/2403.10173
User Interface (UI) understanding has been an increasingly popular topic over the last few years. So far, the focus has been almost exclusively on web and mobile applications. In this paper, we introduce the harder task of computer UI understanding. With the goal of enabling research in this field, we have generated a dataset of videos in which a user performs a sequence of actions, where each frame shows the desktop contents at that point in time. We also present a framework composed of a synthetic sample generation pipeline, which augments the dataset with relevant characteristics, and a contrastive learning method to classify images in the videos. We take advantage of the natural conditional, tree-like relationship among the images' characteristics to regularize the learning of the representations by dealing with multiple partial tasks simultaneously. Experimental results show that the proposed framework outperforms previously proposed hierarchical multi-label contrastive losses in fine-grained UI classification.
https://arxiv.org/abs/2403.10170
Deep learning (DL) models have been advancing automatic medical image analysis on various modalities, including echocardiography, by offering a comprehensive end-to-end training pipeline. This approach enables DL models to regress ejection fraction (EF) directly from 2D+time echocardiograms, resulting in superior performance. However, the end-to-end training pipeline makes the learned representations less explainable. The representations may also fail to capture the continuous relation among echocardiogram clips, indicating the existence of spurious correlations, which can negatively affect generalization. To mitigate this issue, we propose CoReEcho, a novel training framework emphasizing continuous representations tailored for direct EF regression. Our extensive experiments demonstrate that CoReEcho: 1) outperforms the current state-of-the-art (SOTA) on the largest echocardiography dataset (EchoNet-Dynamic) with an MAE of 3.90 and an R2 of 82.44, and 2) provides robust and generalizable features that transfer more effectively to related downstream tasks. The code is publicly available at this https URL.
https://arxiv.org/abs/2403.10164
Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to boost more intricate alignment. Extensive experiments validate the effectiveness of our approach, outperforming previous methods by significant margins of at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.
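A hedged sketch of the similarity-consistency idea: intra-modal similarity distributions act as soft targets for the cross-modal similarity distribution instead of binary contrastive labels; the symmetric KL form below is our choice of divergence, not necessarily the paper's:

```python
import torch
import torch.nn.functional as F

def similarity_consistency_loss(audio_emb, text_emb, tau=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    log_cross = F.log_softmax(a @ t.T / tau, dim=-1)  # audio-to-text similarities
    soft_a = F.softmax(a @ a.T / tau, dim=-1)         # intra-audio structure
    soft_t = F.softmax(t @ t.T / tau, dim=-1)         # intra-text structure
    # Pull the cross-modal distribution toward both intra-modal ones, so
    # fine-grained semantic differences between samples act as soft supervision.
    return 0.5 * (F.kl_div(log_cross, soft_a, reduction="batchmean")
                  + F.kl_div(log_cross, soft_t, reduction="batchmean"))

loss = similarity_consistency_loss(torch.randn(8, 512), torch.randn(8, 512))
```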
https://arxiv.org/abs/2403.10146
Human-centered dynamic scene understanding plays a pivotal role in enhancing the capability of robotic and autonomous systems, in which Video-based Human-Object Interaction (V-HOI) detection is a crucial task in semantic scene understanding, aimed at comprehensively understanding HOI relationships within a video to benefit the behavioral decisions of mobile robots and autonomous driving systems. Although previous V-HOI detection models have made significant strides in accurate detection on specific datasets, they still lack the general, human-like reasoning ability needed to effectively induce HOI relationships. In this study, we propose V-HOI Multi-LLMs Collaborated Reasoning (V-HOI MLCR), a novel framework consisting of a series of plug-and-play modules that can improve the performance of current V-HOI detection models by leveraging the strong reasoning ability of different off-the-shelf pre-trained large language models (LLMs). We design a two-stage collaboration system of different LLMs for the V-HOI task. Specifically, in the first stage, we design a Cross-Agents Reasoning scheme to have the LLMs conduct reasoning from different aspects. In the second stage, we perform Multi-LLMs Debate to obtain the final reasoning answer based on the complementary knowledge of the different LLMs. Additionally, we devise an auxiliary training strategy that utilizes CLIP, a large vision-language model, to enhance the base V-HOI model's discriminative ability so that it cooperates better with the LLMs. We validate the superiority of our design by demonstrating its effectiveness in improving the prediction accuracy of the base V-HOI model via reasoning from multiple perspectives.
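A schematic sketch of the two-stage collaboration; `ask` is a hypothetical stand-in for whatever LLM client is used, and the prompts, aspects, and voting rule are illustrative assumptions:

```python
from collections import Counter

def cross_agent_reasoning(ask, models, detection):
    """Stage 1: each LLM critiques the candidate HOI triplet from one aspect."""
    aspects = ["spatial plausibility", "temporal consistency", "common sense"]
    return [ask(m, f"Assess the {a} of the interaction {detection!r}. "
                   "Answer 'plausible' or 'implausible' with one reason.")
            for m, a in zip(models, aspects)]

def debate(ask, models, detection, opinions, rounds=1):
    """Stage 2: models see each other's opinions and revise; majority wins."""
    for _ in range(rounds):
        opinions = [ask(m, f"Other agents said {opinions!r} about {detection!r}. "
                           "Give your final verdict: 'plausible' or 'implausible'.")
                    for m in models]
    votes = Counter("plausible" if "implausible" not in o else "implausible"
                    for o in opinions)
    return votes.most_common(1)[0][0]

# Toy stand-in client so the sketch runs without any API:
toy_ask = lambda model, prompt: "plausible"
models = ["llm_a", "llm_b", "llm_c"]
det = ("person", "riding", "bicycle")
print(debate(toy_ask, models, det, cross_agent_reasoning(toy_ask, models, det)))
```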
https://arxiv.org/abs/2403.10107