Autonomous vehicles are gradually entering city roads today, with the help of high-definition maps (HDMaps). However, the reliance on HDMaps prevents autonomous vehicles from stepping into regions without this expensive digital infrastructure. This fact drives many researchers to study online HDMap generation algorithms, but the performance of these algorithms in far regions is still unsatisfactory. We present P-MapNet, in which the letter P highlights the fact that we focus on incorporating map priors to improve model performance. Specifically, we exploit priors in both SDMap and HDMap. On one hand, we extract weakly aligned SDMap from OpenStreetMap and encode it as an additional conditioning branch. Despite the misalignment challenge, our attention-based architecture adaptively attends to relevant SDMap skeletons and significantly improves performance. On the other hand, we exploit a masked autoencoder to capture the prior distribution of HDMap, which can serve as a refinement module to mitigate occlusions and artifacts. We benchmark on the nuScenes and Argoverse2 datasets. Through comprehensive experiments, we show that: (1) our SDMap prior can improve online map generation performance, using both rasterized (by up to $+18.73$ $\rm mIoU$) and vectorized (by up to $+8.50$ $\rm mAP$) output representations; (2) our HDMap prior can improve map perceptual metrics by up to $6.34\%$; (3) P-MapNet can be switched into different inference modes that cover different regions of the accuracy-efficiency trade-off landscape; (4) P-MapNet is a far-seeing solution that brings larger improvements at longer perception ranges. Codes and models are publicly available at this https URL.
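As a concrete illustration of the attention-based SDMap conditioning described above, here is a minimal PyTorch sketch. The module name, token shapes, and the single cross-attention layer with a residual connection are our assumptions for illustration, not P-MapNet's actual architecture:

```python
import torch
import torch.nn as nn

class SDMapCrossAttention(nn.Module):
    """Hypothetical sketch: BEV queries attend to rasterized SDMap skeleton
    tokens, letting the model exploit a weakly aligned prior."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bev_tokens, sd_tokens):
        # bev_tokens: (B, H*W, C) BEV features; sd_tokens: (B, N, C) SDMap features
        attended, _ = self.attn(bev_tokens, sd_tokens, sd_tokens)
        return self.norm(bev_tokens + attended)  # residual keeps the camera evidence

bev = torch.randn(2, 50 * 50, 256)
sd = torch.randn(2, 128, 256)
print(SDMapCrossAttention()(bev, sd).shape)  # torch.Size([2, 2500, 256])
```

Because the attention weights are learned, BEV queries can latch onto nearby SDMap skeletons even when the prior is spatially misaligned, which is the property the abstract relies on.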
https://arxiv.org/abs/2403.10521
Audiovisual emotion recognition (ER) in videos offers immense potential over unimodal approaches because it effectively leverages the inter- and intra-modal dependencies between the visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. The framework aims to exploit the complementary nature of audio and visual cues (facial expressions and vocal patterns) in videos, leading to superior performance compared to relying on a single modality alone. The proposed model uses separate backbones to capture intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods on ER tasks.
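A minimal sketch of what symmetric cross-attention between the two modality streams might look like is shown below; the paper's key-based variant and the backbone details are abstracted away, and the names and shapes are ours:

```python
import torch
import torch.nn as nn

class JointCrossModalBlock(nn.Module):
    """Illustrative sketch, not the authors' code: each modality queries the
    other, then the two streams are pooled and fused for emotion prediction."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio, visual):
        # audio: (B, Ta, C) from the audio backbone; visual: (B, Tv, C) from the visual one
        a, _ = self.a2v(audio, visual, visual)  # audio queries attend to visual keys/values
        v, _ = self.v2a(visual, audio, audio)   # visual queries attend to audio keys/values
        joint = torch.cat([a.mean(1), v.mean(1)], dim=-1)  # temporal pooling, then concat
        return self.fuse(joint)

print(JointCrossModalBlock()(torch.randn(2, 30, 512), torch.randn(2, 40, 512)).shape)
```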
https://arxiv.org/abs/2403.10488
Image segmentation is one of the most fundamental problems in computer vision and has drawn a lot of attention due to its vast applications in image understanding and autonomous driving. However, designing effective and efficient segmentation neural architectures is a labor-intensive process that may require many trials by human experts. In this paper, we address the challenge of efficiently integrating multi-head self-attention into high-resolution representation CNNs by leveraging architecture search. Manually replacing convolution layers with multi-head self-attention is non-trivial due to the costly memory overhead of maintaining high resolution. By contrast, we develop a multi-target multi-branch supernet method, which not only fully utilizes the advantages of high-resolution features, but also finds the proper locations for placing multi-head self-attention modules. Our search algorithm is optimized towards multiple objectives (e.g., latency and mIoU) and capable of finding architectures on the Pareto frontier with an arbitrary number of branches in a single search. We further present a series of models via the Hybrid Convolutional-Transformer Architecture Search (HyCTAS) method, which searches for the best hybrid combination of light-weight convolution layers and memory-efficient self-attention layers between branches of different resolutions and fuses them at high resolution for both efficiency and effectiveness. Extensive experiments demonstrate that HyCTAS outperforms previous methods on the semantic segmentation task. Code and models are available at \url{this https URL}.
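To make the multi-objective search target concrete, here is a tiny Pareto-frontier filter over (latency, mIoU) pairs; the candidate data are invented for illustration:

```python
def pareto_frontier(candidates):
    """Keep architectures that are not dominated on (latency, mIoU).
    candidates: list of (name, latency_ms, miou); lower latency and
    higher mIoU are better."""
    frontier = []
    for name, lat, miou in candidates:
        dominated = any(l <= lat and m >= miou and (l, m) != (lat, miou)
                        for _, l, m in candidates)
        if not dominated:
            frontier.append((name, lat, miou))
    return frontier

archs = [("A", 12.0, 78.1), ("B", 15.0, 77.0), ("C", 20.0, 79.4), ("D", 25.0, 79.0)]
print(pareto_frontier(archs))  # B and D are dominated; A and C survive
```

A supernet search like HyCTAS evaluates many sampled sub-architectures and keeps exactly this kind of non-dominated set, which is how a single search can yield models at several accuracy-latency operating points.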
https://arxiv.org/abs/2403.10413
Encouraged by the growing availability of pre-trained 2D diffusion models, image-to-3D generation by leveraging Score Distillation Sampling (SDS) is making remarkable progress. Most existing methods combine novel-view lifting from 2D diffusion models, which usually take the reference image as a condition, with hard L2 image supervision at the reference view. Yet heavily adhering to the image is prone to corrupting the inductive knowledge of the 2D diffusion model, frequently leading to flat or distorted 3D generations. In this work, we reexamine image-to-3D from a novel perspective and present Isotropic3D, an image-to-3D generation pipeline that takes only an image CLIP embedding as input. Isotropic3D allows the optimization to be isotropic w.r.t. the azimuth angle by resting solely on the SDS loss. The core of our framework lies in a two-stage diffusion model fine-tuning. First, we fine-tune a text-to-3D diffusion model by substituting its text encoder with an image encoder, by which the model preliminarily acquires image-to-image capabilities. Second, we perform fine-tuning using our Explicit Multi-view Attention (EMA), which combines noisy multi-view images with the noise-free reference image as an explicit condition. The CLIP embedding is sent to the diffusion model throughout the whole process, while reference images are discarded once fine-tuning is complete. As a result, with a single image CLIP embedding, Isotropic3D is capable of generating mutually consistent multi-view images as well as a 3D model with more symmetrical and neat content, well-proportioned geometry, richly colored texture, and less distortion than existing image-to-3D methods, while still preserving similarity to the reference image to a large extent. The project page is available at this https URL. The code and models are available at this https URL.
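For readers unfamiliar with SDS, a schematic PyTorch version of one optimization step follows; the `diffusion(noisy, t, cond)` noise-prediction interface, the `alphas_cumprod` schedule, and the weighting are assumed placeholders rather than Isotropic3D's code:

```python
import torch

def sds_grad(x, diffusion, clip_embed, t, alphas_cumprod):
    """Schematic Score Distillation Sampling step over a rendered image x.
    diffusion(noisy, t, cond) is a hypothetical noise-prediction model."""
    noise = torch.randn_like(x)
    a_t = alphas_cumprod[t]
    noisy = a_t.sqrt() * x + (1 - a_t).sqrt() * noise   # forward diffusion q(x_t | x)
    with torch.no_grad():
        eps_hat = diffusion(noisy, t, cond=clip_embed)  # conditioned only on the CLIP embedding
    w = 1.0 - a_t                                       # a common weighting choice
    return w * (eps_hat - noise)                        # gradient passed back to the 3D representation
```

The point of Isotropic3D is that `cond` is only the image CLIP embedding, never the reference pixels, so this gradient treats every azimuth angle the same way.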
https://arxiv.org/abs/2403.10395
Leveraging Transformer attention has led to great advancements in HDR deghosting. However, the intricate nature of self-attention introduces practical challenges, as existing state-of-the-art methods often demand high-end GPUs or exhibit slow inference speeds, especially for high-resolution images such as 2K. Striking an optimal balance between performance and latency remains a critical concern. In response, this work presents PASTA, a novel Progressively Aggregated Spatio-Temporal Alignment framework for HDR deghosting. Our approach achieves both effectiveness and efficiency by harnessing hierarchical representations during feature disentanglement. By utilizing diverse granularities within the hierarchical structure, our method substantially boosts computational speed and optimizes the HDR imaging workflow. In addition, we explore within-scale feature modeling with local and global attention, gradually merging and refining them in a coarse-to-fine fashion. Experimental results showcase PASTA's superiority over current SOTA methods in both visual quality and performance metrics, accompanied by a substantial 3-fold (x3) increase in inference speed.
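The progressive, coarse-to-fine aggregation can be illustrated schematically as follows; note that we substitute plain convolutions for the paper's local/global attention to keep the sketch short, so this only conveys the information flow:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineFusion(nn.Module):
    """Toy sketch of progressive aggregation: process a coarse level first,
    upsample, and refine at the finer scale (details differ from PASTA)."""
    def __init__(self, dim=64):
        super().__init__()
        self.coarse = nn.Conv2d(dim, dim, 3, padding=1)
        self.fine = nn.Conv2d(2 * dim, dim, 3, padding=1)

    def forward(self, feat):
        # feat: (B, C, H, W) aligned multi-exposure features
        low = self.coarse(F.avg_pool2d(feat, 2))  # coarse level: cheap global context
        up = F.interpolate(low, scale_factor=2, mode="bilinear", align_corners=False)
        return self.fine(torch.cat([feat, up], 1))  # merge coarse context into fine features

print(CoarseToFineFusion()(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```

Doing the expensive modeling at reduced resolutions and only refining at full resolution is where a speedup over full-resolution self-attention typically comes from.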
https://arxiv.org/abs/2403.10376
The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird's Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors from single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized only once for token selection or query initialization. In this paper, we present a single model termed SimPB, which simultaneously detects 2D objects in the perspective view and 3D objects in the BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, specifically designed for their respective detection tasks. A Dynamic Query Allocation module and an Adaptive Query Aggregation module are proposed to continuously update and refine the interaction between 2D and 3D results, in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention is utilized to strengthen the interaction among 2D queries within each camera group. In the experiments, we evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. Our code is available at: this https URL.
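A purely structural sketch of the hybrid decoder loop is given below; the real Dynamic Query Allocation projects 3D queries into each camera's perspective view, which we compress into a linear stand-in here, so treat this as a diagram in code rather than SimPB itself:

```python
import torch
import torch.nn as nn

class HybridDecoder(nn.Module):
    """Structural sketch only: alternate multi-view 2D and 3D decoder layers,
    cycling queries 3D -> 2D -> 3D (allocation/aggregation simplified)."""
    def __init__(self, dim=256, n2d=2, n3d=2):
        super().__init__()
        make = lambda: nn.TransformerDecoderLayer(dim, 8, batch_first=True)
        self.layers_2d = nn.ModuleList([make() for _ in range(n2d)])
        self.layers_3d = nn.ModuleList([make() for _ in range(n3d)])
        self.allocate = nn.Linear(dim, dim)   # stand-in for Dynamic Query Allocation
        self.aggregate = nn.Linear(dim, dim)  # stand-in for Adaptive Query Aggregation

    def forward(self, q3d, img_feats):
        # q3d: (B, Nq, C) 3D queries; img_feats: (B, T, C) flattened camera features
        q2d = self.allocate(q3d)              # 3D -> 2D: hand queries to the perspective views
        for layer in self.layers_2d:
            q2d = layer(q2d, img_feats)
        q3d = q3d + self.aggregate(q2d)       # 2D -> 3D: fold refined evidence back
        for layer in self.layers_3d:
            q3d = layer(q3d, img_feats)
        return q3d

print(HybridDecoder()(torch.randn(2, 100, 256), torch.randn(2, 600, 256)).shape)
```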
https://arxiv.org/abs/2403.10353
Transformers have demonstrated their effectiveness in image restoration tasks. Existing Transformer architectures typically comprise two essential components: multi-head self-attention and a feed-forward network (FFN). The former captures long-range pixel dependencies, while the latter enables the model to learn complex patterns and relationships in the data. Previous studies have demonstrated that FFNs are key-value memories \cite{geva2020transformer}, which are vital in modern Transformer architectures. In this paper, we conduct an empirical study to explore the potential of attention mechanisms without the FFN and provide novel structures demonstrating that removing the FFN is feasible for image restoration. Specifically, we propose Continuous Scaling Attention (\textbf{CSAttn}), a method that computes attention continuously in three stages without using an FFN. To achieve competitive performance, we propose a series of key components within the attention mechanism. Our designs provide a closer look at the attention mechanism and reveal that some simple operations can significantly affect model performance. We apply our \textbf{CSAttn} to several image restoration tasks and show that our model can outperform CNN-based and Transformer-based image restoration approaches.
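An FFN-free block in the spirit of CSAttn might look like the following: three successive attention stages, each residual path modulated by a learnable scale. The stage design and initialization are our guesses, not the paper's exact components:

```python
import torch
import torch.nn as nn

class AttentionOnlyBlock(nn.Module):
    """Minimal FFN-free block: stacked attention stages with learnable
    residual scaling; no feed-forward network anywhere."""
    def __init__(self, dim=48, heads=4, stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(stages)])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(stages)])
        self.scales = nn.ParameterList(
            [nn.Parameter(0.1 * torch.ones(dim)) for _ in range(stages)])

    def forward(self, x):
        # x: (B, N, C) flattened image tokens
        for attn, norm, scale in zip(self.stages, self.norms, self.scales):
            y = norm(x)
            y, _ = attn(y, y, y)
            x = x + scale * y  # continuous scaling of each attention stage
        return x

print(AttentionOnlyBlock()(torch.randn(2, 256, 48)).shape)  # torch.Size([2, 256, 48])
```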
https://arxiv.org/abs/2403.10336
Exploring and mining subtle yet distinctive features between sub-categories with similar appearances is crucial for fine-grained visual categorization (FGVC). However, less effort has been devoted to assessing the quality of extracted visual representations. Intuitively, the network may struggle to capture discriminative features from low-quality samples, which leads to a significant decline in FGVC performance. To tackle this challenge, we propose a weakly supervised Context-Semantic Quality Awareness Network (CSQA-Net) for FGVC. In this network, to model the spatial contextual relationship between rich part descriptors and global semantics for capturing more discriminative details within the object, we design a novel multi-part and multi-scale cross-attention (MPMSCA) module. Before feeding to the MPMSCA module, the part navigator is developed to address the scale confusion problems and accurately identify the local distinctive regions. Furthermore, we propose a generic multi-level semantic quality evaluation module (MLSQE) to progressively supervise and enhance hierarchical semantics from different levels of the backbone network. Finally, context-aware features from MPMSCA and semantically enhanced features from MLSQE are fed into the corresponding quality probing classifiers to evaluate their quality in real-time, thus boosting the discriminability of feature representations. Comprehensive experiments on four popular and highly competitive FGVC datasets demonstrate the superiority of the proposed CSQA-Net in comparison with the state-of-the-art methods.
https://arxiv.org/abs/2403.10298
Classical structure-based visual localization methods offer high accuracy but face trade-offs in terms of storage, speed, and privacy. A recent innovation, the keypoint scene coordinate regression (KSCR) method named D2S, addresses these issues by leveraging graph attention networks to enhance keypoint relationships and predict their 3D coordinates using a simple multilayer perceptron (MLP). Camera pose is then determined via PnP+RANSAC, using the established 2D-3D correspondences. While KSCR achieves competitive results, rivaling state-of-the-art image-retrieval methods like HLoc across multiple benchmarks, its performance is hindered when data samples are limited, owing to the deep learning model's reliance on extensive data. This paper addresses this challenge by introducing a pipeline for keypoint descriptor synthesis using Neural Radiance Fields (NeRF). By generating novel poses and feeding them into a trained NeRF model to create new views, our approach enhances KSCR's generalization capabilities in data-scarce environments. The proposed system can improve localization accuracy by up to 50\% while requiring only a fraction of the time for data synthesis. Furthermore, its modular design allows for the integration of multiple NeRFs, offering a versatile and efficient solution for visual localization. The implementation is publicly available at: this https URL.
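The final pose step the abstract mentions is standard; with OpenCV it looks like the snippet below (synthetic correspondences for illustration only):

```python
import cv2
import numpy as np

# Pose from 2D-3D matches, as in the pipeline's last stage (a standard OpenCV
# call, not the authors' code). pts3d: (N, 3) regressed scene coordinates;
# pts2d: (N, 2) keypoint locations; K: camera intrinsics.
pts3d = np.random.rand(100, 3).astype(np.float32)
pts2d = (np.random.rand(100, 2) * 640).astype(np.float32)
K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float32)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    pts3d, pts2d, K, distCoeffs=None, reprojectionError=8.0, iterationsCount=1000)
if ok:
    R, _ = cv2.Rodrigues(rvec)  # rotation matrix of the recovered camera pose
    print("inliers:", 0 if inliers is None else len(inliers))
```

The NeRF contribution sits upstream of this call: synthesized views give the coordinate regressor more training pairs, so the 2D-3D correspondences fed to RANSAC become more reliable in data-scarce scenes.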
https://arxiv.org/abs/2403.10297
Time-series data in real-world medical settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In such contexts, traditional sequence-based recurrent models struggle. To overcome this, researchers replace recurrent architectures with Neural ODE-based models to handle irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of moderate length and beyond. To mitigate this, we introduce the Rough Transformer, a variation of the Transformer model that operates on continuous-time representations of input sequences and incurs significantly reduced computational costs, which is critical for addressing the long-range dependencies common in medical contexts. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global dependencies in the input data, while remaining robust to changes in sequence length and sampling frequency. We find that Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the benefits of Neural ODE-based models using a fraction of the computational time and memory resources on synthetic and real-world time-series tasks.
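To make "path signatures" concrete, here is a standalone NumPy computation of the depth-2 signature of a piecewise-linear path; it depends only on the path itself, not on the sampling grid, which is why such features suit irregularly sampled series (our illustration, not the authors' code):

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of a path sampled at arbitrary times.
    path: (T, d) array. Returns the level-1 (d,) and level-2 (d, d) terms."""
    inc = np.diff(path, axis=0)       # increments dx_t between samples
    level1 = inc.sum(axis=0)          # total displacement
    # iterated integral over s < t, plus the within-segment term
    prefix = np.vstack([np.zeros(path.shape[1]), np.cumsum(inc, axis=0)[:-1]])
    level2 = prefix.T @ inc + 0.5 * (inc.T @ inc)
    return level1, level2

t = np.sort(np.random.rand(50))       # irregular sampling times
path = np.stack([np.sin(6 * t), np.cos(6 * t)], axis=1)
s1, s2 = signature_depth2(path)
print(s1.shape, s2.shape)             # (2,) (2, 2)
```

The Rough Transformer's multi-view signature attention builds attention features from terms like these, computed over local windows as well as the global path.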
https://arxiv.org/abs/2403.10288
Large-scale applications of Visual Place Recognition (VPR) require computationally efficient approaches. Further, a well-balanced combination of data-based and training-free approaches can decrease the required amount of training data and effort and can reduce the influence of distribution shifts between the training and application phases. This paper proposes a runtime and data-efficient hierarchical VPR pipeline that extends existing approaches and presents novel ideas. There are three main contributions: First, we propose Local Positional Graphs (LPG), a training-free and runtime-efficient approach to encode spatial context information of local image features. LPG can be combined with existing local feature detectors and descriptors and considerably improves the image-matching quality compared to existing techniques in our experiments. Second, we present Attentive Local SPED (ATLAS), an extension of our previous local features approach with an attention module that improves the feature quality while maintaining high data efficiency. The influence of the proposed modifications is evaluated in an extensive ablation study. Third, we present a hierarchical pipeline that exploits hyperdimensional computing to use the same local features as holistic HDC-descriptors for fast candidate selection and for candidate reranking. We combine all contributions in a runtime and data-efficient VPR pipeline that shows benefits over the state-of-the-art method Patch-NetVLAD on a large collection of standard place recognition datasets with 15$\%$ better performance in VPR accuracy, 54$\times$ faster feature comparison speed, and 55$\times$ less descriptor storage occupancy, making our method promising for real-world high-performance large-scale VPR in changing environments. Code will be made available with publication of this paper.
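"Hyperdimensional computing" in this context means binding local descriptors with position codes and bundling everything into one holistic vector; a toy NumPy version follows (the random projection, bipolar codes, and dimensionality are our choices, not the paper's exact scheme):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hypervector dimensionality

def holistic_descriptor(local_descs, pos_hvs, proj):
    """Project each local descriptor to a bipolar hypervector, bind (*)
    it with its position code, and bundle (+) all of them."""
    bundled = np.zeros(D)
    for desc, pos in zip(local_descs, pos_hvs):
        bundled += np.sign(proj @ desc) * pos
    return bundled

proj = rng.standard_normal((D, 128))                    # fixed projection for 128-d descriptors
descs = [rng.standard_normal(128) for _ in range(50)]
positions = [rng.choice([-1.0, 1.0], size=D) for _ in range(50)]
h1 = holistic_descriptor(descs, positions, proj)
h2 = holistic_descriptor(descs, positions, proj)
cos = h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2))
print(f"similarity of identical inputs: {cos:.2f}")     # 1.00
```

Comparing two images then costs one dot product between holistic vectors, which is where this style of pipeline gets its large feature-comparison speedups over per-patch matching.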
https://arxiv.org/abs/2403.10283
The swift evolution of Large-scale Models (LMs), whether language-focused or multi-modal, has garnered extensive attention in both academia and industry. Yet despite the surge of interest in this rapidly evolving area, systematic reviews of their capabilities and potential in distinct impactful scenarios remain scarce. This paper endeavours to help bridge this gap, offering a thorough examination of the current landscape of LM usage in complex game-playing scenarios and the challenges that remain open. Here, we seek to systematically review the existing architectures of LM-based Agents (LMAs) for games and summarize their commonalities, challenges, and other insights. Furthermore, we present our perspective on promising future research avenues for the advancement of LMs in games. We hope to assist researchers in gaining a clear understanding of the field and to generate more interest in this highly impactful research direction. A corresponding resource, continuously updated, can be found in our GitHub repository.
https://arxiv.org/abs/2403.10249
Reconstructing detailed 3D objects from single-view images remains a challenging task due to the limited information available. In this paper, we introduce FDGaussian, a novel two-stage framework for single-image 3D reconstruction. Recent methods typically utilize pre-trained 2D diffusion models to generate plausible novel views from the input image, yet they encounter issues with either multi-view inconsistency or a lack of geometric fidelity. To overcome these challenges, we propose an orthogonal plane decomposition mechanism to extract 3D geometric features from the 2D input, enabling the generation of consistent multi-view images. Moreover, we further accelerate state-of-the-art Gaussian Splatting by incorporating epipolar attention to fuse images from different viewpoints. We demonstrate that FDGaussian generates images with high consistency across different views and reconstructs high-quality 3D objects, both qualitatively and quantitatively. More examples can be found at our website: this https URL.
https://arxiv.org/abs/2403.10242
In recent research, significant attention has been devoted to the open-vocabulary object detection task, which aims to generalize beyond the limited number of classes labeled during training and detect objects described by arbitrary category names at inference. Compared with conventional object detection, open-vocabulary object detection largely extends the range of detectable categories. However, it relies on calculating the similarity between image regions and a set of arbitrary category names with a pretrained vision-and-language model. This implies that, despite its open-set nature, the task still needs predefined object categories during the inference stage. This raises the question: what if we do not have exact knowledge of the object categories during inference? In this paper, we refer to this new setting as generative open-ended object detection, which is a more general and practical problem. To address it, we formulate object detection as a generative problem and propose a simple framework named GenerateU, which can detect dense objects and generate their names in a free-form way. In particular, we employ Deformable DETR as a region proposal generator, with a language model translating visual regions into object names. To assess the free-form object detection task, we introduce an evaluation method designed to quantitatively measure the performance of generative outcomes. Extensive experiments demonstrate the strong zero-shot detection performance of GenerateU. For example, on the LVIS dataset, GenerateU achieves results comparable to the open-vocabulary object detection method GLIP, even though the category names are not seen by GenerateU during inference. Code is available at: this http URL.
https://arxiv.org/abs/2403.10191
Event cameras offer high temporal resolution and dynamic range with minimal motion blur, making them promising for object detection tasks. While Spiking Neural Networks (SNNs) are a natural match for event-based sensory data and enable ultra-energy-efficient, low-latency inference on neuromorphic hardware, Artificial Neural Networks (ANNs) tend to display more stable training dynamics and faster convergence, resulting in greater task performance. Hybrid SNN-ANN approaches are a promising alternative that makes it possible to leverage the strengths of both SNN and ANN architectures. In this work, we introduce the first Hybrid Attention-based SNN-ANN backbone for object detection using event cameras. We propose a novel attention-based SNN-ANN bridge module that captures sparse spatial and temporal relations from the SNN layer and converts them into dense feature maps for the ANN part of the backbone. Experimental results demonstrate that our proposed method surpasses baseline hybrid and SNN-based approaches by significant margins, with results comparable to existing ANN-based methods. Extensive ablation studies confirm the effectiveness of our proposed modules and architectural choices. These results pave the way toward a hybrid SNN-ANN architecture that achieves ANN-like performance at a drastically reduced parameter budget. We implemented the SNN blocks on digital neuromorphic hardware to investigate latency and power consumption and demonstrate the feasibility of our approach.
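One plausible reading of such a bridge module is sketched below: learned queries attend over the spike-time axis at each spatial location and emit a dense feature map. The single-query design and the shapes are assumptions; the paper's module is more involved:

```python
import torch
import torch.nn as nn

class SpikeToDenseBridge(nn.Module):
    """Conceptual sketch: collapse sparse spike sequences into dense
    feature maps with attention over the time axis."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, spikes):
        # spikes: (B, T, C, H, W) binary SNN outputs over T timesteps
        B, T, C, H, W = spikes.shape
        tokens = spikes.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        q = self.query.expand(B * H * W, 1, C)
        dense, _ = self.attn(q, tokens, tokens)  # attention collapses the time axis
        return self.proj(dense).reshape(B, H, W, C).permute(0, 3, 1, 2)

out = SpikeToDenseBridge()(torch.randint(0, 2, (2, 8, 64, 16, 16)).float())
print(out.shape)  # torch.Size([2, 64, 16, 16])
```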
https://arxiv.org/abs/2403.10173
Audio-text retrieval (ATR), which retrieves a relevant caption given an audio clip (A2T) and vice versa (T2A), has recently attracted much research attention. Existing methods typically aggregate information from each modality into a single vector for matching, but this sacrifices local details and can hardly capture intricate relationships within and between modalities. Furthermore, current ATR datasets lack comprehensive alignment information, and simple binary contrastive learning labels overlook the measurement of fine-grained semantic differences between samples. To counter these challenges, we present a novel ATR framework that comprehensively captures the matching relationships of multimodal information from different perspectives and finer granularities. Specifically, a fine-grained alignment method is introduced, achieving a more detail-oriented matching through a multiscale process from local to global levels to capture meticulous cross-modal relationships. In addition, we pioneer the application of cross-modal similarity consistency, leveraging intra-modal similarity relationships as soft supervision to boost more intricate alignment. Extensive experiments validate the effectiveness of our approach, outperforming previous methods by significant margins of at least 3.9% (T2A) / 6.9% (A2T) R@1 on the AudioCaps dataset and 2.9% (T2A) / 5.4% (A2T) R@1 on the Clotho dataset.
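The cross-modal similarity consistency idea can be sketched as matching the cross-modal similarity distribution to the intra-modal ones with a KL term; the temperature and the symmetric averaging below are our guesses, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def similarity_consistency_loss(a_emb, t_emb, tau=0.05):
    """Sketch of soft supervision: intra-modal similarity distributions
    act as soft targets for the cross-modal similarities."""
    a = F.normalize(a_emb, dim=-1)  # (B, C) audio embeddings
    t = F.normalize(t_emb, dim=-1)  # (B, C) text embeddings
    cross = a @ t.T / tau                   # audio-to-text similarities
    intra_a = (a @ a.T / tau).softmax(-1)   # soft targets from audio-audio structure
    intra_t = (t @ t.T / tau).softmax(-1)   # soft targets from text-text structure
    loss_a = F.kl_div(cross.log_softmax(-1), intra_t, reduction="batchmean")
    loss_t = F.kl_div(cross.T.log_softmax(-1), intra_a, reduction="batchmean")
    return 0.5 * (loss_a + loss_t)

print(similarity_consistency_loss(torch.randn(8, 256), torch.randn(8, 256)))
```

Unlike binary contrastive labels, these soft targets grade how similar two different samples are, which is exactly the fine-grained semantic information the abstract says plain contrastive learning discards.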
https://arxiv.org/abs/2403.10146
Self-report measures (e.g., Likert scales) are widely used to evaluate subjective health perceptions. Recently, the visual analog scale (VAS), a slider-based scale, has become popular owing to its ability to precisely and easily assess how people feel. These data can be influenced by the response style (RS), a user-dependent systematic tendency that occurs regardless of questionnaire instructions. Despite its importance, especially in between-individual analysis, little attention has been paid to handling the RS in the VAS (denoted here as the response profile (RP)), as the VAS is mainly used for within-individual monitoring, where it is less affected by the RP. However, VAS measurements often require repeated self-reports of the same questionnaire items, making it difficult to apply conventional methods developed for Likert scales. In this study, we developed a novel RP characterization method for various types of repeatedly measured VAS data. This approach involves modeling the RP as distributional parameters $\theta$ through a mixture of RS-like distributions, and addressing the issue of unbalanced data through bootstrap sampling of the repeated measures. We assessed the effectiveness of the proposed method using simulated pseudo-data and an actual dataset from an empirical study. The assessment of parameter recovery showed that our method accurately estimated the RP parameter $\theta$, demonstrating its robustness. Moreover, applying our method to an actual VAS dataset revealed the presence of individual RP heterogeneity, even in repeated VAS measurements, similar to the findings for Likert scales. Our proposed method enables RP-heterogeneity-aware VAS data analysis, analogous to Likert-scale data analysis.
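The unbalanced-data fix can be illustrated with per-person bootstrap resampling to a common size; the summary statistics below are a simple stand-in for the paper's actual mixture fit of $\theta$, so this only shows the resampling scaffolding:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_profile(responses_by_person, n_per_person=200, n_boot=50):
    """Sketch: resample each person's repeated VAS responses (in [0, 1])
    to a common size before estimating profile parameters."""
    estimates = {}
    for pid, vals in responses_by_person.items():
        vals = np.asarray(vals, dtype=float)
        boots = [rng.choice(vals, size=n_per_person, replace=True)
                 for _ in range(n_boot)]
        # stand-in for the mixture fit: summarize each bootstrap sample, then average
        theta = np.mean([(b.mean(), b.std(), (b > 0.9).mean()) for b in boots], axis=0)
        estimates[pid] = dict(zip(["mean", "spread", "extreme_rate"], theta))
    return estimates

data = {"p1": rng.beta(2, 2, 30), "p2": rng.beta(8, 1, 120)}  # unbalanced counts
print(bootstrap_profile(data)["p2"])
```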
https://arxiv.org/abs/2403.10136
Landslides are among the most destructive natural disasters in the world, posing a serious threat to human life and safety. The development of foundation models has provided a new research paradigm for large-scale landslide detection. The Segment Anything Model (SAM) has garnered widespread attention in the field of image segmentation. However, our experiments found that SAM performs poorly on the task of landslide segmentation. We propose TransLandSeg, a transfer learning approach for landslide semantic segmentation based on a vision foundation model (VFM). TransLandSeg outperforms traditional semantic segmentation models on both the Landslide4Sense dataset and the Bijie landslide dataset. Our proposed adaptive transfer learning (ATL) architecture enables the powerful segmentation capability of SAM to be transferred to landslide detection by training only 1.3% of SAM's parameters, which greatly improves the training efficiency of the model. Finally, we conducted ablation experiments on models with different ATL structures and concluded that the deployment location and residual connection of the ATL play an important role in TransLandSeg's accuracy improvement.
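ATL is an adapter-style approach; the generic family it belongs to looks like the bottleneck module below, trained with the backbone frozen (sizes, placement, and the zero-initialization are our assumptions, not TransLandSeg's exact design):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter with a residual connection; zero-init on
    the up-projection so training starts from the frozen model's behavior."""
    def __init__(self, dim=768, bottleneck=48):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual keeps the frozen path

# Freeze the foundation model and train only the adapters, so the trainable
# fraction stays small (the paper reports 1.3% of SAM's parameters):
# for p in sam_encoder.parameters():  # sam_encoder: hypothetical frozen backbone
#     p.requires_grad = False
```

The abstract's ablation finding maps directly onto this sketch: where the adapter is inserted, and whether its residual connection is kept, are the two choices reported as decisive.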
https://arxiv.org/abs/2403.10127
Early diagnosis of Alzheimer's Disease (AD) is very important for subsequent medical treatment, and eye movements under special visual stimuli may serve as a potential non-invasive biomarker for detecting cognitive abnormalities in AD patients. In this paper, we propose a Depth-induced Saliency Comparison Network (DISCN) for eye movement analysis, which may be used to diagnose Alzheimer's disease. In DISCN, a salient attention module fuses normal eye movements with the RGB and depth maps of visual stimuli using hierarchical salient attention (SAA) to produce comprehensive saliency maps, which contain information from both the visual stimuli and normal eye movement behaviors. In addition, we introduce a serial attention module (SEA) to emphasize the most abnormal eye movement behaviors, reducing personal bias for a more robust result. According to our experiments, DISCN achieves consistent validity in classifying the eye movements of AD patients versus normal controls.
https://arxiv.org/abs/2403.10124
In the wake of the global spread of monkeypox, accurate disease recognition has become crucial. This study introduces an improved SE-InceptionV3 model, embedding the SENet module and incorporating L2 regularization into the InceptionV3 framework to enhance monkeypox disease detection. Utilizing the Kaggle monkeypox dataset, which includes images of monkeypox and similar skin conditions, our model achieves a noteworthy accuracy of 96.71% on the test set, outperforming conventional methods and deep learning models. The SENet module's channel attention mechanism significantly elevates feature representation, while L2 regularization ensures robust generalization. Extensive experiments validate the model's superiority in precision, recall, and F1 score, highlighting its effectiveness in differentiating monkeypox lesions in diverse and complex cases. The study not only provides insights into the application of advanced CNN architectures in medical diagnostics but also opens avenues for further research into model optimization and hyperparameter tuning for enhanced disease recognition. this https URL
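The SENet module itself is standard; a compact version as it might be embedded into InceptionV3 is shown below, with L2 regularization applied the usual way via optimizer weight decay (the reduction ratio is our assumption):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard Squeeze-and-Excitation channel attention."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        # x: (B, C, H, W); squeeze spatially, then reweight each channel
        w = self.fc(x.mean(dim=(2, 3))).unsqueeze(-1).unsqueeze(-1)
        return x * w

print(SEBlock(64)(torch.randn(2, 64, 8, 8)).shape)  # torch.Size([2, 64, 8, 8])

# L2 regularization is typically realized as weight decay in the optimizer:
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=1e-4)
```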
https://arxiv.org/abs/2403.10087