The transformer structure employed in large language models (LLMs), as a specialized category of deep neural networks (DNNs) featuring attention mechanisms, stands out for its ability to identify and highlight the most relevant aspects of input data. This capability is particularly beneficial for a variety of communication challenges, notably in semantic communication, where proper encoding of the relevant data is critical, especially in systems with limited bandwidth. In this work, we employ vision transformers specifically for the compression and compact representation of the input image, with the goal of preserving semantic information throughout the transmission process. Using the attention mechanism inherent in transformers, we create an attention mask. This mask effectively prioritizes critical segments of images for transmission, ensuring that the reconstruction phase focuses on key objects highlighted by the mask. Our methodology significantly improves the quality of semantic communication and optimizes bandwidth usage by encoding different parts of the data according to their semantic information content, thus enhancing overall efficiency. We evaluate the effectiveness of our proposed framework on the TinyImageNet dataset, focusing on both reconstruction quality and accuracy. Our evaluation results demonstrate that the framework successfully preserves semantic information even when only a fraction of the encoded data is transmitted, in accordance with the intended compression rates.
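A minimal sketch of the masking idea: rank ViT patch tokens by CLS-token attention and keep only the most attended fraction for transmission. This is not the authors' released code; the function names, shapes, and keep ratio below are illustrative assumptions.

```python
import torch

def select_patches_by_attention(patch_tokens, cls_attn, keep_ratio=0.25):
    """patch_tokens: (B, N, D) encoded patch embeddings.
    cls_attn: (B, N) attention from the CLS token to each patch
    (e.g., averaged over heads of the last encoder layer).
    Returns the kept tokens and their indices (sent as side information)."""
    k = max(1, int(keep_ratio * patch_tokens.shape[1]))
    _, idx = cls_attn.topk(k, dim=1)                       # most-attended patches
    kept = torch.gather(
        patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1]))
    return kept, idx

# toy usage: 196 patches (224x224 image, 16x16 patches), 768-dim tokens
tokens = torch.randn(2, 196, 768)
attn = torch.softmax(torch.randn(2, 196), dim=-1)
kept, idx = select_patches_by_attention(tokens, attn, keep_ratio=0.25)
print(kept.shape, idx.shape)   # torch.Size([2, 49, 768]) torch.Size([2, 49])
```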
https://arxiv.org/abs/2405.01521
Computer-aided segmentation methods can assist medical personnel in improving diagnostic outcomes. While recent advancements like UNet and its variants have shown promise, they face a critical challenge: balancing accuracy with computational efficiency. Shallow encoder architectures in UNets often struggle to capture crucial spatial features, leading to inaccurate and sparse segmentation. To address this limitation, we propose a novel \underline{P}rogressive \underline{A}ttention based \underline{M}obile \underline{UNet} (\underline{PAM-UNet}) architecture. The inverted residual (IR) blocks in PAM-UNet help maintain a lightweight framework, while layerwise \textit{Progressive Luong Attention} ($\mathcal{PLA}$) promotes precise segmentation by directing attention toward regions of interest during synthesis. Our approach prioritizes both accuracy and speed, achieving a commendable balance with a mean IoU of 74.65 and a Dice score of 82.87, while requiring only 1.32 floating-point operations per second (FLOPS) on the Liver Tumor Segmentation Benchmark (LiTS) 2017 dataset. These results highlight the importance of developing efficient segmentation models to accelerate the adoption of AI in clinical practice.
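A hedged sketch of the two ingredients named in the abstract: a MobileNetV2-style inverted-residual block and a Luong-style multiplicative attention gate applied to a skip connection before concatenation. The gate form, shapes, and names are assumptions, not the paper's exact layers.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c, expand=4):
        super().__init__()
        h = c * expand
        self.block = nn.Sequential(
            nn.Conv2d(c, h, 1, bias=False), nn.BatchNorm2d(h), nn.ReLU6(inplace=True),
            nn.Conv2d(h, h, 3, padding=1, groups=h, bias=False),   # depthwise conv
            nn.BatchNorm2d(h), nn.ReLU6(inplace=True),
            nn.Conv2d(h, c, 1, bias=False), nn.BatchNorm2d(c))
    def forward(self, x):
        return x + self.block(x)

class LuongGate(nn.Module):
    """General (multiplicative) attention: score = <W * skip, decoder> per pixel."""
    def __init__(self, c):
        super().__init__()
        self.W = nn.Conv2d(c, c, 1, bias=False)
    def forward(self, skip, dec):
        score = (self.W(skip) * dec).sum(dim=1, keepdim=True)      # (B, 1, H, W)
        return skip * torch.sigmoid(score)                         # gated skip features

x = torch.randn(1, 32, 64, 64)
print(InvertedResidual(32)(x).shape, LuongGate(32)(x, x).shape)
```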
https://arxiv.org/abs/2405.01503
Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Building on T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing unintended regions beyond the target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. By dynamically updating the tokens corresponding to noun words in the textual input, we compel the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing of particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively. The code will be released at this https URL
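One possible reading of the localization prior, sketched below under assumptions (not the released LocInv code): at a denoising step, the cross-attention map of a noun token is pulled toward a user-provided mask by a few gradient updates on that token's embedding. The toy attention generator, loss, and step size are all illustrative.

```python
import torch
import torch.nn.functional as F

def refine_noun_token(attn_fn, token_emb, mask, lr=0.1, steps=3):
    """attn_fn: maps a (D,) token embedding to an (H, W) cross-attention map;
    mask: (H, W) binary localization prior (segmentation or box)."""
    emb = token_emb.clone().requires_grad_(True)
    for _ in range(steps):
        attn = attn_fn(emb)
        loss = F.mse_loss(attn / (attn.max() + 1e-8), mask)   # align map with prior
        (g,) = torch.autograd.grad(loss, emb)
        emb = (emb - lr * g).detach().requires_grad_(True)
    return emb.detach()

# toy stand-in for the UNet's cross-attention map of one token
W = torch.randn(16 * 16, 32)
attn_fn = lambda e: torch.sigmoid(W @ e).reshape(16, 16)
mask = torch.zeros(16, 16)
mask[4:12, 4:12] = 1.0
print(refine_noun_token(attn_fn, torch.randn(32), mask).shape)   # torch.Size([32])
```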
https://arxiv.org/abs/2405.01496
Is the Text to Motion model robust? Recent advancements in Text to Motion models primarily stem from more accurate predictions of specific actions. However, the text modality typically relies solely on pre-trained Contrastive Language-Image Pretraining (CLIP) models. Our research has uncovered a significant issue with the text-to-motion model: its predictions often exhibit inconsistent outputs, resulting in vastly different or even incorrect poses when presented with semantically similar or identical text inputs. In this paper, we analyze the underlying causes of this instability, establishing a clear link between the unpredictability of model outputs and the erratic attention patterns of the text encoder module. Consequently, we introduce a formal framework aimed at addressing this issue, termed the Stable Text-to-Motion Framework (SATO). SATO consists of three modules dedicated, respectively, to stable attention, stable prediction, and maintaining the trade-off between accuracy and robustness. We present a methodology for constructing a SATO that satisfies the stability of attention and prediction. To verify the stability of the model, we introduce a new textual synonym perturbation dataset based on HumanML3D and KIT-ML. Results show that SATO is significantly more stable against synonyms and other slight perturbations while maintaining high accuracy.
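A rough sketch of the stability idea under assumptions (not the released SATO code): penalize divergence between the text-encoder attention and the predicted motion for an original prompt versus a synonym-perturbed prompt, on top of the usual task loss. The loss forms and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def stability_losses(attn_orig, attn_pert, motion_orig, motion_pert):
    """attn_*: (B, L) attention distributions over text tokens (rows sum to 1).
    motion_*: (B, T, D) predicted pose sequences."""
    attn_loss = F.kl_div(attn_pert.clamp_min(1e-8).log(), attn_orig,
                         reduction="batchmean")          # stable attention term
    pred_loss = F.l1_loss(motion_pert, motion_orig)      # stable prediction term
    return attn_loss, pred_loss

a = torch.softmax(torch.randn(4, 20), dim=-1)
b = torch.softmax(torch.randn(4, 20), dim=-1)
m1, m2 = torch.randn(4, 60, 263), torch.randn(4, 60, 263)
print(stability_losses(a, b, m1, m2))
```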
https://arxiv.org/abs/2405.01461
For recent diffusion-based generative models, maintaining consistent content across a series of generated images, especially those containing subjects and complex details, presents a significant challenge. In this paper, we propose a new way of computing self-attention, termed Consistent Self-Attention, that significantly boosts the consistency between the generated images and augments prevalent pretrained diffusion-based text-to-image models in a zero-shot manner. To extend our method to long-range video generation, we further introduce a novel semantic-space temporal motion prediction module, named Semantic Motion Predictor. It is trained to estimate the motion conditions between two provided images in the semantic space. This module converts the generated sequence of images into videos with smooth transitions and consistent subjects that are significantly more stable than modules based only on latent spaces, especially in the context of long video generation. By merging these two novel components, our framework, referred to as StoryDiffusion, can describe a text-based story with consistent images or videos encompassing a rich variety of contents. The proposed StoryDiffusion encompasses pioneering explorations in visual story generation in the form of images and videos, which we hope will inspire more research on architectural modifications. Our code is made publicly available at this https URL.
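A hedged sketch of the core trick described in the abstract: during self-attention, each image in a story batch also attends to tokens randomly sampled from the other images, encouraging consistent subjects. Projection matrices are omitted for brevity, and the sampling rate, shapes, and names are assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def consistent_self_attention(x, num_heads=8, sample_ratio=0.3):
    """x: (B, N, D) tokens of B images generated jointly for one story."""
    B, N, D = x.shape
    k = int(sample_ratio * N)
    idx = torch.randint(0, N, (k,))
    # sample reference tokens from every image and share them across the batch
    shared = x[:, idx].reshape(1, B * k, D).expand(B, -1, -1)    # (B, B*k, D)
    kv = torch.cat([x, shared], dim=1)                           # extended keys/values
    q = x.reshape(B, N, num_heads, D // num_heads).transpose(1, 2)
    kvh = kv.reshape(B, -1, num_heads, D // num_heads).transpose(1, 2)
    out = F.scaled_dot_product_attention(q, kvh, kvh)            # keys == values here
    return out.transpose(1, 2).reshape(B, N, D)

tokens = torch.randn(4, 256, 320)   # 4 story frames
print(consistent_self_attention(tokens).shape)   # torch.Size([4, 256, 320])
```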
https://arxiv.org/abs/2405.01434
Large 2D vision-language models (2D-LLMs) have gained significant attention by bridging Large Language Models (LLMs) with images using a simple projector. Inspired by their success, large 3D point cloud-language models (3D-LLMs) also integrate point clouds into LLMs. However, directly aligning point clouds with LLMs requires expensive training, typically hundreds of GPU-hours on A100s, which hinders the development of 3D-LLMs. In this paper, we introduce MiniGPT-3D, an efficient and powerful 3D-LLM that achieves multiple SOTA results while training for only 27 hours on one RTX 3090. Specifically, we propose to align 3D point clouds with LLMs using 2D priors from 2D-LLMs, which leverages the similarity between 2D and 3D visual information. We introduce a novel four-stage training strategy for modality alignment in a cascaded way, and a mixture of query experts module to adaptively aggregate features with high efficiency. Moreover, we utilize the parameter-efficient fine-tuning methods LoRA and Norm fine-tuning, resulting in only 47.8M learnable parameters, up to 260x fewer than existing methods. Extensive experiments show that MiniGPT-3D achieves SOTA on 3D object classification and captioning tasks at significantly lower training cost. Notably, MiniGPT-3D gains an 8.12 increase in GPT-4 evaluation score on the challenging object captioning task compared to ShapeLLM-13B, while the latter costs 160 total GPU-hours on 8 A800s. We are the first to explore efficient 3D-LLMs, offering new insights to the community. Code and weights are available at this https URL.
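A sketch, under assumptions, of what a "mixture of query experts" could look like: several learnable query sets attend to point-cloud tokens and a soft gate mixes their outputs. Dimensions, the gating form, and module names are illustrative, not MiniGPT-3D's exact design.

```python
import torch
import torch.nn as nn

class MixtureOfQueryExperts(nn.Module):
    def __init__(self, dim=512, num_queries=32, num_experts=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_experts, num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, feats):                         # feats: (B, N, dim) point tokens
        B = feats.shape[0]
        pooled = feats.mean(dim=1)                    # (B, dim) summary for gating
        w = torch.softmax(self.gate(pooled), dim=-1)  # (B, E) expert weights
        outs = []
        for e in range(self.queries.shape[0]):
            q = self.queries[e].unsqueeze(0).expand(B, -1, -1)
            out, _ = self.attn(q, feats, feats)       # (B, Q, dim) per expert
            outs.append(out)
        outs = torch.stack(outs, dim=1)               # (B, E, Q, dim)
        return (w[:, :, None, None] * outs).sum(dim=1)

print(MixtureOfQueryExperts()(torch.randn(2, 1024, 512)).shape)  # (2, 32, 512)
```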
https://arxiv.org/abs/2405.01413
Recent works in dataset distillation seek to minimize training expenses by generating a condensed synthetic dataset that encapsulates the information present in a larger real dataset. These approaches ultimately aim to attain test accuracy levels akin to those achieved by models trained on the entirety of the original dataset. Previous studies in feature and distribution matching have achieved significant results without incurring the costs of bi-level optimization in the distillation process. Despite their convincing efficiency, many of these methods suffer from marginal downstream performance improvements, limited distillation of contextual information, and subpar cross-architecture generalization. To address these challenges in dataset distillation, we propose the ATtentiOn Mixer (ATOM) module to efficiently distill large datasets using a mixture of channel-wise and spatial-wise attention in the feature matching process. Spatial-wise attention helps guide the learning process based on the consistent localization of classes in their respective images, allowing for distillation from a broader receptive field. Meanwhile, channel-wise attention captures the contextual information associated with the class itself, thus making the synthetic image more informative for training. By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets, including CIFAR10/100 and TinyImagenet. Notably, our method significantly improves performance in scenarios with a low number of images per class, thereby enhancing its potential. Furthermore, the improvements carry over to cross-architecture evaluation and applications such as neural architecture search.
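A minimal sketch of mixing channel-wise and spatial-wise attention before feature matching between real and synthetic batches; the pooling and gating choices and the plain mean-matching loss below are assumptions for illustration, not the exact ATOM formulation.

```python
import torch
import torch.nn.functional as F

def attention_mixed_features(feat):
    """feat: (B, C, H, W) intermediate features from a feature extractor."""
    ch = torch.sigmoid(feat.mean(dim=(2, 3)))            # (B, C) channel attention
    sp = torch.sigmoid(feat.mean(dim=1, keepdim=True))   # (B, 1, H, W) spatial attention
    mixed = feat * ch[:, :, None, None] + feat * sp
    return mixed.flatten(1)                              # (B, C*H*W)

def atom_matching_loss(real_feat, syn_feat):
    # match per-class mean embeddings of the attention-mixed features
    return F.mse_loss(attention_mixed_features(syn_feat).mean(0),
                      attention_mixed_features(real_feat).mean(0))

print(atom_matching_loss(torch.randn(64, 128, 8, 8), torch.randn(10, 128, 8, 8)))
```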
https://arxiv.org/abs/2405.01373
Action recognition has become one of the popular research topics in computer vision. Various methods based on convolutional networks and self-attention mechanisms such as Transformers address the spatial and temporal dimensions of action recognition tasks and achieve competitive performance. However, these methods lack a guarantee of the correctness of the action subject that the models attend to, i.e., how to ensure that an action recognition model focuses on the proper action subject to make a reasonable action prediction. In this paper, we propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos using Directed Gromov-Wasserstein Discrepancy. Furthermore, our approach applies the idea of Neural Radiance Fields to implicitly render features from novel views when training on single-view datasets. The contributions of this work are therefore three-fold. Firstly, we introduce multi-view attention consistency to address the problem of reasonable prediction in action recognition. Secondly, we define a new metric for multi-view consistent attention using Directed Gromov-Wasserstein Discrepancy. Thirdly, we build an action recognition model based on Video Transformers and Neural Radiance Fields. Compared to recent action recognition methods, the proposed approach achieves state-of-the-art results on three large-scale datasets, i.e., Jester, Something-Something V2, and Kinetics-400.
https://arxiv.org/abs/2405.01337
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (InterCTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID). Results are compared to the current best performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN). An optimal InterCTC setting is initially established using a Conformer encoder. This setting is then used to train a model with an E-branchformer encoder and the performance of both architectures are compared. A multi-task fine-tuning approach is adopted for language model (LM) shallow fusion. The experiments yielded an improvement in DID accuracy of 10.8% relative to a baseline ECAPA-TDNN, and WER performance approaching the TDNN-HMM model. This multi-task approach emerges as a promising strategy for Irish low-resource ASR and DID.
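For reference, the Intermediate CTC objective commonly interpolates the final-layer CTC loss with CTC losses computed from intermediate encoder layers; the sketch below shows that standard form, with the interpolation weight chosen as an illustrative assumption rather than the paper's tuned setting.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def interctc_loss(final_logp, inter_logps, targets, in_lens, tgt_lens, w=0.3):
    """final_logp: (T, B, V) log-probs from the last encoder layer;
    inter_logps: list of (T, B, V) log-probs from intermediate layers."""
    loss_final = ctc(final_logp, targets, in_lens, tgt_lens)
    loss_inter = sum(ctc(lp, targets, in_lens, tgt_lens)
                     for lp in inter_logps) / len(inter_logps)
    return (1 - w) * loss_final + w * loss_inter

T, B, V = 50, 2, 30
logp = torch.randn(T, B, V).log_softmax(-1)
inter = [torch.randn(T, B, V).log_softmax(-1)]
tgt = torch.randint(1, V, (B, 12))
print(interctc_loss(logp, inter, tgt, torch.full((B,), T), torch.full((B,), 12)))
```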
https://arxiv.org/abs/2405.01293
This paper introduces a trajectory prediction model tailored for autonomous driving, focusing on capturing complex interactions in dynamic traffic scenarios without reliance on high-definition maps. The model, termed MFTraj, harnesses historical trajectory data combined with a novel dynamic geometric graph-based behavior-aware module. At its core, an adaptive structure-aware interactive graph convolutional network captures both positional and behavioral features of road users, preserving spatial-temporal intricacies. Enhanced by a linear attention mechanism, the model achieves computational efficiency and reduced parameter overhead. Evaluations on the Argoverse, NGSIM, HighD, and MoCAD datasets underscore MFTraj's robustness and adaptability, outperforming numerous benchmarks even in data-challenged scenarios without the need for additional information such as HD maps or vectorized maps. Importantly, it maintains competitive performance even in scenarios with substantial missing data, on par with most existing state-of-the-art models. The results and methodology suggest a significant advancement in autonomous driving trajectory prediction, paving the way for safer and more efficient autonomous systems.
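The linear attention mentioned for efficiency is typically a kernelized attention whose cost grows linearly in the sequence length; below is a standard sketch of that idea (the elu+1 feature map is a common choice and an assumption here, not necessarily MFTraj's exact variant).

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """q, k, v: (B, N, D). Cost is O(N * D^2) instead of O(N^2 * D)."""
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)               # summarize keys/values once
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(1)) + eps)
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)

q = k = v = torch.randn(2, 100, 64)
print(linear_attention(q, k, v).shape)   # torch.Size([2, 100, 64])
```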
https://arxiv.org/abs/2405.01266
Bone segmentation is an essential step for the preoperative planning of fracture trauma surgery. The automated segmentation of fractured bone from computed tomography (CT) scans remains challenging due to large differences in fracture position and morphology, as well as the inherent anatomical characteristics of different bone structures. To alleviate these issues, we propose a cross-scale attention mechanism together with a surface supervision strategy for fractured bone segmentation in CT. Specifically, the cross-scale attention mechanism effectively aggregates features across different scales to provide a more powerful fracture representation. Moreover, the surface supervision strategy explicitly constrains the network to pay more attention to the bone boundary. The efficacy of the proposed method is evaluated on a public dataset of CT scans with hip fractures. The evaluation metrics are the Dice similarity coefficient (DSC), average symmetric surface distance (ASSD), and Hausdorff distance (95HD). The proposed method achieves an average DSC of 93.36%, an ASSD of 0.85 mm, and a 95HD of 7.51 mm. Our method offers an effective fracture segmentation approach for pelvic CT examinations and has the potential to improve the segmentation performance of other types of fractures.
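One common way to realize surface supervision, sketched below as an assumption rather than the authors' exact formulation: extract the boundary of the label mask by morphological erosion (via max-pooling) and up-weight the loss on boundary pixels.

```python
import torch
import torch.nn.functional as F

def boundary_map(mask, ksize=3):
    """mask: (B, 1, H, W) binary ground truth; returns a thin boundary map."""
    eroded = -F.max_pool2d(-mask, ksize, stride=1, padding=ksize // 2)
    return (mask - eroded).clamp(0, 1)

def surface_supervised_loss(logits, mask, w_surface=2.0):
    bce = F.binary_cross_entropy_with_logits(logits, mask, reduction="none")
    weight = 1.0 + w_surface * boundary_map(mask)     # up-weight boundary pixels
    return (weight * bce).mean()

pred = torch.randn(1, 1, 64, 64)
gt = (torch.rand(1, 1, 64, 64) > 0.5).float()
print(surface_supervised_loss(pred, gt))
```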
https://arxiv.org/abs/2405.01204
Transformer-based entropy models have gained prominence in recent years due to their superior ability to capture long-range dependencies in probability distribution estimation compared to convolution-based methods. However, previous transformer-based entropy models suffer from a sluggish coding process due to pixel-wise autoregression or duplicated computation during inference. In this paper, we propose a novel transformer-based entropy model called GroupedMixer, which enjoys both faster coding speed and better compression performance than previous transformer-based methods. Specifically, our approach builds upon group-wise autoregression by first partitioning the latent variables into groups along spatial-channel dimensions, and then entropy coding the groups with the proposed transformer-based entropy model. The global causal self-attention is decomposed into more efficient group-wise interactions, implemented using inner-group and cross-group token-mixers. The inner-group token-mixer incorporates contextual elements within a group while the cross-group token-mixer interacts with previously decoded groups. Alternate arrangement of two token-mixers enables global contextual reference. To further expedite the network inference, we introduce context cache optimization to GroupedMixer, which caches attention activation values in cross-group token-mixers and avoids complex and duplicated computation. Experimental results demonstrate that the proposed GroupedMixer yields the state-of-the-art rate-distortion performance with fast compression speed.
https://arxiv.org/abs/2405.01170
Graph Transformer, due to its global attention mechanism, has emerged as a new tool in dealing with graph-structured data. It is well recognized that the global attention mechanism considers a wider receptive field in a fully connected graph, leading many to believe that useful information can be extracted from all the nodes. In this paper, we challenge this belief: does the globalizing property always benefit Graph Transformers? We reveal the over-globalizing problem in Graph Transformer by presenting both empirical evidence and theoretical analysis, i.e., the current attention mechanism overly focuses on those distant nodes, while the near nodes, which actually contain most of the useful information, are relatively weakened. Then we propose a novel Bi-Level Global Graph Transformer with Collaborative Training (CoBFormer), including the inter-cluster and intra-cluster Transformers, to prevent the over-globalizing problem while keeping the ability to extract valuable information from distant nodes. Moreover, the collaborative training is proposed to improve the model's generalization ability with a theoretical guarantee. Extensive experiments on various graphs well validate the effectiveness of our proposed CoBFormer.
https://arxiv.org/abs/2405.01102
The 3D Swin Transformer (3D-ST), known for its hierarchical attention and window-based processing, excels at capturing intricate spatial relationships within images. The Spatial-Spectral Transformer (SST), meanwhile, specializes in modeling long-range dependencies through self-attention mechanisms. This paper therefore introduces a novel method: an attentional fusion of these two transformers to significantly enhance the classification performance of Hyperspectral Images (HSIs). What sets this approach apart is its emphasis on integrating the attentional mechanisms of both architectures. This integration not only refines the modeling of spatial and spectral information but also contributes to more precise and accurate classification results. Experiments and evaluation on benchmark HSI datasets underscore the importance of employing disjoint training, validation, and test samples. The results demonstrate the effectiveness of the fusion approach, showcasing its superiority over traditional methods and individual transformers. Incorporating disjoint samples enhances the robustness and reliability of the proposed methodology, emphasizing its potential for advancing hyperspectral image classification.
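A minimal sketch of attention-weighted fusion of two feature streams (here standing in for 3D-ST and SST outputs); the gating form and dimensions are assumptions, not the paper's exact fusion head.

```python
import torch
import torch.nn as nn

class AttentionalFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                  nn.Linear(dim, 2))

    def forward(self, f_swin, f_sst):              # both (B, dim)
        w = torch.softmax(self.gate(torch.cat([f_swin, f_sst], dim=-1)), dim=-1)
        return w[:, :1] * f_swin + w[:, 1:] * f_sst   # learned per-sample weighting

fused = AttentionalFusion(128)(torch.randn(4, 128), torch.randn(4, 128))
print(fused.shape)   # torch.Size([4, 128])
```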
https://arxiv.org/abs/2405.01095
In 2021, the pioneering work on TypeNet showed that keystroke dynamics verification could scale to hundreds of thousands of users with minimal performance degradation. Recently, the KVC-onGoing competition has provided an open and robust experimental protocol for evaluating keystroke dynamics verification systems of such scale, including considerations of algorithmic fairness. This article describes Type2Branch, the model and techniques that achieved the lowest error rates at the KVC-onGoing, in both desktop and mobile scenarios. The novelty aspects of the proposed Type2Branch include: i) synthesized timing features emphasizing user behavior deviation from the general population, ii) a dual-branch architecture combining recurrent and convolutional paths with various attention mechanisms, iii) a new loss function named Set2set that captures the global structure of the embedding space, and iv) a training curriculum of increasing difficulty. Considering five enrollment samples per subject of approximately 50 characters typed, the proposed Type2Branch achieves state-of-the-art performance with mean per-subject EERs of 0.77% and 1.03% on evaluation sets of respectively 15,000 and 5,000 subjects for desktop and mobile scenarios. With a uniform global threshold for all subjects, the EERs are 3.25% for desktop and 3.61% for mobile, outperforming previous approaches by a significant margin.
https://arxiv.org/abs/2405.01088
Deep learning-based motion deblurring techniques have advanced significantly in recent years. This class of techniques, however, does not carefully examine the inherent flaws in blurry images. For instance, low edge and structural information are traits of blurry images: the high-frequency component of a blurry image carries edge information, and the low-frequency component carries structure information. We propose a blind motion deblurring network (MCMS) based on multi-category information and a multi-scale stripe attention mechanism. Given the respective characteristics of the high-frequency and low-frequency components, a three-stage encoder-decoder model is designed. Specifically, the first stage focuses on extracting the features of the high-frequency component, the second stage concentrates on extracting the features of the low-frequency component, and the third stage integrates the extracted low-frequency component features, the extracted high-frequency component features, and the original blurred image in order to recover the final clear image. As a result, the model effectively improves motion deblurring by fusing the edge information of the high-frequency component and the structural information of the low-frequency component. In addition, a grouped feature fusion technique is developed to achieve richer and more comprehensive utilization of various types of features at a deep level. Next, a multi-scale stripe attention mechanism (MSSA) is designed, which effectively combines the anisotropy and multi-scale information of the image and significantly enhances the capability of the deep model in feature representation. Large-scale comparative studies on various datasets show that the strategy in this paper outperforms recently published methods.
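A hedged, single-scale sketch of a stripe-attention block: attention weights are produced from horizontal and vertical strip pooling and broadcast back over the feature map. The multi-scale stacking and the exact MSSA layout are omitted and are assumptions here.

```python
import torch
import torch.nn as nn

class StripeAttention(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv_h = nn.Conv2d(c, c, 1)
        self.conv_w = nn.Conv2d(c, c, 1)

    def forward(self, x):                      # x: (B, C, H, W)
        h = x.mean(dim=3, keepdim=True)        # (B, C, H, 1) horizontal stripes
        w = x.mean(dim=2, keepdim=True)        # (B, C, 1, W) vertical stripes
        attn = torch.sigmoid(self.conv_h(h) + self.conv_w(w))   # broadcast to (B,C,H,W)
        return x * attn

x = torch.randn(1, 32, 64, 64)
print(StripeAttention(32)(x).shape)   # torch.Size([1, 32, 64, 64])
```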
https://arxiv.org/abs/2405.01083
Reconstructing a hand mesh from a single RGB image is a challenging task because hands are often occluded by objects. Most previous works attempt to introduce additional information and adopt attention mechanisms to improve 3D reconstruction results, but this increases computational complexity. This observation prompts us to propose a new and concise architecture with improved computational efficiency. In this work, we propose a simple and effective 3D hand mesh reconstruction network, HandSSCA, which is the first to incorporate state space modeling into the field of hand pose estimation. In the network, we design a novel state space channel attention module that extends the effective receptive field, extracts hand features in the spatial dimension, and enhances hand regional features in the channel dimension. This design helps reconstruct a complete and detailed hand mesh. Extensive experiments conducted on well-known datasets featuring challenging hand-object occlusions (such as FREIHAND, DEXYCB, and HO3D) demonstrate that our proposed HandSSCA achieves state-of-the-art performance while maintaining a minimal parameter count.
https://arxiv.org/abs/2405.01066
Change detection, an interdisciplinary topic spanning computer vision and remote sensing, is currently receiving extensive attention and research. Due to the rapid development of society, the geographic information captured by remote sensing satellites is changing faster and becoming more complex, which undoubtedly poses a higher challenge and highlights the value of change detection tasks. We propose MFDS-Net, a Multi-Scale Feature Depth-Supervised Network for Remote Sensing Change Detection with Global Semantic and Detail Information, with the aim of achieving a more refined description of changing buildings and geographic information, enhancing the localisation of changing targets and the acquisition of weak features. To achieve the research objectives, we use a modified ResNet_34 as the backbone network for feature extraction and DO-Conv as an alternative to traditional convolution to better focus on associations among feature information and obtain better training results. We propose the Global Semantic Enhancement Module (GSEM) to enhance the processing of high-level semantic information from a global perspective. The Differential Feature Integration Module (DFIM) is proposed to strengthen the fusion of feature information at different depths, achieving the learning and extraction of differential features. The entire network is trained and optimized using a deep supervision mechanism. The experimental outcomes of MFDS-Net surpass those of current mainstream change detection networks. On the LEVIR dataset, it achieves an F1 score of 91.589 and an IoU of 84.483; on the WHU dataset, the scores are F1: 92.384 and IoU: 86.807; and on the GZ-CD dataset, the scores are F1: 86.377 and IoU: 76.021. The code is available at this https URL.
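Deep supervision of the kind mentioned in the abstract is usually implemented by attaching auxiliary predictions at several decoder depths and comparing each, after upsampling, with the label; the sketch below shows that standard pattern, with the per-depth weights chosen as an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def deep_supervision_loss(aux_logits, label, weights=(0.5, 0.75, 1.0)):
    """aux_logits: list of (B, 1, h_i, w_i) maps, shallow to deep; label: (B, 1, H, W)."""
    loss = 0.0
    for w, logit in zip(weights, aux_logits):
        up = F.interpolate(logit, size=label.shape[-2:],
                           mode="bilinear", align_corners=False)
        loss = loss + w * F.binary_cross_entropy_with_logits(up, label)
    return loss / sum(weights)

label = (torch.rand(2, 1, 256, 256) > 0.5).float()
aux = [torch.randn(2, 1, s, s) for s in (64, 128, 256)]
print(deep_supervision_loss(aux, label))
```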
https://arxiv.org/abs/2405.01065
Learning to solve vehicle routing problems (VRPs) has garnered much attention. However, most neural solvers are structured and trained independently for a specific problem, making them less generic and practical. In this paper, we aim to develop a unified neural solver that can cope with a range of VRP variants simultaneously. Specifically, we propose a multi-task vehicle routing solver with mixture-of-experts (MVMoE), which greatly enhances model capacity without a proportional increase in computation. We further develop a hierarchical gating mechanism for the MVMoE, delivering a good trade-off between empirical performance and computational complexity. Experimentally, our method significantly improves zero-shot generalization performance on 10 unseen VRP variants, and shows decent results in the few-shot setting and on real-world benchmark instances. We further provide extensive studies on the effect of MoE configurations in solving VRPs. Surprisingly, the hierarchical gating can achieve much better out-of-distribution generalization performance. The source code is available at: this https URL.
https://arxiv.org/abs/2405.01029
Identifying the layers within text-to-image models that control visual attributes can facilitate efficient model editing through closed-form updates. Recent work leveraging causal tracing shows that early Stable Diffusion variants confine knowledge primarily to the first layer of the CLIP text encoder, while it diffuses throughout the UNet. Extending this framework, we observe that for recent models (e.g., SD-XL, DeepFloyd), causal tracing fails to pinpoint localized knowledge, highlighting challenges in model editing. To address this issue, we introduce the concept of Mechanistic Localization in text-to-image models, where knowledge about various visual attributes (e.g., ``style'', ``objects'', ``facts'') can be mechanistically localized to a small fraction of layers in the UNet, thus facilitating efficient model editing. We localize knowledge using our method LocoGen, which measures the direct effect of intermediate layers on output generation by performing interventions in the cross-attention layers of the UNet. We then employ LocoEdit, a fast closed-form editing method, across popular open-source text-to-image models (including the latest SD-XL) and explore the possibilities of neuron-level model editing. Using Mechanistic Localization, our work offers a better view of the successes and failures of localization-based text-to-image model editing. Code will be available at \href{this https URL}{this https URL}.
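A toy sketch of the layer-intervention idea (not the released LocoGen code): the text conditioning is swapped at one cross-attention layer at a time, and the change in the output measures that layer's direct effect. The toy cross-attention stack below is an assumption standing in for a diffusion UNet.

```python
import torch
import torch.nn as nn

class ToyCrossAttnUNet(nn.Module):
    def __init__(self, dim=64, layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, 4, batch_first=True) for _ in range(layers)])

    def forward(self, x, text, swap_text=None, swap_at=()):
        for i, attn in enumerate(self.layers):
            cond = swap_text if i in swap_at else text   # intervene at chosen layers only
            x = x + attn(x, cond, cond)[0]
        return x

model = ToyCrossAttnUNet()
latents = torch.randn(1, 16, 64)
orig_text, altered_text = torch.randn(1, 8, 64), torch.randn(1, 8, 64)
base = model(latents, orig_text)
for layer in range(4):   # direct effect of each cross-attention layer
    out = model(latents, orig_text, swap_text=altered_text, swap_at={layer})
    print(layer, (out - base).abs().mean().item())
```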
https://arxiv.org/abs/2405.01008