Current 3D reconstruction techniques struggle to infer unbounded scenes faithfully from only a few images. Specifically, existing methods have high computational demands, require detailed pose information, and cannot reconstruct occluded regions reliably. We introduce 6Img-to-3D, an efficient, scalable transformer-based encoder-renderer method for single-shot image-to-3D reconstruction. Our method outputs a 3D-consistent, parameterized triplane from only six outward-facing input images of large-scale, unbounded outdoor driving scenarios. We take a step towards resolving existing shortcomings by combining contracted custom cross- and self-attention mechanisms for triplane parameterization, differentiable volume rendering, scene contraction, and image feature projection. We show that six surround-view vehicle images from a single timestamp, without global pose information, suffice to reconstruct 360$^{\circ}$ scenes at inference time in 395 ms. Our method enables, for example, rendering third-person images and bird's-eye views. Our code is available at this https URL, and more examples can be found on our website at this https URL.
https://arxiv.org/abs/2404.12378
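The scene contraction and triplane querying mentioned in the 6Img-to-3D abstract above can be sketched as follows. This is a minimal sketch under assumptions: the mip-NeRF-360-style contraction and the XY/XZ/YZ plane layout are illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def contract(x):
    """Squash unbounded points into a ball of radius 2 (mip-NeRF 360
    style contraction, assumed here): identity inside the unit ball,
    (2 - 1/||x||) * x/||x|| outside."""
    n = np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), 1e-9)
    return np.where(n <= 1.0, x, (2.0 - 1.0 / n) * x / n)

def bilinear(plane, u, v):
    """Bilinearly sample a (C, R, R) feature plane at (u, v) in [-1, 1]."""
    C, R, _ = plane.shape
    x = (u + 1.0) * 0.5 * (R - 1)
    y = (v + 1.0) * 0.5 * (R - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, R - 1), min(y0 + 1, R - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[:, y0, x0]
            + wx * (1 - wy) * plane[:, y0, x1]
            + (1 - wx) * wy * plane[:, y1, x0]
            + wx * wy * plane[:, y1, x1])

def query_triplane(planes, p):
    """Sum features sampled from the XY, XZ, and YZ planes at the
    contracted 3D point p (a common triplane convention, assumed)."""
    q = contract(p[None])[0] / 2.0  # contracted coords in [-2, 2] -> [-1, 1]
    return (bilinear(planes[0], q[0], q[1])
            + bilinear(planes[1], q[0], q[2])
            + bilinear(planes[2], q[1], q[2]))
```

A point far outside the unit ball (e.g. 100 m away) lands strictly inside the radius-2 ball, which is what lets a fixed-resolution triplane cover an unbounded driving scene.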
A substantial body of research has focused on developing systems that assist medical professionals during labor-intensive early screening processes, many based on convolutional deep-learning architectures. Recently, multiple studies explored the application of so-called self-attention mechanisms in the vision domain. These studies often report empirical improvements over fully convolutional approaches on various datasets and tasks. To evaluate this trend for medical imaging, we extend two widely adopted convolutional architectures with different self-attention variants on two different medical datasets. With this, we aim to specifically evaluate the possible advantages of additional self-attention. We compare our models with similarly sized convolutional and attention-based baselines and evaluate performance gains statistically. Additionally, we investigate how including such layers changes the features learned by these models during the training. Following a hyperparameter search, and contrary to our expectations, we observe no significant improvement in balanced accuracy over fully convolutional models. We also find that important features, such as dermoscopic structures in skin lesion images, are still not learned by employing self-attention. Finally, analyzing local explanations, we confirm biased feature usage. We conclude that merely incorporating attention is insufficient to surpass the performance of existing fully convolutional methods.
https://arxiv.org/abs/2404.12295
Physics-integrated generative modeling is a class of hybrid or grey-box modeling in which the data-driven model is augmented with the physics knowledge governing the data distribution. The physics knowledge allows the generative model to produce output in a controlled way, so that the output, by construction, complies with the physical laws. It imparts improved generalization, enabling extrapolation beyond the training distribution, as well as improved interpretability, because the model is partly grounded in firm domain knowledge. In this work, we aim to improve the fidelity of reconstruction and the robustness to noise of the physics-integrated generative model. To this end, we use a variational autoencoder (VAE) as the generative model. To improve the reconstruction results of the decoder, we propose to learn the latent posterior distribution of both the physics component and the trainable data-driven component using planar normalizing flows. The normalizing-flow-based posterior distribution harnesses the inherent dynamical structure of the data distribution, so the learned model gets closer to the true underlying data distribution. To improve the robustness of the generative model against noise injected into the model, we propose a modification to the encoder of the normalizing-flow-based VAE. We design the encoder to incorporate scaled-dot-product-attention-based contextual information into the noisy latent vector, which mitigates the adverse effect of noise in the latent vector and makes the model more robust. We empirically evaluate our models on a human locomotion dataset [33], and the results validate the efficacy of our proposed models in terms of improved reconstruction quality as well as robustness against noise injected into the model.
https://arxiv.org/abs/2404.12267
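The planar normalizing flow used for the posterior in the abstract above has a simple closed form, $f(z) = z + u\,\tanh(w^\top z + b)$, with a one-term log-determinant (Rezende and Mohamed's planar flow). The sketch below shows one such step; the parameter values in the usage are illustrative, and the paper stacks several such layers on the VAE posterior.

```python
import numpy as np

def planar_flow(z, u, w, b):
    """One planar flow step f(z) = z + u * tanh(w.z + b) applied to a
    batch z of shape (N, D), returning the transformed batch and the
    per-sample log|det Jacobian| = log|1 + u . psi(z)| with
    psi(z) = tanh'(w.z + b) * w."""
    a = z @ w + b                                  # (N,)
    f = z + np.outer(np.tanh(a), u)                # (N, D)
    psi = (1.0 - np.tanh(a) ** 2)[:, None] * w     # (N, D)
    logdet = np.log(np.abs(1.0 + psi @ u))         # (N,)
    return f, logdet
```

Stacking K such steps and summing the log-determinants gives the flow-corrected posterior density; invertibility requires $w^\top u \ge -1$, which in practice is enforced by reparameterizing $u$.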
Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs "women, power, female," concept induction produces high-level concepts such as "Criticism of traditional gender roles" and "Dismissal of women's concerns." We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM's concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.
https://arxiv.org/abs/2404.12259
The study of human emotions, traditionally a cornerstone in fields like psychology and neuroscience, has been profoundly impacted by the advent of artificial intelligence (AI). Multiple channels, such as speech (voice) and facial expressions (image), are crucial in understanding human emotions. However, AI's journey in multimodal emotion recognition (MER) is marked by substantial technical challenges. One significant hurdle is how AI models manage the absence of a particular modality - a frequent occurrence in real-world situations. This study's central focus is assessing the performance and resilience of two strategies when confronted with the lack of one modality: a novel multimodal dynamic modality and view selection and a cross-attention mechanism. Results on the RECOLA dataset show that dynamic selection-based methods are a promising approach for MER. In the missing modalities scenarios, all dynamic selection-based methods outperformed the baseline. The study concludes by emphasizing the intricate interplay between audio and video modalities in emotion prediction, showcasing the adaptability of dynamic selection methods in handling missing modalities.
https://arxiv.org/abs/2404.12251
Understanding how attention varies across individuals has significant scientific and societal impacts. However, existing visual scanpath models treat attention uniformly, neglecting individual differences. To bridge this gap, this paper focuses on individualized scanpath prediction (ISP), a new attention modeling task that aims to accurately predict how different individuals shift their attention in diverse visual tasks. It proposes an ISP method featuring three novel technical components: (1) an observer encoder to characterize and integrate an observer's unique attention traits, (2) an observer-centric feature integration approach that holistically combines visual features, task guidance, and observer-specific characteristics, and (3) an adaptive fixation prioritization mechanism that refines scanpath predictions by dynamically prioritizing semantic feature maps based on individual observers' attention traits. These novel components allow scanpath models to effectively address the attention variations across different observers. Our method is generally applicable to different datasets, model architectures, and visual tasks, offering a comprehensive tool for transforming general scanpath models into individualized ones. Comprehensive evaluations using value-based and ranking-based metrics verify the method's effectiveness and generalizability.
https://arxiv.org/abs/2404.12235
Generalizing to longer sequences is important for recent Transformer-based language models. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of its attention distributions. We propose a parameter-efficient tuning method that searches for the attention heads' best temperature hyper-parameters, which substantially expands NoPE's context size. Experiments on long-sequence language modeling, the synthetic passkey retrieval task, and real-world long-context tasks show that NoPE can achieve performance competitive with state-of-the-art length generalization algorithms. The source code is publicly accessible.
https://arxiv.org/abs/2404.12224
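The temperature hyper-parameter the NoPE abstract above searches over acts as a per-head sharpening knob on the attention softmax. A minimal sketch of a single head (the per-head form and the magnitude of the effect here are illustrative, not the paper's tuning procedure):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def head_attention(q, k, temperature):
    """Attention probabilities for one head with a tunable temperature.
    Temperatures below 1 sharpen the distribution, countering the
    'distracted' (overly flat) attention that the paper links to
    NoPE's length-generalization failures."""
    d = q.shape[-1]
    logits = (q @ k.T) / (np.sqrt(d) * temperature)
    return softmax(logits, axis=-1)

def mean_entropy(p):
    """Average entropy of the attention rows; a flatness measure."""
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())
```

Searching one scalar per head is what makes the tuning parameter-efficient: the number of tuned values equals the number of heads, not the number of model weights.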
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we ask whether extremely simple, small-scale ViTs can also benefit from this pre-training paradigm through fine-tuning, a question that remains considerably less studied than the well-established methodology of designing lightweight architectures with sophisticated components. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe different behaviors between MIM and CL with respect to downstream fine-tuning data scales. Furthermore, we analyze the frozen features under linear probing evaluation, as well as the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training in higher layers, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding naturally guides the choice of appropriate distillation strategies during pre-training to solve this deterioration problem. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical designs (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and the LaSOT visual tracking task (66.1% AUC) in the lightweight regime; the latter even surpasses all current SOTA lightweight CPU-realtime trackers.
https://arxiv.org/abs/2404.12210
Lexicon-based retrieval has gained significant popularity in text retrieval due to its efficient and robust performance. To further enhance the performance of lexicon-based retrieval, researchers have been diligently incorporating state-of-the-art methodologies such as neural retrieval and text-level contrastive learning. Nonetheless, despite promising outcomes, current lexicon-based retrieval methods have paid limited attention to the potential benefits of feature context representations and term-level knowledge guidance. In this paper, we introduce an innovative method built on FEature Context and TErm-level Knowledge modules (FecTek). To effectively enrich the feature context representations of term weights, the Feature Context Module (FCM) is introduced, which leverages the power of BERT's representations to determine dynamic weights for each element in the embedding. Additionally, we develop a Term-level Knowledge Guidance Module (TKGM) that effectively utilizes term-level knowledge to intelligently guide the modeling of term weights. Evaluation of the proposed method on the MS MARCO benchmark demonstrates its superiority over previous state-of-the-art approaches.
https://arxiv.org/abs/2404.12152
Change detection (CD) from remote sensing (RS) images using deep learning has been widely investigated in the literature. It is typically regarded as a pixel-wise labeling task that aims to classify each pixel as changed or unchanged. Although per-pixel classification networks in encoder-decoder structures have shown dominance, they still suffer from imprecise boundaries and incomplete object delineation in various scenes. For high-resolution RS images, partly or totally changed objects are more worthy of attention than single pixels. Therefore, we revisit the CD task from the mask prediction and classification perspective and propose MaskCD, which detects changed areas by adaptively generating categorized masks from input image pairs. Specifically, it utilizes a cross-level change representation perceiver (CLCRP) to learn multiscale change-aware representations and capture spatiotemporal relations from encoded features by exploiting deformable multi-head self-attention (DeformMHSA). Subsequently, a masked-attention-based detection transformer (MA-DETR) decoder is developed to accurately locate and identify changed objects based on masked-attention and self-attention mechanisms. It reconstructs the desired changed objects by decoding the pixel-wise representations into learnable mask proposals and making final predictions from these candidates. Experimental results on five benchmark datasets demonstrate that the proposed approach outperforms other state-of-the-art models. Codes and pretrained models are available online (this https URL).
https://arxiv.org/abs/2404.12081
Data-free knowledge distillation (DFKD) is a promising approach for addressing issues related to model compression, security and privacy, and transmission restrictions. Although existing DFKD methods have achieved inspiring results in coarse-grained classification, they yield sub-optimal results in practical applications involving fine-grained classification tasks, which require more detailed distinctions between similar categories. To address this issue, we propose an approach called DFKD-FGVC that extends DFKD to fine-grained visual categorization (FGVC) tasks. Our approach utilizes an adversarial distillation framework with an attention generator, mixed high-order attention distillation, and semantic feature contrastive learning. Specifically, we introduce a spatial-wise attention mechanism into the generator to synthesize fine-grained images with more details of discriminative parts. We also utilize the mixed high-order attention mechanism to capture complex interactions among parts and subtle differences among the discriminative features of fine-grained categories, attending to both local features and semantic context relationships. Moreover, we leverage the teacher and student models of the distillation framework to contrast high-level semantic feature maps in the hyperspace, comparing the variances of different categories. We evaluate our approach on three widely used FGVC benchmarks (Aircraft, Cars196, and CUB200) and demonstrate its superior performance.
https://arxiv.org/abs/2404.12037
Large language models (LLMs) have recently shown remarkable performance across a wide range of tasks. However, their substantial number of parameters contributes to significant latency during inference. This is particularly evident with autoregressive decoding, which generates one token per forward pass and thus does not fully capitalize on the parallel computing capabilities of GPUs. In this paper, we propose a novel parallel decoding approach, namely \textit{hidden transfer}, which decodes multiple successive tokens simultaneously in a single forward pass. The idea is to transfer the intermediate hidden states of the previous context to the \textit{pseudo} hidden states of the future tokens to be generated; the pseudo hidden states then pass through the following transformer layers, assimilating more semantic information and achieving superior predictive accuracy for the future tokens. Besides, we use a novel tree attention mechanism to simultaneously generate and verify multiple candidate output sequences, which ensures lossless generation and further improves the generation efficiency of our method. Experiments demonstrate the effectiveness of our method, and extensive analytic experiments validate our motivation. In terms of acceleration metrics, we outperform all single-model acceleration techniques, including Medusa and self-speculative decoding.
https://arxiv.org/abs/2404.12022
Multi-modal relation extraction (MMRE) is a challenging task that aims to identify relations between entities in text by leveraging image information. Existing methods are limited by neglecting that multiple entity pairs in one sentence share very similar contextual information (i.e., the same text and image), which increases the difficulty of the MMRE task. To address this limitation, we propose the Variational Multi-Modal Hypergraph Attention Network (VM-HAN) for multi-modal relation extraction. Specifically, we first construct a multi-modal hypergraph for each sentence and its corresponding image, to establish different high-order intra-/inter-modal correlations for the different entity pairs in each sentence. We further design Variational Hypergraph Attention Networks (V-HAN) to obtain representational diversity among different entity pairs using Gaussian distributions and to learn a better hypergraph structure via variational attention. VM-HAN achieves state-of-the-art performance on the multi-modal relation extraction task, outperforming existing methods in terms of accuracy and efficiency.
https://arxiv.org/abs/2404.12006
Semantic scene completion, also known as semantic occupancy prediction, can provide dense geometric and semantic information for autonomous vehicles, attracting increasing attention from both academia and industry. Unfortunately, existing methods usually formulate this task as a voxel-wise classification problem and treat every voxel in 3D space equally during training. Because hard voxels have not received enough attention, performance in some challenging regions is limited. The 3D dense space typically contains a large number of empty voxels, which are easy to learn but, since existing models handle all voxels uniformly, still demand large amounts of computation. Furthermore, voxels in boundary regions are more challenging to differentiate than those in the interior. In this paper, we propose HASSC, an approach to train semantic scene completion models with a hardness-aware design. A global hardness, derived from the network optimization process, is defined for dynamic hard-voxel selection. Then, a local hardness with geometric anisotropy is adopted for voxel-wise refinement. Besides, a self-distillation strategy is introduced to make the training process stable and consistent. Extensive experiments show that our HASSC scheme effectively improves the accuracy of the baseline model without incurring extra inference cost. Source code is available at: this https URL.
https://arxiv.org/abs/2404.11958
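The hard-voxel selection in the HASSC abstract above can be sketched with a simple proxy: rank voxels by their current loss and upweight the hardest fraction. This is a minimal sketch under assumptions; the paper derives global hardness from the network optimization process and adds a geometric local-hardness term, neither of which is modeled here.

```python
import numpy as np

def select_hard_voxels(per_voxel_loss, ratio=0.1):
    """Return flat indices of the top `ratio` fraction of voxels by
    loss (a stand-in for the paper's global hardness criterion)."""
    k = max(1, int(ratio * per_voxel_loss.size))
    return np.argsort(per_voxel_loss.ravel())[::-1][:k]

def hardness_weighted_loss(per_voxel_loss, ratio=0.1, hard_weight=2.0):
    """Mean training loss with the selected hard voxels upweighted,
    so easy (e.g. empty) voxels no longer dominate the gradient."""
    w = np.ones(per_voxel_loss.size)
    w[select_hard_voxels(per_voxel_loss, ratio)] = hard_weight
    return float((w * per_voxel_loss.ravel()).mean())
```

Because the selection only reweights the training objective, the trained model's architecture and inference path are untouched, matching the abstract's claim of no extra inference cost.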
Existing neural radiance field (NeRF)-based novel view synthesis methods for large-scale outdoor scenes are mainly built for a single altitude. Moreover, they often require a priori knowledge of the camera's shooting height and the scene scope, leading to inefficient and impractical application when the camera altitude changes. In this work, we propose an end-to-end framework, termed AG-NeRF, and seek to reduce the training cost of building good reconstructions by synthesizing free-viewpoint images across varying scene altitudes. Specifically, to tackle the detail variation problem from low altitude (drone-level) to high altitude (satellite-level), a source image selection method and an attention-based feature fusion approach are developed to extract from multi-height images, and fuse, the features of the target view most relevant for high-fidelity rendering. Extensive experiments demonstrate that AG-NeRF achieves SOTA performance on the 56 Leonard and Transamerica benchmarks and requires only half an hour of training to reach a PSNR competitive with the latest BungeeNeRF.
https://arxiv.org/abs/2404.11897
Precise image editing with text-to-image models has attracted increasing interest due to their remarkable generative capabilities and user-friendly nature. However, such attempts face the pivotal challenge of misalignment between the intended precise editing target regions and the broader area impacted by the guidance in practice. Although excellent methods leveraging attention mechanisms have been developed to refine the editing guidance, these approaches require modifications to complex network architectures and are limited to specific editing tasks. In this work, we re-examine the diffusion process and the misalignment problem from a frequency perspective, revealing that, due to the power law of natural images and the decaying noise schedule, the denoising network primarily recovers low-frequency image components during the earlier timesteps and thus brings excessive low-frequency signals into the editing. Leveraging this insight, we introduce a novel fine-tuning-free approach that employs progressive $\textbf{Fre}$qu$\textbf{e}$ncy truncation to refine the guidance of $\textbf{Diff}$usion models for universal editing tasks ($\textbf{FreeDiff}$). Our method achieves results comparable to state-of-the-art methods across a variety of editing tasks and on a diverse set of images, highlighting its potential as a versatile tool in image editing applications.
https://arxiv.org/abs/2404.11895
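The frequency truncation in the FreeDiff abstract above amounts to removing the excessive low-frequency content from the guidance signal. A minimal sketch, assuming a simple radial cutoff in the 2D FFT domain (the paper's progressive, timestep-dependent schedule is not modeled here):

```python
import numpy as np

def truncate_low_freq(guidance, cutoff):
    """Zero out spatial-frequency components with radius < cutoff
    (in FFT bins) from a 2D guidance map, keeping only the higher
    frequencies that carry edit-relevant detail."""
    G = np.fft.fftshift(np.fft.fft2(guidance))
    h, w = guidance.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    G[r < cutoff] = 0.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(G)))
```

With `cutoff=1.0` only the DC bin is removed, so the filtered guidance has zero mean; larger cutoffs (or a cutoff that grows over early timesteps, as "progressive" suggests) remove progressively more of the low-frequency band.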
X-ray images play a vital role in intraoperative processes due to their high resolution and fast imaging speed, and they greatly facilitate subsequent segmentation, registration, and reconstruction. However, excessive X-ray doses pose potential risks to human health. Data-driven algorithms mapping volume scans to X-ray images are restricted by the scarcity of paired X-ray and volume data, and existing methods are mainly realized by modeling the whole X-ray imaging procedure. In this study, we propose a learning-based approach, termed CT2X-GAN, to synthesize X-ray images in an end-to-end manner using content and style disentanglement from three different image domains. Our method decouples anatomical structure information from CT scans and style information from unpaired real X-ray images / digitally reconstructed radiography (DRR) images via a series of decoupling encoders. Additionally, we introduce a novel consistency regularization term to improve the stylistic resemblance between synthesized X-ray images and real X-ray images. Meanwhile, we also impose a supervised process by computing the similarity between computed real DRR and synthesized DRR images. We further develop a pose attention module to fully strengthen the comprehensive information in the content code decoupled from CT scans, facilitating high-quality multi-view image synthesis in the lower 2D space. Extensive experiments conducted on the publicly available CTSpine1K dataset achieved 97.8350, 0.0842, and 3.0938 in terms of FID, KID, and a user-scored X-ray similarity, respectively. Compared with 3D-aware methods ($\pi$-GAN, EG3D), CT2X-GAN is superior in synthesis quality and realism with respect to real X-ray images.
https://arxiv.org/abs/2404.11889
Multi-head self-attention (MSA) is a key component of Vision Transformers (ViTs), which have achieved great success in various vision tasks. However, their high computational cost and memory footprint hinder their deployment on resource-constrained devices. Conventional pruning approaches can compress and accelerate the MSA module only through head pruning, even though a head is not an atomic unit. To address this issue, we propose a novel graph-aware neuron-level pruning method, Structured Neuron-level Pruning (SNP). SNP prunes neurons with less informative attention scores and eliminates redundancy among heads. Specifically, it prunes graphically connected query and key layers having the least informative attention scores while preserving the overall attention scores. Value layers, which can be pruned independently, are pruned to eliminate inter-head redundancy. Our proposed method effectively compresses and accelerates Transformer-based models for both edge devices and server processors. For instance, DeiT-Small with SNP runs 3.1$\times$ faster than the original model and is 21.94\% faster than DeiT-Tiny while achieving 1.12\% higher accuracy. Additionally, SNP combines successfully with conventional head or block pruning approaches: SNP with head pruning can compress DeiT-Base by 80\% in parameters and computational cost and achieve 3.85$\times$ faster inference on an RTX 3090 and 4.93$\times$ on a Jetson Nano.
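The key structural observation in SNP is that column $j$ of $W_q$ only ever interacts with column $j$ of $W_k$ in the attention logits, so query/key neurons must be pruned as graphically connected pairs. A minimal numpy sketch of that pairing is below; the joint-column-norm score is an illustrative stand-in for the paper's attention-score criterion.

```python
import numpy as np

def snp_prune_qk(W_q, W_k, keep_ratio=0.5):
    """Prune paired query/key neurons (columns of W_q and W_k together),
    keeping the pairs whose joint contribution to the attention logits
    is largest. Scoring by joint column norm is an assumption here.
    W_q, W_k: (d_model, d_head) projection matrices of one head."""
    d = W_q.shape[1]
    n_keep = max(1, int(d * keep_ratio))
    # logits = (x W_q)(x W_k)^T, so column j of W_q pairs with column j
    # of W_k -- removing one without the other would change all logits
    scores = np.linalg.norm(W_q, axis=0) * np.linalg.norm(W_k, axis=0)
    keep = np.sort(np.argsort(scores)[-n_keep:])
    return W_q[:, keep], W_k[:, keep]
```

Because the surviving columns dominate the logit magnitudes, $QK^\top$ computed from the pruned matrices approximates the original attention scores at half the projection cost.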
https://arxiv.org/abs/2404.11630
Recently, in the super-resolution (SR) domain, transformers have outperformed CNNs with fewer FLOPs and fewer parameters, since they can model long-range dependencies and adaptively adjust weights per instance. In this paper, we demonstrate that CNNs, although less focused on in the current SR domain, surpass Transformers in direct efficiency measures. By incorporating the advantages of Transformers into CNNs, we aim to achieve both computational efficiency and enhanced performance. However, using a large kernel in the SR domain, which mainly processes large images, incurs substantial computational overhead. To overcome this, we propose novel approaches to employing large kernels, which reduce latency by 86\% compared to a naive large kernel, and leverage an Element-wise Attention module to imitate instance-dependent weights. As a result, we introduce Partial Large Kernel CNNs for Efficient Super-Resolution (PLKSR), which achieves state-of-the-art performance on four datasets at a scale of $\times$4, with reductions of 68.1\% in latency and 80.2\% in maximum GPU memory occupancy compared to SRFormer-light.
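The "partial" idea in the abstract, applying the expensive large kernel to only a slice of the channels and passing the rest through unchanged, can be sketched in a few lines of numpy. The split ratio, the shared depthwise kernel, and the zero "same" padding below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def partial_large_kernel_conv(x, kernel, frac=0.25):
    """Apply a large depthwise kernel to the first `frac` of channels
    and pass the remaining channels through as an identity.
    x: (C, H, W); kernel: (k, k) with odd k."""
    C, H, W = x.shape
    k = kernel.shape[0]
    p = k // 2
    n = max(1, int(C * frac))          # channels that see the large kernel
    out = x.copy()                     # untouched channels stay identical
    padded = np.pad(x[:n], ((0, 0), (p, p), (p, p)))
    for c in range(n):                 # naive depthwise correlation
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(padded[c, i:i + k, j:j + k] * kernel)
    return out
```

With `frac=0.25`, only a quarter of the channels pay the $O(k^2)$ cost of the large kernel, which is the source of the latency reduction the abstract reports; a later pointwise convolution can then mix the convolved and identity channels.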
https://arxiv.org/abs/2404.11848
Medical imaging has been used for the diagnosis of various conditions, making it one of the most powerful resources for effective patient care. Owing to its widespread availability, low cost, and low radiation, chest X-ray is one of the most sought-after radiology examinations for the diagnosis of various thoracic diseases. Due to advancements in medical imaging technologies and increasing patient load, the current radiology workflow faces various challenges, including growing backlogs, long working hours, and an increase in diagnostic errors. An automated computer-aided diagnosis system that interprets chest X-rays can augment radiologists by providing a second opinion and highlighting relevant regions in the image, in turn expediting the clinical workflow, reducing diagnostic errors, and improving patient care. In this study, we applied a novel architecture that augments the DenseNet121 convolutional neural network (CNN) with a transformer-based multi-head self-attention mechanism, namely SA-DenseNet121, to identify multiple thoracic diseases in chest X-rays. We conducted experiments on four of the largest chest X-ray datasets: ChestX-ray14, CheXpert, MIMIC-CXR-JPG, and IU-CXR. Experimental results in terms of area under the receiver operating characteristic curve (AUC-ROC) show that augmenting a CNN with self-attention has potential for diagnosing different thoracic diseases from chest X-rays. The proposed methodology has the potential to support the reading workflow, improve efficiency, and reduce diagnostic errors.
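Augmenting a CNN with multi-head self-attention, as SA-DenseNet121 does with DenseNet121, amounts to treating each spatial location of a feature map as a token and attending across them. A minimal numpy sketch of that step is below; the weight shapes, head count, and lack of output projection are illustrative simplifications, not the paper's exact architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msa_over_feature_map(fmap, Wq, Wk, Wv, n_heads=4):
    """Multi-head self-attention over a CNN feature map: each of the
    H*W spatial positions becomes a token of dimension C.
    fmap: (C, H, W); Wq, Wk, Wv: (C, C); C divisible by n_heads."""
    C, H, W = fmap.shape
    tokens = fmap.reshape(C, H * W).T                       # (N, C), N = H*W
    N, dh = tokens.shape[0], C // n_heads
    def split(Wx):                                          # (heads, N, dh)
        return (tokens @ Wx).reshape(N, n_heads, dh).transpose(1, 0, 2)
    Q, K, V = split(Wq), split(Wk), split(Wv)
    attn = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, N, N)
    out = (attn @ V).transpose(1, 0, 2).reshape(N, C)       # merge heads
    return out.T.reshape(C, H, W)                           # back to (C, H, W)
```

Because attention mixes information across all positions, each output location can depend on distant regions of the X-ray, which is the complementary capability such layers are meant to add to local convolutions.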
https://arxiv.org/abs/2404.11843