Text-driven video generation has witnessed rapid progress. However, text prompts alone cannot accurately depict the desired subject appearance that aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control than text prompts. Specifically, we propose VideoBooth, a feed-forward framework with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from an image encoder provide a high-level encoding of the image prompt, while fine visual embeddings from the proposed attention injection module provide a multi-scale, detailed encoding. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at the fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and is then propagated to the remaining frames, maintaining temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized, high-quality videos with subjects specified by image prompts. Notably, VideoBooth is a generalizable framework in which a single model handles a wide range of image prompts with a feed-forward pass.
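To make the fine-level design concrete, below is a minimal PyTorch sketch of attention injection under my own assumptions: every frame attends to the first frame's tokens, and the image-prompt tokens at the matching scale are appended as extra keys and values. All names are illustrative; this is not the authors' implementation.

import torch

def cross_frame_attention_with_prompt(frame_tokens, prompt_tokens, to_q, to_k, to_v):
    # frame_tokens: (T, L, C) latent tokens for T frames; prompt_tokens: (Lp, C)
    T, L, C = frame_tokens.shape
    q = to_q(frame_tokens)                                       # queries from every frame
    kv_src = torch.cat([frame_tokens[0], prompt_tokens], dim=0)  # first frame + image prompt
    k, v = to_k(kv_src), to_v(kv_src)                            # (L + Lp, C) keys/values
    attn = torch.softmax(q @ k.T / C ** 0.5, dim=-1)             # (T, L, L + Lp)
    return attn @ v                                              # refined tokens for all frames

T, L, Lp, C = 4, 16, 8, 64
lin = lambda: torch.nn.Linear(C, C, bias=False)
out = cross_frame_attention_with_prompt(torch.randn(T, L, C), torch.randn(Lp, C), lin(), lin(), lin())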
https://arxiv.org/abs/2312.00777
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
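As a rough illustration of the selection mechanism, here is a numpy sketch of a selective SSM recurrence as I read the abstract: the discretization step size and the input/output matrices are functions of the current token, so the state can propagate or forget content token by token. The parameterization is illustrative and ignores Mamba's hardware-aware parallel scan.

import numpy as np

def selective_ssm(x, W_delta, W_B, W_C, A):
    # x: (T, D) tokens; A: (D, N) fixed state matrix; delta, B, C depend on x[t].
    T, D = x.shape
    h = np.zeros(A.shape)                                   # (D, N) hidden state
    ys = []
    for t in range(T):
        delta = np.log1p(np.exp(x[t] @ W_delta))[:, None]   # softplus step size, (D, 1)
        B, C = x[t] @ W_B, x[t] @ W_C                       # input-dependent, (N,) each
        h = np.exp(delta * A) * h + (delta * B[None, :]) * x[t][:, None]  # discretized update
        ys.append(h @ C)                                    # input-dependent readout, (D,)
    return np.stack(ys)                                     # (T, D)

rng = np.random.default_rng(0)
T, D, N = 10, 8, 4
y = selective_ssm(rng.standard_normal((T, D)), 0.1 * rng.standard_normal((D, D)),
                  0.1 * rng.standard_normal((D, N)), 0.1 * rng.standard_normal((D, N)),
                  -np.abs(rng.standard_normal((D, N))))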
https://arxiv.org/abs/2312.00752
Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
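One plausible reading of the resulting mechanism, as a numpy sketch: the attention output AV is the "smooth" part, and a fidelity term proportional to the penalized difference V - AV is added back. The weight lam and the exact placement of the correction are my assumptions, not necessarily the paper's final formulation.

import numpy as np

def fidelity_regularized_attention(Q, K, V, lam=0.6):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A = A / A.sum(-1, keepdims=True)          # row-stochastic attention matrix
    smooth = A @ V                            # the over-smoothed output tokens
    return smooth + lam * (V - smooth)        # pull outputs back toward the inputs

rng = np.random.default_rng(0)
Q = K = V = rng.standard_normal((16, 32))
out = fidelity_regularized_attention(Q, K, V)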
https://arxiv.org/abs/2312.00751
Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at this https URL.
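As an illustration of computing features only where needed, here is a hedged PyTorch sketch: a sparse set of BeV cells (chosen at random here) is bilinearly sampled from an image feature map, skipping the rest of the grid. The selection and projection logic are placeholders, not the paper's Sparse Feature Pulling module.

import torch
import torch.nn.functional as F

def pull_sparse_bev_features(img_feats, cells_uv):
    # img_feats: (1, C, H, W); cells_uv: (M, 2) projected cell coords in [-1, 1]
    grid = cells_uv.view(1, 1, -1, 2)                              # layout expected by grid_sample
    sampled = F.grid_sample(img_feats, grid, align_corners=False)  # (1, C, 1, M)
    return sampled.view(img_feats.shape[1], -1).T                  # (M, C), one feature per active cell

feats = torch.randn(1, 64, 32, 88)                 # camera feature map
active = torch.rand(500, 2) * 2 - 1                # 500 cells instead of a dense 200x200 grid
bev_feats = pull_sparse_bev_features(feats, active)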
https://arxiv.org/abs/2312.00703
We present GIFT (Generative Interpretable Fine-tuning Transformers) for fine-tuning pretrained (often large) Transformer models on downstream tasks in a parameter-efficient way with built-in interpretability. Our GIFT is a deep parameter-residual learning method, which addresses two problems in fine-tuning a pretrained Transformer model: Where to apply the parameter-efficient fine-tuning (PEFT) so that it is extremely lightweight yet sufficiently expressive, and How to learn the PEFT so that it better exploits the knowledge of the pretrained model in a direct way? For the former, we select the final projection (linear) layer in the multi-head self-attention of a Transformer model and verify its effectiveness. For the latter, in contrast to prior art that directly introduces new model parameters (often in low-rank approximation form) to be learned in fine-tuning with downstream data, we propose a method for learning to generate the fine-tuning parameters. Our GIFT is a hyper-Transformer that takes as input the pretrained parameters of the projection layer and generates its fine-tuning parameters using a proposed Parameter-to-Cluster Attention (PaCa). The PaCa results in a simple clustering-based forward explainer that plays the role of semantic segmentation in testing. In experiments, our proposed GIFT is tested on the VTAB benchmark and the fine-grained visual classification (FGVC) benchmark. It obtains significantly better performance than the prior art. Our code is available at this https URL
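The "learning to generate the fine-tuning parameters" idea can be sketched as follows in PyTorch: a small trainable hyper-network maps the frozen pretrained projection weight to a residual update, and only the hyper-network is trained. This toy version stands in for the PaCa-based hyper-Transformer, whose details are not reproduced here; ToyGIFT and its rank r are my own names and assumptions.

import torch
import torch.nn as nn

class ToyGIFT(nn.Module):
    def __init__(self, d, r=16):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)    # acts row-wise on the weight matrix
        self.up = nn.Linear(r, d, bias=False)
        nn.init.zeros_(self.up.weight)             # start exactly at the pretrained weights

    def forward(self, W):                          # W: (d_out, d_in), frozen
        return W + self.up(self.down(W))           # generated fine-tuned projection weight

W = torch.randn(768, 768)                          # pretrained projection layer weight
W_ft = ToyGIFT(768)(W)                             # used in place of W at run time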
https://arxiv.org/abs/2312.00700
Table Structure Recognition (TSR) aims at transforming unstructured table images into structured formats, such as HTML sequences. One popular type of solution uses detection models to detect components of a table, such as columns and rows, then applies a rule-based post-processing method to convert detection results into HTML sequences. However, existing detection-based studies often have the following limitations. First, these studies usually pay more attention to improving the detection performance, which does not necessarily lead to better performance regarding cell-level metrics, such as TEDS. Second, some solutions over-simplify the problem and can miss some critical information. Lastly, even though some studies defined the problem to detect more components to provide as much information as other types of solutions, these studies ignore the fact that this problem definition is multi-label detection, because row, projected row header, and column header can share identical bounding boxes. Besides, there is often a performance gap between two-stage and transformer-based detection models regarding the structure-only TEDS, even though they have similar performance regarding the COCO metrics. Therefore, we revisit the limitations of existing detection-based solutions, compare two-stage and transformer-based detection models, and identify the key design aspects for the success of a two-stage detection model for the TSR task, including the multi-class problem definition, the aspect ratio for anchor box generation, and the feature generation of the backbone network. We applied simple methods to improve these aspects of the Cascade R-CNN model, achieved state-of-the-art performance, and improved the baseline Cascade R-CNN model by 19.32%, 11.56% and 14.77% regarding the structure-only TEDS on the SciTSR, FinTabNet, and PubTables1M datasets.
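The anchor-ratio point is concrete enough to illustrate in a few lines of Python: table rows and projected row headers are extremely wide and flat, while columns are tall and narrow, so a two-stage detector's anchor generator needs height/width ratios far outside the usual {0.5, 1, 2}. The values below are illustrative, not the paper's configuration.

row_like_ratios = [1 / 32, 1 / 16, 1 / 8]   # flat boxes: rows, (projected) row headers
col_like_ratios = [8, 16, 32]               # tall boxes: columns, column headers
anchor_ratios = sorted(row_like_ratios + [1.0] + col_like_ratios)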
https://arxiv.org/abs/2312.00699
The neural architectures of language models are becoming increasingly complex, especially that of Transformers, based on the attention mechanism. Although their application to numerous natural language processing tasks has proven to be very fruitful, they continue to be models with little or no interpretability and explainability. One of the tasks for which they are best suited is the encoding of the contextual sense of words using contextualized embeddings. In this paper we propose a transparent, interpretable, and linguistically motivated strategy for encoding the contextual sense of words by modeling semantic compositionality. Particular attention is given to dependency relations and semantic notions such as selection preferences and paradigmatic classes. A partial implementation of the proposed model is carried out and compared with Transformer-based architectures for a given semantic task, namely the similarity calculation of word senses in context. The results obtained show that it is possible to be competitive with linguistically motivated models instead of using the black boxes underlying complex neural architectures.
https://arxiv.org/abs/2312.00680
The current paradigm of large-scale pre-training and fine-tuning Transformer large language models has led to significant improvements across the board in natural language processing. However, such large models are susceptible to overfitting to their training data, and as a result the models perform poorly when the domain changes. Also, due to the model's scale, the cost of fine-tuning the model to a new domain is large. Nonparametric Variational Information Bottleneck (NVIB) has been proposed as a regulariser for training cross-attention in Transformers, potentially addressing the overfitting problem. We extend the NVIB framework to replace all types of attention functions in Transformers, and show that existing pretrained Transformers can be reinterpreted as Nonparametric Variational (NV) models using a proposed identity initialisation. We then show that changing the initialisation introduces a novel, information-theoretic post-training regularisation in the attention mechanism, which improves out-of-domain generalisation without any training. This success supports the hypothesis that pretrained Transformers are implicitly NV Bayesian models.
https://arxiv.org/abs/2312.00662
We consider transferability estimation, the problem of estimating how well deep learning models transfer from a source to a target task. We focus on regression tasks, which received little previous attention, and propose two simple and computationally efficient approaches that estimate transferability based on the negative regularized mean squared error of a linear regression model. We prove novel theoretical results connecting our approaches to the actual transferability of the optimal target models obtained from the transfer learning process. Despite their simplicity, our approaches significantly outperform existing state-of-the-art regression transferability estimators in both accuracy and efficiency. On two large-scale keypoint regression benchmarks, our approaches yield 12% to 36% better results on average while being at least 27% faster than previous state-of-the-art methods.
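A small numpy sketch of the estimator family described above: fit a ridge (regularized linear) regression from the source model's features on the target data to the target labels, and score transferability as the negative regularized mean squared error. The regularization weight lam and the exact scaling are assumed details.

import numpy as np

def transferability_score(F_src, y_tgt, lam=1e-2):
    # F_src: (n, d) target-data features from the source model; y_tgt: (n,) labels
    n, d = F_src.shape
    w = np.linalg.solve(F_src.T @ F_src + lam * n * np.eye(d), F_src.T @ y_tgt)
    mse = ((F_src @ w - y_tgt) ** 2).mean()
    return -(mse + lam * (w ** 2).sum())           # higher means better expected transfer

rng = np.random.default_rng(0)
score = transferability_score(rng.standard_normal((200, 32)), rng.standard_normal(200))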
https://arxiv.org/abs/2312.00656
Unsupervised object-centric learning aims to decompose scenes into interpretable object entities, termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them, crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques, (i) an attention-based self-training approach, which distills superior slot-based attention masks from the decoder to the encoder, enhancing object segmentation, and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation, especially with complex real-world images. We provide the implementation code at this https URL .
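A hedged PyTorch sketch of the self-training term in (i): the decoder's slot-attention masks over patches, detached as targets, supervise the encoder's masks with a cross-entropy loss. Shapes and the exact form of the distillation loss are my assumptions.

import torch

def attention_distill_loss(enc_masks, dec_masks, eps=1e-8):
    # both: (B, S, P) slot-over-patch attention, normalized across the S slots
    target = dec_masks.detach()                    # the decoder teaches the encoder
    return -(target * torch.log(enc_masks.clamp_min(eps))).sum(dim=1).mean()

enc = torch.softmax(torch.randn(2, 7, 196), dim=1)   # encoder slot-attention masks
dec = torch.softmax(torch.randn(2, 7, 196), dim=1)   # decoder slot-attention masks
loss = attention_distill_loss(enc, dec)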
https://arxiv.org/abs/2312.00648
Medical image segmentation plays a crucial role in various healthcare applications, enabling accurate diagnosis, treatment planning, and disease monitoring. In recent years, Vision Transformers (ViTs) have emerged as a promising technique for addressing the challenges in medical image segmentation. In medical images, structures are usually highly interconnected and globally distributed. ViTs utilize their multi-scale attention mechanism to model the long-range relationships in the images. However, they do lack image-related inductive bias and translational invariance, potentially impacting their performance. Recently, researchers have come up with various ViT-based approaches that incorporate CNNs in their architectures, known as Hybrid Vision Transformers (HVTs), to capture local correlation in addition to the global information in the images. This survey paper provides a detailed review of the recent advancements in ViTs and HVTs for medical image segmentation. Along with the categorization of ViT- and HVT-based medical image segmentation approaches, we also present a detailed overview of their real-time applications in several medical image modalities. This survey may serve as a valuable resource for researchers, healthcare practitioners, and students in understanding the state-of-the-art approaches for ViT-based medical image segmentation.
https://arxiv.org/abs/2312.00634
Hyperdimensional Computing (HDC) is a brain-inspired and light-weight machine learning method. It has received significant attention in the literature as a candidate to be applied in the wearable internet of things, near-sensor artificial intelligence applications and on-device processing. HDC is computationally less complex than traditional deep learning algorithms and typically achieves moderate to good classification performance. A key aspect that determines the performance of HDC is the encoding of the input data to the hyperdimensional (HD) space. This article proposes a novel light-weight approach relying only on native HD arithmetic vector operations to encode binarized images that preserves similarity of patterns at nearby locations by using point of interest selection and local linear mapping. The method reaches an accuracy of 97.35% on the test set for the MNIST data set and 84.12% for the Fashion-MNIST data set. These results outperform other studies using baseline HDC with different encoding approaches and are on par with more complex hybrid HDC models. The proposed encoding approach also demonstrates a higher robustness to noise and blur compared to the baseline encoding.
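For readers unfamiliar with native HD arithmetic, here is a generic numpy sketch of the encoding pattern (bind = elementwise multiply, bundle = sum then sign): each selected pixel binds a position hypervector with a value hypervector, and the results are bundled into one image hypervector. The paper's point-of-interest selection and local linear mapping are richer than this; the random selection below is a placeholder.

import numpy as np

D = 10_000
rng = np.random.default_rng(0)
rand_hv = lambda n: rng.choice([-1, 1], size=(n, D))   # random bipolar hypervectors

def encode(img, pos_hv, val_hv, points_of_interest):
    acc = np.zeros(D)
    for idx in points_of_interest:                     # e.g. high-gradient pixels
        acc += pos_hv[idx] * val_hv[img.flat[idx]]     # bind position with pixel value
    return np.sign(acc)                                # bundle into one image hypervector

img = (rng.random((28, 28)) > 0.5).astype(int)         # binarized image
hv = encode(img, rand_hv(28 * 28), rand_hv(2),
            rng.choice(28 * 28, size=50, replace=False))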
https://arxiv.org/abs/2312.00454
In this paper, we propose an efficient and high-performance method for partially relevant video retrieval (PRVR), which aims to retrieve untrimmed long videos that contain at least one moment relevant to the input text query. In terms of both efficiency and performance, the overlooked bottleneck of previous studies is the visual encoding of dense frames. This leads researchers to choose lightweight visual backbones, yielding sub-optimal retrieval performance due to the limited capability of their learned visual representations. However, it is undesirable to simply replace them with high-performance, large-scale vision-and-language models (VLMs) because of their low efficiency. To address these issues, instead of dense frames, we focus on super images, which are created by rearranging the video frames in an $N \times N$ grid layout. This reduces the number of visual encodings to $\frac{1}{N^2}$ and compensates for the low efficiency of large-scale VLMs, allowing us to adopt them as powerful encoders. Surprisingly, we discover that with a simple query-image attention trick, VLMs generalize well to super images and demonstrate promising zero-shot performance against SOTA methods with high efficiency. In addition, we propose a fine-tuning approach that incorporates a few trainable modules into the VLM backbones. The experimental results demonstrate that our approaches efficiently achieve the best performance on ActivityNet Captions and TVR.
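Constructing a super image is simple enough to show directly; a minimal numpy sketch, assuming uniform frame sampling: N*N frames are tiled into an N x N grid so the clip costs one visual encoding instead of N^2.

import numpy as np

def make_super_image(frames, N):
    # frames: (T, H, W, 3) video; returns a single (N*H, N*W, 3) super image
    idx = np.linspace(0, len(frames) - 1, N * N).round().astype(int)
    H, W, C = frames.shape[1:]
    grid = frames[idx].reshape(N, N, H, W, C)
    return grid.transpose(0, 2, 1, 3, 4).reshape(N * H, N * W, C)

video = np.random.randint(0, 256, size=(64, 224, 224, 3), dtype=np.uint8)
super_img = make_super_image(video, N=4)               # (896, 896, 3)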
https://arxiv.org/abs/2312.00414
Vision Transformers have received significant attention due to their impressive performance in many vision tasks. While the token mixer or attention block has been studied in great detail, the channel mixer or feature mixing block (FFN or MLP) has not been explored in depth, even though it accounts for the bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace the dense connections and confirm this with a block diagonal MLP structure that improves the accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure and thereby further improve the accuracy, a lightweight, parameter-free, channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. This design of CCA enables gradual feature mixing across channel groups during training, whose contribution decays to zero as the training progresses to convergence. This allows the CCA block to be discarded during inference, thus enabling enhanced performance with no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a gamut of models with different trade-offs between complexity and performance by controlling the block diagonal structure size in the MLP. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially under lower FLOPs regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
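A hedged PyTorch sketch of the two ingredients: a block diagonal MLP realized with grouped 1x1 convolutions, and a parameter-free channel covariance attention branch added during training. Normalization details, the expansion/group sizes, and the decay-to-zero schedule are my assumptions.

import torch
import torch.nn as nn

class BlockDiagMLP(nn.Module):
    def __init__(self, dim, expansion=8, groups=4):
        super().__init__()
        self.fc1 = nn.Conv1d(dim, dim * expansion, 1, groups=groups)  # block diagonal expand
        self.fc2 = nn.Conv1d(dim * expansion, dim, 1, groups=groups)  # block diagonal project
        self.act = nn.GELU()

    def forward(self, x):                          # x: (B, N, C) tokens
        return self.fc2(self.act(self.fc1(x.transpose(1, 2)))).transpose(1, 2)

def channel_covariance_attention(x):               # parameter-free training-time branch
    xc = x - x.mean(dim=1, keepdim=True)           # center over the N tokens
    cov = xc.transpose(1, 2) @ xc / x.shape[1]     # (B, C, C) channel covariance
    return (torch.softmax(cov, dim=-1) @ x.transpose(1, 2)).transpose(1, 2)

x = torch.randn(2, 196, 64)
out = BlockDiagMLP(64)(x) + channel_covariance_attention(x)  # CCA branch dropped at inference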
https://arxiv.org/abs/2312.00412
Musculoskeletal diseases and cognitive impairments in patients lead to difficulties in movement as well as negative effects on their psychological health. Clinical gait analysis, a vital tool for early diagnosis and treatment, traditionally relies on expensive optical motion capture systems. Recent advances in computer vision and deep learning have opened the door to more accessible and cost-effective alternatives. This paper introduces a novel spatio-temporal Transformer network to estimate critical gait parameters from RGB videos captured by a single-view camera. Empirical evaluations on a public dataset of cerebral palsy patients indicate that the proposed framework surpasses current state-of-the-art approaches and show significant improvements in predicting general gait parameters (including Walking Speed, Gait Deviation Index - GDI, and Knee Flexion Angle at Maximum Extension), while utilizing fewer parameters and alleviating the need for manual feature extraction.
https://arxiv.org/abs/2312.00398
In the domain of Mobility Data Science, the intricate task of interpreting models trained on trajectory data, and elucidating the spatio-temporal movement of entities, has persistently posed significant challenges. Conventional XAI techniques, although brimming with potential, frequently overlook the distinct structure and nuances inherent within trajectory data. Observing this deficiency, we introduced a comprehensive framework that harmonizes pivotal XAI techniques: LIME (Local Interpretable Model-agnostic Explanations), SHAP (SHapley Additive exPlanations), Saliency maps, attention mechanisms, direct trajectory visualization, and Permutation Feature Importance (PFI). Unlike conventional strategies that deploy these methods singularly, our unified approach capitalizes on the collective efficacy of these techniques, yielding deeper and more granular insights for models reliant on trajectory data. In crafting this synthesis, we effectively address the multifaceted essence of trajectories, achieving not only amplified interpretability but also a nuanced, contextually rich comprehension of model decisions. To validate and enhance our framework, we undertook a survey to gauge preferences and reception among various user demographics. Our findings underscored a dichotomy: professionals with academic orientations, particularly those in roles like Data Scientist, IT Expert, and ML Engineer, showcased a profound, technical understanding and often exhibited a predilection for amalgamated methods for interpretability. Conversely, end-users or individuals less acquainted with AI and Data Science showcased simpler inclinations, such as bar plots indicating timestep significance or visual depictions pinpointing pivotal segments of a vessel's trajectory.
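Of the techniques combined in the framework, Permutation Feature Importance is the easiest to state in code; a minimal numpy sketch, with the model scorer as a generic placeholder: shuffle one trajectory feature across samples and record how much the score drops.

import numpy as np

def permutation_importance(model_score, X, y, n_repeats=5, seed=0):
    # X: (n, d) trajectory features; model_score(X, y) returns higher-is-better
    rng = np.random.default_rng(seed)
    base = model_score(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])   # break the feature-target link
            drops[j] += (base - model_score(Xp, y)) / n_repeats
    return drops                                   # average score drop per feature

X = np.random.default_rng(1).standard_normal((200, 4))
y = 2.0 * X[:, 0]
imp = permutation_importance(lambda A, t: -((2.0 * A[:, 0] - t) ** 2).mean(), X, y)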
https://arxiv.org/abs/2312.00380
Information retrieval in real-time search presents unique challenges distinct from those encountered in classical web search. These challenges are particularly pronounced due to the rapid change of user search intent, which is influenced by the occurrence and evolution of breaking news events, such as earthquakes, elections, and wars. Previous dense retrieval methods, which primarily focused on static semantic representation, lack the capacity to capture immediate search intent, leading to inferior performance in retrieving the most recent event-related documents in time-sensitive scenarios. To address this issue, this paper expands the query with event information that represents real-time search intent. The event information is then integrated with the query through a cross-attention mechanism, resulting in a time-context query representation. We further enhance the model's capacity for event representation through multi-task training. Since publicly available datasets such as MS-MARCO do not contain any event information on the query side and have few time-sensitive queries, we design an automatic data collection and annotation pipeline to address this issue, which includes ModelZoo-based Coarse Annotation and LLM-driven Fine Annotation processes. In addition, we share training tricks such as two-stage training and hard negative sampling. Finally, we conduct a set of offline experiments on a million-scale production dataset to evaluate our approach and deploy A/B testing in a real online system to verify the performance. Extensive experimental results demonstrate that our proposed approach significantly outperforms existing state-of-the-art baseline methods.
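A minimal PyTorch sketch of the fusion step, under my own assumptions about dimensions and the residual form: query tokens cross-attend to event tokens, producing the time-context query representation.

import torch
import torch.nn as nn

class EventFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, event_tokens):
        fused, _ = self.attn(query_tokens, event_tokens, event_tokens)  # cross-attention
        return self.norm(query_tokens + fused)        # time-context query representation

q = torch.randn(2, 12, 256)                           # encoded query tokens
e = torch.randn(2, 6, 256)                            # encoded event description tokens
ctx_q = EventFusion()(q, e)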
https://arxiv.org/abs/2312.00372
Work on personality detection has tended to incorporate psychological features from different personality models, such as BigFive and MBTI. There are more than 900 psychological features, each of which is helpful for personality detection. However, when used in combination, the application of different calculation standards among these features may result in interference between features calculated using distinct systems, thereby introducing noise and reducing performance. This paper adapts different psychological models in the proposed PsyAttention for personality detection, which can effectively encode psychological features, reducing their number by 85%. In experiments on the BigFive and MBTI models, PsyAttention achieved average accuracies of 65.66% and 86.30%, respectively, outperforming state-of-the-art methods, indicating that it is effective at encoding psychological features.
https://arxiv.org/abs/2312.00293
In this paper, we present a complete and efficient implementation of a knowledge-sharing augmented kinesthetic teaching approach for efficient task execution in robotics. Our augmented kinesthetic teaching method integrates intuitive human feedback, including verbal, gesture, gaze, and physical guidance, to facilitate the extraction of multiple layers of task information including control type, attention direction, input and output type, action state change trigger, etc., enhancing the adaptability and autonomy of robots during task execution. We propose an efficient Programming by Demonstration (PbD) framework for users with limited technical experience to teach the robot in an intuitive manner. The proposed framework provides an interface for such users to teach customized tasks using high-level commands, with the goal of achieving a smoother teaching experience and task execution. This is demonstrated with the sample task of pouring water.
https://arxiv.org/abs/2312.00262
Currently, to further improve visual enjoyment, Ultra-High-Definition (UHD) images are attracting wide attention. Here, UHD images usually refer to images with a resolution greater than or equal to $3840 \times 2160$. However, since imaging equipment is subject to environmental noise or equipment jitter, UHD images are prone to contrast degradation, blurring, low dynamic range, etc. To address these issues, a large number of algorithms for UHD image enhancement have been proposed. In this paper, we introduce the current state of UHD image enhancement from two perspectives: the application field and the technology. In addition, we briefly explore its trends.
https://arxiv.org/abs/2312.00250