The rapid development of AR/VR brings tremendous demand for 3D content. While the widely used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of human-computer interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators' ideas. Precise drawing from multiple views or strategic step-by-step drawing is often required to tackle the challenge, but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only a single free-hand sketch, without requiring multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient real-time inference and a structure-aware adversarial training approach with a Stroke Enhancement Module (SEM) that captures structural information, facilitating the learning of realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrate the effectiveness of our approach, with state-of-the-art (SOTA) performance on both synthetic and real datasets.
https://arxiv.org/abs/2309.13006
Recognizing the prevalence of domain shift as a common challenge in machine learning, various domain generalization (DG) techniques have been developed to enhance the performance of machine learning systems on out-of-distribution (OOD) data. Furthermore, in real-world scenarios, data distributions can gradually change across a sequence of domains. While current methodologies primarily focus on improving model effectiveness within these new domains, they often overlook fairness throughout the learning process. In response, we introduce an innovative framework, Counterfactual Fairness-Aware Domain Generalization with Sequential Autoencoder (CDSAE). This approach effectively separates environmental information and sensitive attributes from the embedded representation of classification features. This concurrent separation not only greatly improves model generalization across diverse and unfamiliar domains but also effectively addresses challenges related to unfair classification. Our strategy is rooted in the principles of causal inference to tackle these dual issues. To examine the intricate relationship between semantic information, sensitive attributes, and environmental cues, we systematically categorize exogenous uncertainty factors into four latent variables: 1) semantic information influenced by sensitive attributes, 2) semantic information unaffected by sensitive attributes, 3) environmental cues influenced by sensitive attributes, and 4) environmental cues unaffected by sensitive attributes. By incorporating fairness regularization, we employ only the semantic information for classification. Empirical validation on synthetic and real-world datasets substantiates the effectiveness of our approach, demonstrating improved accuracy while preserving fairness in the evolving landscape of continuous domains.
https://arxiv.org/abs/2309.13005
Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advancements have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmarks with COCO and LVIS. On COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 on novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot, and the one-shot SoTA by 2.8 AP50. On LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at this https URL.
https://arxiv.org/abs/2309.12969
In this work, we investigate two popular end-to-end automatic speech recognition (ASR) models, namely Connectionist Temporal Classification (CTC) and RNN-Transducer (RNN-T), for offline recognition of voice search queries, with up to 2B model parameters. The encoders of our models use the neural architecture of Google's universal speech model (USM), with additional funnel pooling layers to significantly reduce the frame rate and speed up training and inference. We perform extensive studies on vocabulary size, time reduction strategy, and generalization performance on long-form test sets. Despite the speculation that, as the model size increases, CTC can be as good as RNN-T, which builds label dependency into the prediction, we observe that a 900M RNN-T clearly outperforms a 1.8B CTC and is more tolerant of severe time reduction, although the WER gap can be largely removed by LM shallow fusion.
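To make the time-reduction idea concrete, here is a minimal sketch of funnel-style temporal pooling on an encoder feature sequence; the pooling operator (mean) and the stride are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def funnel_pool(frames: np.ndarray, stride: int) -> np.ndarray:
    """Reduce the frame rate of a feature sequence by average-pooling
    non-overlapping windows of `stride` frames along the time axis.

    frames: (T, D) array of per-frame encoder features. T is truncated
    to a multiple of `stride` before pooling (a common simplification).
    """
    T, D = frames.shape
    T = (T // stride) * stride
    return frames[:T].reshape(T // stride, stride, D).mean(axis=1)
```

With stride 4, a 100-frame sequence becomes 25 frames, cutting the quadratic self-attention cost downstream by roughly 16x.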
https://arxiv.org/abs/2309.12963
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. Recently, a new paradigm has emerged that generates a foreground prediction map (FPM) to achieve pixel-level localization. While existing FPM-based methods use cross-entropy to evaluate the foreground prediction map and to guide the learning of the generator, this paper presents two surprising experimental observations about the object localization learning process. For a trained network, as the foreground mask expands: 1) the cross-entropy converges to zero while the foreground mask covers only part of the object region; 2) the activation value continuously increases until the foreground mask expands to the object boundary. Therefore, to achieve more effective localization, we argue for using the activation value to learn more object regions. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint (AMC) module is designed to facilitate the learning of the generator by suppressing the background activation value. Meanwhile, by using foreground region guidance and an area constraint, BAS can learn the whole region of the object. In the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over baseline methods on the CUB-200-2011 and ILSVRC datasets. In addition, our method also achieves state-of-the-art weakly supervised semantic segmentation performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. Code and models are available at this https URL.
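A minimal sketch of the background-suppression idea, assuming a soft foreground mask and a non-negative activation map; the full AMC objective (foreground guidance, area constraint) involves more terms than this single ratio.

```python
import numpy as np

def background_activation_loss(activation: np.ndarray, fg_mask: np.ndarray) -> float:
    """Background activation suppression, sketched: minimizing this term
    pushes the generator to grow fg_mask until the activation falling
    outside the mask vanishes.

    activation: (H, W) class activation map (non-negative).
    fg_mask:    (H, W) predicted foreground probabilities in [0, 1].
    Returns the ratio of background activation to total activation.
    """
    eps = 1e-8
    bg = (activation * (1.0 - fg_mask)).sum()
    total = activation.sum() + eps
    return float(bg / total)
```

A mask covering the whole activated region drives the term to zero; an empty mask leaves it near one.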
https://arxiv.org/abs/2309.12943
Existing video captioning approaches typically require first sampling video frames from a decoded video and then conducting subsequent processing (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may ignore key information in videos and thus degrade performance. Additionally, redundant information in the sampled frames may result in low efficiency during video captioning inference. Addressing this, we study video captioning from a different perspective, in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors, and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient in inference as smaller and less redundant information is processed. We propose a simple yet effective end-to-end transformer in the compressed domain for video captioning that enables learning from the compressed video. We show that even with a simple design, our method can achieve state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at this https URL.
https://arxiv.org/abs/2309.12867
Recent progress in Automatic Speech Recognition (ASR) has been coupled with a substantial increase in model sizes, which may now contain billions of parameters, leading to slow inference even with adapted hardware. In this context, several ASR models exist in various sizes, with different inference costs leading to different performance levels. Based on the observation that smaller models perform optimally on large parts of testing corpora, we propose to train a decision module that, given an audio sample, selects the smallest model sufficient for a good transcription. We apply our approach to two Whisper models of different sizes. By keeping the decision process computationally efficient, we build a decision module that allows substantial computational savings with reduced performance drops.
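The cascade idea can be sketched as follows; the `small_asr`/`large_asr` callables and the confidence-threshold decision rule are hypothetical stand-ins, since the paper trains a dedicated decision module rather than thresholding the small model's confidence.

```python
from typing import Callable, Tuple

def cascade_transcribe(
    audio,
    small_asr: Callable,   # returns (text, confidence in [0, 1])
    large_asr: Callable,   # returns text; invoked only when needed
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Route `audio` to the smallest sufficient model.

    The decision here is a simple confidence threshold on the small
    model's output; a learned decision module would replace it."""
    text, confidence = small_asr(audio)
    if confidence >= threshold:
        return text, "small"
    return large_asr(audio), "large"
```

Because most samples are handled by the small model, the large model's cost is paid only on the hard minority of inputs.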
https://arxiv.org/abs/2309.12712
Surface defect inspection is a very challenging task in which surface defects usually show weak appearances or exist under complex backgrounds. Most high-accuracy defect detection methods require expensive computation and storage overhead, making them less practical in some resource-constrained defect detection applications. Although some lightweight methods have achieved real-time inference speed with fewer parameters, they show poor detection accuracy in complex defect scenarios. To this end, we develop a Global Context Aggregation Network (GCANet) for lightweight saliency detection of surface defects, built on an encoder-decoder structure. First, we introduce a novel transformer encoder on the top layer of the lightweight backbone, which captures global context information through a novel Depth-wise Self-Attention (DSA) module. The proposed DSA performs element-wise similarity in the channel dimension while maintaining linear complexity. In addition, we introduce a novel Channel Reference Attention (CRA) module before each decoder block to strengthen the representation of multi-level features in the bottom-up path. The proposed CRA exploits the channel correlation between features at different layers to adaptively enhance feature representation. The experimental results on three public defect datasets demonstrate that the proposed network achieves a better trade-off between accuracy and running efficiency compared with 17 other state-of-the-art methods. Specifically, GCANet achieves competitive accuracy (91.79% $F_{\beta}^{w}$, 93.55% $S_\alpha$, and 97.35% $E_\phi$) on SD-saliency-900 while running at 272 fps on a single GPU.
https://arxiv.org/abs/2309.12641
To alleviate expensive human labeling, semi-supervised semantic segmentation employs a few labeled images and an abundance of unlabeled images to predict the pixel-level label map of the same size. Previous methods often adopt co-training using two convolutional networks with the same architecture but different initialization, which fails to capture sufficiently diverse features. This motivates us to use tri-training and develop a triple-view encoder that utilizes encoders with different architectures to derive diverse features, and exploits knowledge distillation to learn the complementary semantics among these encoders. Moreover, existing methods simply concatenate the features from both encoder and decoder, resulting in redundant features that require a large memory cost. This inspires us to devise a dual-frequency decoder that selects the important features by projecting the features from the spatial domain to the frequency domain, where a dual-frequency channel attention mechanism is introduced to model the feature importance. Therefore, we propose a Triple-view Knowledge Distillation framework, termed TriKD, for semi-supervised semantic segmentation, including the triple-view encoder and the dual-frequency decoder. Extensive experiments were conducted on two benchmarks, i.e., Pascal VOC 2012 and Cityscapes, whose results verify the superiority of the proposed method with a good tradeoff between precision and inference speed.
https://arxiv.org/abs/2309.12557
Many mathematical models have been leveraged to design embeddings that represent Knowledge Graph (KG) entities and relations for link prediction and many downstream tasks. These mathematically inspired models are not only highly scalable for inference in large KGs, but also have many explainable advantages in modeling different relation patterns that can be validated through both formal proofs and empirical results. In this paper, we provide a comprehensive overview of the current state of research in KG completion. In particular, we focus on two main branches of KG embedding (KGE) design: 1) distance-based methods and 2) semantic matching-based methods. We discover the connections between recently proposed models and present an underlying trend that might help researchers invent novel and more effective models. Next, we delve into CompoundE and CompoundE3D, which draw inspiration from 2D and 3D affine operations, respectively. They encompass a broad spectrum of techniques, including distance-based and semantic-based methods. We also discuss an emerging approach for KG completion that leverages pre-trained language models (PLMs) and textual descriptions of entities and relations, and offer insights into the integration of KGE methods with PLMs for KG completion.
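The two KGE branches can be illustrated with their canonical scoring functions: TransE for the distance-based family and DistMult for semantic matching. A minimal sketch (embedding dimensions and values are purely illustrative):

```python
import numpy as np

def transe_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """Distance-based scoring: a triple (h, r, t) is plausible when the
    tail embedding t lies near the translated head h + r. Higher (less
    negative) is better."""
    return -float(np.linalg.norm(h + r - t))

def distmult_score(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """Semantic matching: trilinear product <h, r, t>, i.e. a bilinear
    form with a diagonal relation matrix. Higher is better."""
    return float(np.sum(h * r * t))
```

Link prediction then ranks all candidate tails by the chosen score for a given (head, relation) query.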
https://arxiv.org/abs/2309.12501
Compression of a neural network can help speed up both the training and the inference of the network. In this research, we study applying compression using low-rank decomposition on network layers. Our research demonstrates that to acquire a speedup, the compression methodology should be aware of the underlying hardware, as analysis is needed to choose which layers to compress. The advantage of our approach is demonstrated via a case study of compressing ResNet50 and training on full ImageNet-ILSVRC2012. We tested on two different hardware systems, Nvidia V100 and Huawei Ascend910. With hardware-targeted compression, results showed a 5.36% training speedup on Ascend910 and a 15.79% inference speedup on Ascend310, with only a 1% drop in accuracy compared to the original uncompressed model.
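A minimal sketch of the underlying compression primitive, truncated-SVD factorization of a layer's weight matrix; the hardware-aware selection of which layers to compress, which the paper emphasizes, is not shown.

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Replace a dense weight W (m x n) with two factors U_r (m x rank)
    and V_r (rank x n) via truncated SVD, so W @ x becomes U_r @ (V_r @ x).
    Compression pays off only when rank * (m + n) < m * n, and whether it
    yields an actual speedup depends on the target hardware."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * s[:rank]   # fold singular values into the left factor
    V_r = Vt[:rank, :]
    return U_r, V_r
```

If the chosen rank matches the matrix's effective rank, the factorized layer reproduces the original output; smaller ranks trade accuracy for speed.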
https://arxiv.org/abs/2309.12412
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models (LLMs) with limited computation cost. Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. For example, training on a context length of 8192 requires 16x the computational cost in self-attention layers compared to a context length of 2048. In this paper, we speed up the context extension of LLMs in two aspects. On the one hand, although dense global attention is needed during inference, fine-tuning the model can be done effectively and efficiently with sparse local attention. The proposed shift short attention effectively enables context extension, leading to non-trivial computation savings with performance similar to fine-tuning with vanilla attention. In particular, it can be implemented with only two lines of code in training, while being optional in inference. On the other hand, we revisit the parameter-efficient fine-tuning regime for context expansion. Notably, we find that LoRA for context extension works well under the premise of trainable embedding and normalization layers. LongLoRA demonstrates strong empirical results on various tasks with LLaMA2 models from 7B/13B to 70B. LongLoRA extends LLaMA2 7B from a 4k context to 100k, or LLaMA2 70B to 32k, on a single 8x A100 machine. LongLoRA extends models' context while retaining their original architectures, and is compatible with most existing techniques, like FlashAttention-2. In addition, to make LongLoRA practical, we collect a dataset, LongQA, for supervised fine-tuning. It contains more than 3k long-context question-answer pairs.
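As a rough illustration of the shift the abstract alludes to (my reading of the idea, not the paper's exact code): within grouped local attention, half the heads operate on a sequence rolled by half a group size, so information can cross group boundaries even though each group attends only locally.

```python
import numpy as np

def shift_half_heads(x: np.ndarray, group_size: int) -> np.ndarray:
    """x: (seq_len, num_heads, head_dim). Roll the second half of the
    heads along the sequence axis by half a group, so that attention
    computed within fixed-size groups still mixes information across
    neighboring groups."""
    out = x.copy()
    half = x.shape[1] // 2
    out[:, half:, :] = np.roll(out[:, half:, :], -group_size // 2, axis=0)
    return out
```

After group-local attention, the inverse roll restores token order; the per-step cost stays linear in the number of groups rather than quadratic in the full sequence.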
https://arxiv.org/abs/2309.12307
The objective of this work is the effective extraction of spatial and dynamic features for Continuous Sign Language Recognition (CSLR). To accomplish this, we utilise a two-pathway SlowFast network, where each pathway operates at distinct temporal resolutions to separately capture spatial (hand shapes, facial expressions) and dynamic (movements) information. In addition, we introduce two distinct feature fusion methods, carefully designed for the characteristics of CSLR: (1) Bi-directional Feature Fusion (BFF), which facilitates the transfer of dynamic semantics into spatial semantics and vice versa; and (2) Pathway Feature Enhancement (PFE), which enriches dynamic and spatial representations through auxiliary subnetworks, while avoiding the need for extra inference time. As a result, our model further strengthens spatial and dynamic representations in parallel. We demonstrate that the proposed framework outperforms the current state-of-the-art performance on popular CSLR datasets, including PHOENIX14, PHOENIX14-T, and CSL-Daily.
https://arxiv.org/abs/2309.12304
In recent years, datasets of paired audio and captions have enabled remarkable success in automatically generating descriptions for audio clips, namely Automated Audio Captioning (AAC). However, it is labor-intensive and time-consuming to collect a sufficient number of paired audio clips and captions. Motivated by the recent advances in Contrastive Language-Audio Pretraining (CLAP), we propose a weakly-supervised approach to train an AAC model assuming only text data and a pre-trained CLAP model, alleviating the need for paired target data. Our approach leverages the similarity between audio and text embeddings in CLAP. During training, we learn to reconstruct the text from the CLAP text embedding, and during inference, we decode using the audio embeddings. To mitigate the modality gap between the audio and text embeddings, we employ strategies to bridge the gap during the training and inference stages. We evaluate our proposed method on the Clotho and AudioCaps datasets, demonstrating its ability to achieve a relative performance of up to ~$83\%$ compared to fully supervised approaches trained with paired target data.
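One common way to bridge such a modality gap, shown here purely as an illustrative assumption (the paper's actual strategies may differ): perturb the text embedding with Gaussian noise during training, so the decoder learns to tolerate the offset between text and audio embeddings it will see at inference.

```python
import numpy as np

def training_condition(text_emb: np.ndarray, noise_std: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Train-time decoder conditioning (sketch): the CLAP text embedding
    perturbed with isotropic Gaussian noise, then renormalized, since
    contrastive embeddings are typically unit-norm."""
    e = text_emb + noise_std * rng.standard_normal(text_emb.shape)
    return e / (np.linalg.norm(e) + 1e-8)

def inference_condition(audio_emb: np.ndarray) -> np.ndarray:
    """At inference the decoder is conditioned on the (renormalized)
    audio embedding instead of the text embedding."""
    return audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
```

The noise radius acts as a proxy for the expected audio-text embedding distance; both conditions live on the same unit sphere.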
https://arxiv.org/abs/2309.12242
Deep learning's immense capabilities are often constrained by the complexity of its models, leading to an increasing demand for effective sparsification techniques. Bayesian sparsification for deep learning emerges as a crucial approach, facilitating the design of models that are both computationally efficient and competitive in terms of performance across various deep learning applications. The state of the art in Bayesian sparsification of deep neural networks combines structural shrinkage priors on model weights with an approximate inference scheme based on black-box stochastic variational inference. However, model inversion of the full generative model is exceptionally computationally demanding, especially when compared to standard deep learning of point estimates. In this context, we advocate for the use of Bayesian model reduction (BMR) as a more efficient alternative for pruning model weights. As a generalization of the Savage-Dickey ratio, BMR allows a post-hoc elimination of redundant model weights based on the posterior estimates under a straightforward (non-hierarchical) generative model. Our comparative study highlights the computational efficiency and the pruning rate of the BMR method relative to the established stochastic variational inference (SVI) scheme, when applied to the full hierarchical generative model. We illustrate the potential of BMR to prune model parameters across various deep learning architectures, from classical networks like LeNet to modern frameworks such as Vision Transformers and MLP-Mixers.
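For intuition, the Savage-Dickey special case and the BMR evidence update can be written compactly; this is the standard textbook form, not notation taken from the paper itself.

```latex
% Savage-Dickey density ratio: for nested models where the reduced model
% M_0 fixes a weight at zero, the Bayes factor in favour of M_0 is the
% posterior-to-prior density ratio at the point of restriction:
\[
  \mathrm{BF}_{01}
  \;=\; \frac{p(y \mid M_0)}{p(y \mid M_1)}
  \;=\; \frac{p(\theta = 0 \mid y, M_1)}{p(\theta = 0 \mid M_1)} .
\]
% BMR generalizes this: given the approximate posterior q(\theta) under
% the full prior p(\theta), the change in log evidence from swapping in
% a reduced prior \tilde{p}(\theta) follows without refitting:
\[
  \Delta F \;=\; \ln \int q(\theta)\,
      \frac{\tilde{p}(\theta)}{p(\theta)}\, d\theta ,
\]
% so weights whose removal does not lower the evidence (\Delta F \ge 0)
% can be pruned post hoc from a single fitted model.
```

This is what makes BMR cheap relative to refitting under SVI: pruning decisions reduce to an integral over quantities already available from the full model's posterior.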
https://arxiv.org/abs/2309.12095
Prompt Tuning is emerging as a scalable and cost-effective method to fine-tune Pretrained Language Models (PLMs). This study benchmarks the performance and computational efficiency of Prompt Tuning and baseline methods on a multi-label text classification task. This is applied to the use case of classifying companies into an investment firm's proprietary industry taxonomy, supporting its thematic investment strategy. Text-to-text classification with PLMs is frequently reported to outperform classification with a classification head, but has several limitations when applied to a multi-label classification problem where each label consists of multiple tokens: (a) Generated labels may not match any label in the industry taxonomy; (b) During fine-tuning, multiple labels must be provided in an arbitrary order; (c) The model provides a binary decision for each label, rather than an appropriate confidence score. Limitation (a) is addressed by applying constrained decoding using Trie Search, which slightly improves classification performance. All limitations (a), (b), and (c) are addressed by replacing the PLM's language head with a classification head. This improves performance significantly, while also reducing computational costs during inference. The results indicate the continuing need to adapt state-of-the-art methods to domain-specific tasks, even in the era of PLMs with strong generalization abilities.
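Constrained decoding over a fixed label taxonomy can be sketched with a token-level trie; the helper names below are hypothetical, and the integration with an actual decoder's logit masking is omitted.

```python
def build_trie(label_token_ids):
    """Build a trie over the tokenized labels of the taxonomy.
    label_token_ids: iterable of token-id sequences, one per valid label."""
    root = {}
    for seq in label_token_ids:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
        node[None] = {}  # end-of-label marker
    return root

def allowed_next_tokens(trie, prefix):
    """Tokens the decoder may generate after `prefix` so that the output
    is guaranteed to stay on a path toward some valid taxonomy label."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []            # prefix has already left the taxonomy
        node = node[tok]
    return [t for t in node if t is not None]
```

At each generation step, every logit outside `allowed_next_tokens` is masked to negative infinity, which guarantees the generated label matches the taxonomy and addresses limitation (a).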
https://arxiv.org/abs/2309.12075
Autonomous wheel loading involves selecting actions that maximize the total performance over many repetitions. The actions should be well adapted to the current state of the pile and its future states. Selecting the best actions is difficult since the pile states are consequences of previous actions and thus largely unknown. To aid the selection of actions, this paper investigates data-driven models to predict the loaded mass, time, work, and resulting pile state of a loading action given the initial pile state. Deep neural networks were trained on data from over 10,000 simulations to an accuracy of 91-97%, with the pile state represented either by a heightmap or by its slope and curvature. The net outcome of sequential loading actions is predicted by repeating the model inference, at five milliseconds per loading. As errors accumulate during the inferences, long-horizon predictions need to be combined with a physics-based model.
https://arxiv.org/abs/2309.12016
This work visits the topic of jointly parsing constituency and dependency trees, i.e., to produce compatible constituency and dependency trees simultaneously for input sentences, which is attractive considering that the two types of trees are complementary in representing syntax. Compared with previous works, we make progress in four aspects: (1) adopting a much more efficient decoding algorithm, (2) exploring joint modeling at the training phase, instead of only at the inference phase, (3) proposing high-order scoring components for constituent-dependency interaction, (4) gaining more insights via in-depth experiments and analysis.
https://arxiv.org/abs/2309.11888
Existing NeRF models for satellite images suffer from slow speeds, mandatory solar information as input, and limitations in handling large satellite images. In response, we present SatensoRF, which significantly accelerates the entire process while employing fewer parameters for satellite imagery of large size. Besides, we observed that the prevalent assumption of Lambertian surfaces in neural radiance fields falls short for vegetative and aquatic elements. In contrast to the traditional hierarchical MLP-based scene representation, we have chosen a multiscale tensor decomposition approach for color, volume density, and auxiliary variables to model the light field with specular color. Additionally, to rectify inconsistencies in multi-date imagery, we incorporate total variation loss to restore the density tensor field and treat the problem as a denoising task. To validate our approach, we conducted assessments of SatensoRF using subsets from the SpaceNet multi-view dataset, which includes both multi-date and single-date multi-view RGB images. Our results clearly demonstrate that SatensoRF surpasses the state-of-the-art Sat-NeRF series in terms of novel view synthesis performance. Significantly, SatensoRF requires fewer parameters for training, resulting in faster training and inference speeds and reduced computational demands.
https://arxiv.org/abs/2309.11767
Whisper is a powerful automatic speech recognition (ASR) model. Nevertheless, its zero-shot performance on low-resource speech requires further improvement. Child speech, a representative type of low-resource speech, is leveraged for adaptation. Recently, parameter-efficient fine-tuning (PEFT) in NLP was shown to be comparable to, and even better than, full fine-tuning, while only needing to tune a small set of trainable parameters. However, current PEFT methods have not been well examined for their effectiveness on Whisper. In this paper, only parameter-composition types of PEFT approaches, such as LoRA and BitFit, are investigated, as they do not bring extra inference costs. Different popular PEFT methods are examined. In particular, we compare LoRA and AdaLoRA and find that the learnable rank coefficient is a good design. Inspired by the sparse rank distribution allocated by AdaLoRA, we propose a novel PEFT approach, Sparsely Shared LoRA (S2-LoRA). The two low-rank decomposed matrices are globally shared, and each weight matrix only has to maintain its specific rank coefficients, which are constrained to be sparse. Experiments on low-resource Chinese child speech show that, with much fewer trainable parameters, S2-LoRA can achieve in-domain adaptation performance comparable to AdaLoRA and exhibit better generalization on out-of-domain data. In addition, the rank distribution automatically learned by S2-LoRA is found to have patterns similar to AdaLoRA's allocation.
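A minimal sketch of the sharing scheme as described: two globally shared low-rank factors, with only a sparse per-layer coefficient vector selecting which rank components a given layer uses. Shapes and names are illustrative.

```python
import numpy as np

def s2_lora_delta(A: np.ndarray, B: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Weight update for one layer under Sparsely Shared LoRA (sketch).

    A (r x d_in) and B (d_out x r) are shared globally across all adapted
    layers; only `coeffs` (length r, mostly zero) is layer-specific,
    selecting which of the shared rank-1 directions this layer uses:
        delta_W = B diag(coeffs) A
    """
    return B @ np.diag(coeffs) @ A
```

Because A and B are shared, per-layer storage shrinks from `r * (d_in + d_out)` parameters to just the `r` (sparse) coefficients.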
https://arxiv.org/abs/2309.11756