We introduce AdaMoLE, a novel method for fine-tuning large language models (LLMs) through an Adaptive Mixture of Low-Rank Adaptation (LoRA) Experts. Moving beyond conventional methods that employ a static top-k strategy for activating experts, AdaMoLE dynamically adjusts the activation threshold using a dedicated threshold network, adaptively responding to the varying complexities of different tasks. By replacing a single LoRA in a layer with multiple LoRA experts and integrating a gating function with the threshold mechanism, AdaMoLE effectively selects and activates the most appropriate experts based on the input context. Our extensive evaluations across a variety of commonsense reasoning and natural language processing tasks show that AdaMoLE exceeds baseline performance. This enhancement highlights the advantages of AdaMoLE's adaptive selection of LoRA experts, improving model effectiveness without a corresponding increase in the expert count. The experimental validation not only confirms AdaMoLE as a robust approach for enhancing LLMs but also suggests valuable directions for future research in adaptive expert selection mechanisms, potentially broadening the scope for optimizing model performance across diverse language processing tasks.
https://arxiv.org/abs/2405.00361
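The core idea above — activating only the experts whose gate score clears an input-dependent threshold — can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual architecture: the weight shapes and the single-sigmoid threshold network are assumptions for demonstration.

```python
import numpy as np

def adaptive_moe_gate(x, gate_w, thresh_w, thresh_b=0.0):
    """Select experts whose gate probability exceeds an input-dependent
    threshold, then renormalize the survivors into mixture weights."""
    logits = x @ gate_w                        # one logit per expert
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                       # softmax over experts
    # threshold network: sigmoid keeps tau in (0, 1); dividing by the
    # expert count guarantees the top expert always clears the bar,
    # since the softmax maximum is at least 1/num_experts
    tau = 1.0 / (1.0 + np.exp(-(x @ thresh_w + thresh_b))) / len(probs)
    weights = np.where(probs >= tau, probs, 0.0)
    return weights / weights.sum()             # weights for active experts

rng = np.random.default_rng(0)
x = rng.normal(size=8)                         # hypothetical input features
gate_w = rng.normal(size=(8, 4))               # 4 LoRA experts
thresh_w = rng.normal(size=8)
w = adaptive_moe_gate(x, gate_w, thresh_w)
```

The layer's output would then be the base projection plus the `w`-weighted sum of the active LoRA experts' outputs; a low threshold activates many experts for hard inputs, a high one keeps the computation sparse.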
Large language models (LLMs) suffer from low efficiency due to the mismatch between the requirements of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded to the GPU cache through its limited memory bandwidth for computation, but only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer instead of computation. Recently, parallel decoding, a class of speculative decoding algorithms, has become increasingly popular and has demonstrated impressive efficiency improvements in generation. It introduces extra decoding heads to large models, enabling them to predict multiple subsequent tokens simultaneously and verify these candidate continuations in a single decoding step. However, this approach deviates from the training objective of next-token prediction used during pre-training, resulting in a low hit rate for candidate tokens. In this paper, we propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process. This enhancement improves the hit rate of speculators and thus boosts overall efficiency. Clover transmits the sequential knowledge from pre-speculated tokens via the Regressive Connection, then employs an Attention Decoder to integrate these speculated tokens. Additionally, Clover incorporates an Augmenting Block that modifies the hidden states to better align with the purpose of speculative generation rather than next-token prediction. The experimental results demonstrate that Clover outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, and exceeds the previously top-performing method, Medusa, by up to 37% on Baichuan-Small and 57% on Baichuan-Large.
https://arxiv.org/abs/2405.00263
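The draft-and-verify loop that all speculative decoding variants share — accept draft tokens while they match the target model, then take one corrected token from the target — can be sketched generically. This is the standard greedy verification scheme, not Clover's Regressive Connection or Attention Decoder; the toy next-token function stands in for a real model.

```python
def verify_draft(target_next_token, prefix, draft_tokens):
    """Generic speculative-decoding verification: accept draft tokens
    left-to-right while they match the target model's greedy choice,
    then append the target's own token at the first miss (or one bonus
    token if every draft hits)."""
    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:
        expected = target_next_token(ctx)
        if t == expected:
            accepted.append(t)     # draft hit: token accepted for free
            ctx.append(t)
        else:
            accepted.append(expected)  # miss: target's token replaces it
            return accepted
    accepted.append(target_next_token(ctx))  # all hits: one bonus token
    return accepted

# toy deterministic "model": next token is previous token + 1 (mod 10)
nxt = lambda ctx: (ctx[-1] + 1) % 10
out = verify_draft(nxt, [1, 2, 3], [4, 5, 9])   # third draft misses
```

One call to the target model per accepted token is still implied, but in practice all positions are scored in a single batched forward pass — that batching, combined with a high draft hit rate, is where the speedup comes from.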
In the evolving landscape of computer vision, foundation models have emerged as pivotal tools, exhibiting exceptional adaptability to a myriad of tasks. Among these, the Segment Anything Model (SAM) by Meta AI has distinguished itself in image segmentation. However, SAM, like its counterparts, encounters limitations in specific niche applications, prompting a quest for enhancement strategies that do not compromise its inherent capabilities. This paper introduces ASAM, a novel methodology that amplifies SAM's performance through adversarial tuning. We harness the potential of natural adversarial examples, inspired by their successful implementation in natural language processing. By utilizing a stable diffusion model, we augment a subset (1%) of the SA-1B dataset, generating adversarial instances that are more representative of natural variations rather than conventional imperceptible perturbations. Our approach maintains the photorealism of adversarial examples and ensures alignment with original mask annotations, thereby preserving the integrity of the segmentation task. The fine-tuned ASAM demonstrates significant improvements across a diverse range of segmentation tasks without necessitating additional data or architectural modifications. The results of our extensive evaluations confirm that ASAM establishes new benchmarks in segmentation tasks, thereby contributing to the advancement of foundational models in computer vision. Our project page is in this https URL.
https://arxiv.org/abs/2405.00256
Table detection within document images is a crucial task in document processing, involving the identification and localization of tables. Recent strides in deep learning have substantially improved the accuracy of this task, but it still heavily relies on large labeled datasets for effective training. Several semi-supervised approaches have emerged to overcome this challenge, often employing CNN-based detectors with anchor proposals and post-processing techniques like non-maximal suppression (NMS). However, recent advancements in the field have shifted the focus towards transformer-based techniques, eliminating the need for NMS and emphasizing object queries and attention mechanisms. Previous research has focused on two key areas to improve transformer-based detectors: refining the quality of object queries and optimizing attention mechanisms. However, increasing object queries can introduce redundancy, while adjustments to the attention mechanism can increase complexity. To address these challenges, we introduce a semi-supervised approach employing SAM-DETR, a novel method for precisely aligning object queries with target features. Our approach demonstrates remarkable reductions in false positives and substantial improvements in table detection performance, particularly in complex documents characterized by diverse table structures. This work enables more efficient and accurate table detection in semi-supervised settings.
https://arxiv.org/abs/2405.00187
In recent years, zero-shot learning has attracted the focus of many researchers, due to its flexibility and generality. Many approaches have been proposed to achieve the zero-shot classification of point clouds for 3D object understanding, following the schema of CLIP. However, in the real world, the point clouds could be extremely sparse, dramatically limiting the effectiveness of the 3D point cloud encoders and resulting in the misalignment of point cloud features and text embeddings. To adapt the point cloud encoders to extremely sparse point clouds without re-running the pre-training procedure, which could be time-consuming and expensive, in this work we propose an unsupervised model adaptation approach to enhance the point cloud encoder for extremely sparse point clouds. We propose a novel fused-cross attention layer that expands the pre-trained self-attention layer with additional learnable tokens and attention blocks, which effectively modifies the point cloud features while maintaining the alignment between point cloud features and text embeddings. We also propose a complementary learning-based self-distillation schema that encourages the modified features to be pulled apart from the irrelevant text embeddings without overfitting the feature space to the observed text embeddings. Extensive experiments demonstrate that the proposed approach effectively increases the zero-shot capability on extremely sparse point clouds and outperforms other state-of-the-art model adaptation approaches.
https://arxiv.org/abs/2404.19639
Detecting diseases from social media has diverse applications, such as public health monitoring and disease spread detection. While language models (LMs) have shown promising performance in this domain, there remains ongoing research aimed at refining their discriminating representations. In this paper, we propose a novel method that integrates Contrastive Learning (CL) with language modeling to address this challenge. Our approach introduces a self-augmentation method, wherein hidden representations of the model are augmented with their own representations. This method comprises two branches: the first branch, a traditional LM, learns features specific to the given data, while the second branch incorporates augmented representations from the first branch to encourage generalization. CL further refines these representations by pulling pairs of original and augmented versions closer while pushing other samples away. We evaluate our method on three NLP datasets encompassing binary, multi-label, and multi-class classification tasks involving social media posts related to various diseases. Our approach demonstrates notable improvements over traditional fine-tuning methods, achieving up to a 2.48% increase in F1-score compared to baseline approaches and a 2.1% enhancement over state-of-the-art methods.
https://arxiv.org/abs/2405.01597
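The contrastive objective described above — pulling each (original, augmented) representation pair together while pushing other samples away — is typically an NT-Xent/InfoNCE-style loss. Below is a minimal NumPy sketch of that standard loss, not the paper's exact formulation; the temperature value and batch layout are illustrative assumptions.

```python
import numpy as np

def nt_xent(z1, z2, temperature=0.5):
    """Simplified NT-Xent contrastive loss: z1[i] and z2[i] form a
    positive pair; every other sample in the batch is a negative."""
    z = np.concatenate([z1, z2], axis=0)               # (2N, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine-similarity space
    sim = z @ z.T / temperature                        # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    n = len(z1)
    # the positive of row i is row i+n (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    loss = -(sim[np.arange(2 * n), pos] - logsumexp)   # cross-entropy per row
    return loss.mean()

rng = np.random.default_rng(1)
z1 = rng.normal(size=(4, 8))                  # "original" hidden states
z2 = z1 + 0.01 * rng.normal(size=(4, 8))      # "self-augmented" copies
l_aligned = nt_xent(z1, z2)
l_shuffled = nt_xent(z1, z2[::-1].copy())     # deliberately wrong pairing
```

When the pairing is correct the loss is markedly lower than with shuffled positives, which is exactly the gradient signal that tightens the discriminating representations.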
Diffusion models have emerged as effective tools for generating diverse and high-quality content. However, their capability in high-resolution image generation, particularly for panoramic images, still faces challenges such as visible seams and incoherent transitions. In this paper, we propose TwinDiffusion, an optimized framework designed to address these challenges through two key innovations: Crop Fusion for quality enhancement and Cross Sampling for efficiency optimization. We introduce a training-free optimizing stage to refine the similarity of the adjacent image areas, as well as an interleaving sampling strategy to yield dynamic patches during the cropping process. A comprehensive evaluation is conducted to compare TwinDiffusion with the existing methods, considering factors including coherence, fidelity, compatibility, and efficiency. The results demonstrate the superior performance of our approach in generating seamless and coherent panoramas, setting a new standard in quality and efficiency for panoramic image generation.
https://arxiv.org/abs/2404.19475
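Panorama pipelines like the one above start from overlapping crops that must be reconciled into one canvas. Below is the standard overlap-averaging baseline (MultiDiffusion-style) that crop-based refinements build on — a 1-D sketch for clarity, not the paper's Crop Fusion module itself.

```python
import numpy as np

def fuse_overlapping_crops(crops, positions, width, crop_w):
    """Average horizontally tiled 1-D crops into one panorama row.
    Overlap regions receive contributions from several crops, so a
    count map tracks how many crops touched each position."""
    canvas = np.zeros(width)
    count = np.zeros(width)
    for crop, x0 in zip(crops, positions):
        canvas[x0:x0 + crop_w] += crop     # accumulate crop contribution
        count[x0:x0 + crop_w] += 1         # remember coverage
    return canvas / np.maximum(count, 1)   # mean where crops overlap

crops = [np.full(4, 1.0), np.full(4, 3.0)]   # two constant toy crops
pano = fuse_overlapping_crops(crops, positions=[0, 2], width=6, crop_w=4)
```

Plain averaging is what produces the visible seams the paper targets: in the example the overlap snaps to the midpoint 2.0 while the flanks stay at 1.0 and 3.0, leaving a discontinuity that a quality-refinement stage must smooth.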
Ensuring intelligible speech communication for hearing assistive devices in low-latency scenarios presents significant challenges in terms of speech enhancement, coding, and transmission. In this paper, we propose novel solutions for low-latency joint speech transmission and enhancement, leveraging deep neural networks (DNNs). Our approach integrates two state-of-the-art DNN architectures for low-latency speech enhancement and low-latency analog joint source-channel-based transmission, creating a combined low-latency system and jointly training both systems in an end-to-end manner. Given the computational demands of the enhancement system, this ordering is suitable when high computational power is unavailable at the decoder, as in hearing assistive devices. The proposed system enables the configuration of total latency, achieving high performance even at latencies as low as 3 ms, which is typically challenging to attain. The simulation results provide compelling evidence that a joint enhancement and transmission system is superior to a simple concatenation system in diverse settings, encompassing various wireless channel conditions, latencies, and background noise scenarios.
https://arxiv.org/abs/2404.19375
A vision-based drone-to-drone detection system is crucial for various applications like collision avoidance, countering hostile drones, and search-and-rescue operations. However, detecting drones presents unique challenges, including small object sizes, distortion, occlusion, and real-time processing requirements. Current methods integrating multi-scale feature fusion and temporal information have limitations in handling extreme blur and minuscule objects. To address this, we propose a novel coarse-to-fine detection strategy based on vision transformers. We evaluate our approach on three challenging drone-to-drone detection datasets, achieving F1 score enhancements of 7%, 3%, and 1% on the FL-Drones, AOT, and NPS-Drones datasets, respectively. Additionally, we demonstrate real-time processing capabilities by deploying our model on an edge-computing device. Our code will be made publicly available.
https://arxiv.org/abs/2404.19276
Health literacy is crucial to supporting good health and is a major national goal. Audio delivery is an increasingly popular channel for consuming health information. In this study, we evaluate the effect of audio enhancements, in the form of information emphasis and pauses, on health texts of varying difficulty, measuring health information comprehension and retention. We produced audio snippets from difficult and easy texts and conducted the study on Amazon Mechanical Turk (AMT). Our findings suggest that emphasis matters for both information comprehension and retention. When there is no added pause, emphasizing significant information can lower the perceived difficulty of both difficult and easy texts. For difficult texts, comprehension is higher with correctly placed emphasis (54%) than without added emphasis (50%). Adding a pause lowers perceived difficulty and can improve retention but adversely affects information comprehension.
https://arxiv.org/abs/2404.19119
This paper introduces YOLOv8-TO, a novel approach for reverse engineering of topology-optimized structures into interpretable geometric parameters using the YOLOv8 instance segmentation model. Density-based topology optimization methods require post-processing to convert the optimal density distribution into a parametric representation for design exploration and integration with CAD tools. Traditional methods such as skeletonization struggle with complex geometries and require manual intervention. YOLOv8-TO addresses these challenges by training a custom YOLOv8 model to automatically detect and reconstruct structural components from binary density distributions. The model is trained on a diverse dataset of both optimized and random structures generated using the Moving Morphable Components method. A custom reconstruction loss function based on the dice coefficient of the predicted geometry is used to train the new regression head of the model via self-supervised learning. The method is evaluated on test sets generated from different topology optimization methods, including out-of-distribution samples, and compared against a skeletonization approach. Results show that YOLOv8-TO significantly outperforms skeletonization in reconstructing visually and structurally similar designs. The method showcases an average improvement of 13.84% in the Dice coefficient, with peak enhancements reaching 20.78%. The method demonstrates good generalization to complex geometries and fast inference times, making it suitable for integration into design workflows using regular workstations. Limitations include the sensitivity to non-max suppression thresholds. YOLOv8-TO represents a significant advancement in topology optimization post-processing, enabling efficient and accurate reverse engineering of optimized structures for design exploration and manufacturing.
https://arxiv.org/abs/2404.18763
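The Dice coefficient used above both as the reconstruction-loss basis and as the evaluation metric has a simple closed form: twice the intersection over the sum of the two mask sizes. A minimal sketch for binary masks (the epsilon smoothing term is a common convention, assumed here rather than taken from the paper):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks:
    2 * |A ∩ B| / (|A| + |B|), smoothed by eps for empty masks."""
    pred = np.asarray(pred).astype(bool)
    target = np.asarray(target).astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

d = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])   # one pixel overlaps
```

For training a regression head, the same formula is usually applied to soft (probabilistic) masks so it stays differentiable; the hard-mask version above is what the reported evaluation numbers (e.g. the 13.84% average improvement) would be computed with.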
Recent developments in neural rendering techniques have greatly enhanced the rendering of photo-realistic 3D scenes across both academic and commercial fields. The latest method, known as 3D Gaussian Splatting (3D-GS), has set new benchmarks for rendering quality and speed. Nevertheless, the limitations of 3D-GS become pronounced in synthesizing new viewpoints, especially for views that deviate greatly from those seen during training. Additionally, issues such as dilation and aliasing arise when zooming in or out. These challenges can all be traced back to a single underlying issue: insufficient sampling. In our paper, we present a bootstrapping method that significantly addresses this problem. This approach employs a diffusion model to enhance the rendering of novel views using trained 3D-GS, thereby streamlining the training process. Our results indicate that bootstrapping effectively reduces artifacts and yields clear improvements on the evaluation metrics. Furthermore, we show that our method is versatile and can be easily integrated, allowing various 3D reconstruction projects to benefit from our approach.
https://arxiv.org/abs/2404.18669
Multi-task learning (MTL) is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to single-task learning (STL), MTL offers a suite of benefits that enhance both the training process and inference efficiency. MTL's key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including computer vision (CV), natural language processing (NLP), recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pretrained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into various specialized strategies within each category. Furthermore, the survey reveals how MTL evolves from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and task-agnostic training, along with the capacity for zero-shot learning (ZSL), which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research in a broad manner. This project is publicly available at this https URL.
https://arxiv.org/abs/2404.18961
Recently, deep learning models have achieved excellent performance in hyperspectral image (HSI) classification. Among the many deep models, Transformer has gradually attracted interest for its excellence in modeling the long-range dependencies of spatial-spectral features in HSI. However, the self-attention mechanism gives Transformer quadratic computational complexity, making it heavier than other models and limiting its adoption in HSI processing. Fortunately, the recently emerging state space model-based Mamba shows great computational efficiency while achieving the modeling power of Transformers. Therefore, in this paper, we make a preliminary attempt to apply Mamba to HSI classification, leading to the proposed spectral-spatial Mamba (SS-Mamba). Specifically, the proposed SS-Mamba mainly consists of a spectral-spatial token generation module and several stacked spectral-spatial Mamba blocks. Firstly, the token generation module converts any given HSI cube into spatial and spectral token sequences. These tokens are then sent to stacked spectral-spatial Mamba blocks (SS-MB). Each SS-MB block consists of two basic Mamba blocks and a spectral-spatial feature enhancement module. The spatial and spectral tokens are processed separately by the two basic Mamba blocks, respectively. Besides, the feature enhancement module modulates spatial and spectral tokens using the HSI sample's center region information. In this way, the spectral and spatial tokens cooperate with each other and achieve information fusion within each block. The experimental results conducted on widely used HSI datasets reveal that the proposed model achieves competitive results compared with the state-of-the-art methods. The Mamba-based method opens a new window for HSI classification.
https://arxiv.org/abs/2404.18401
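The token generation step above — turning one HSI cube into a spatial sequence (one token per pixel) and a spectral sequence (one token per band) — reduces to two reshapes. This is a hypothetical sketch of that layout, with the exact token dimensions being assumptions; the paper's module would additionally apply learned projections.

```python
import numpy as np

def hsi_to_tokens(cube):
    """Split an HSI cube of shape (H, W, B) into:
    - spatial tokens:  (H*W, B)  — each pixel's full spectrum
    - spectral tokens: (B, H*W)  — each band's full spatial map"""
    h, w, b = cube.shape
    spatial_tokens = cube.reshape(h * w, b)
    spectral_tokens = cube.transpose(2, 0, 1).reshape(b, h * w)
    return spatial_tokens, spectral_tokens

cube = np.arange(12, dtype=float).reshape(2, 2, 3)  # tiny 2x2 cube, 3 bands
spatial, spectral = hsi_to_tokens(cube)
```

Each sequence then feeds its own Mamba block, which is what lets a linear-time state space model scan the pixel dimension and the band dimension independently before the feature enhancement module fuses them.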
This paper proposes a photorealistic real-time dense 3D mapping system that utilizes a learning-based image enhancement method and mesh-based map representation. Due to the characteristics of the underwater environment, where problems such as hazing and low contrast occur, it is hard to apply conventional simultaneous localization and mapping (SLAM) methods. Furthermore, for sensitive tasks like inspecting cracks, photorealistic mapping is very important. However, the behavior of Autonomous Underwater Vehicle (AUV) is computationally constrained. In this paper, we utilize a neural network-based image enhancement method to improve pose estimation and mapping quality and apply a sliding window-based mesh expansion method to enable lightweight, fast, and photorealistic mapping. To validate our results, we utilize real-world and indoor synthetic datasets. We performed qualitative validation with the real-world dataset and quantitative validation by modeling images from the indoor synthetic dataset as underwater scenes.
https://arxiv.org/abs/2404.18395
With the evolution of Text-to-Image (T2I) models, the quality defects of AI-Generated Images (AIGIs) pose a significant barrier to their widespread adoption. In terms of both perception and alignment, existing models cannot always guarantee high-quality results. To mitigate this limitation, we introduce G-Refine, a general image quality refiner designed to enhance low-quality images without compromising the integrity of high-quality ones. The model is composed of three interconnected modules: a perception quality indicator, an alignment quality indicator, and a general quality enhancement module. Based on the mechanisms of the Human Visual System (HVS) and syntax trees, the first two indicators can respectively identify the perception and alignment deficiencies, and the last module can apply targeted quality enhancement accordingly. Extensive experimentation reveals that when compared to alternative optimization methods, AIGIs after G-Refine outperform in 10+ quality metrics across 4 databases. This improvement significantly contributes to the practical application of contemporary T2I models, paving the way for their broader adoption. The code will be released on this https URL.
https://arxiv.org/abs/2404.18343
With the escalating frequency of floods posing persistent threats to human life and property, satellite remote sensing has emerged as an indispensable tool for monitoring flood hazards. SpaceNet8 offers a unique opportunity to leverage cutting-edge artificial intelligence technologies to assess these hazards. A significant contribution of this research is its application of Apache Sedona, an advanced platform specifically designed for the efficient and distributed processing of large-scale geospatial data. This platform aims to enhance the efficiency of error analysis, a critical aspect of improving flood damage detection accuracy. Based on Apache Sedona, we introduce a novel approach that addresses the challenges associated with inaccuracies in flood damage detection. This approach involves the retrieval of cases from historical flood events, the adaptation of these cases to current scenarios, and the revision of the model based on clustering algorithms to refine its performance. Through the replication of both the SpaceNet8 baseline and its top-performing models, we embark on a comprehensive error analysis. This analysis reveals several main sources of inaccuracies. To address these issues, we employ data visual interpretation and histogram equalization techniques, resulting in significant improvements in model metrics. After these enhancements, our indicators show a notable improvement, with precision up by 5%, F1 score by 2.6%, and IoU by 4.5%. This work highlights the importance of advanced geospatial data processing tools, such as Apache Sedona. By improving the accuracy and efficiency of flood detection, this research contributes to safeguarding public safety and strengthening infrastructure resilience in flood-prone areas, making it a valuable addition to the field of remote sensing and disaster management.
https://arxiv.org/abs/2404.18235
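One of the two fixes the abstract credits with the metric gains is histogram equalization. The classic 8-bit grayscale version — remapping intensities through the normalized cumulative histogram — is sketched below; this is the textbook algorithm, not the authors' specific preprocessing pipeline.

```python
import numpy as np

def histogram_equalize(img):
    """Classic histogram equalization for an 8-bit grayscale image:
    build the cumulative histogram and use it as a lookup table so
    the output intensity distribution is approximately uniform."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    if cdf[-1] == cdf_min:                 # constant image: nothing to stretch
        return img.copy()
    lut = np.clip(
        np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255), 0, 255
    ).astype(np.uint8)
    return lut[img]                        # remap every pixel through the LUT

# low-contrast toy image: four gray levels packed into 100..103
img = np.array([[100, 101], [102, 103]], dtype=np.uint8)
out = histogram_equalize(img)
```

On flood imagery this stretches the narrow intensity range of hazy or shadowed scenes across the full 0-255 span, which is why it can lift damage-detection metrics without touching the model.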
Video frame interpolation, the process of synthesizing intermediate frames between sequential video frames, has made remarkable progress with the use of event cameras. These sensors, with microsecond-level temporal resolution, fill information gaps between frames by providing precise motion cues. However, contemporary Event-Based Video Frame Interpolation (E-VFI) techniques often neglect the fact that event data primarily supply high-confidence features at scene edges during multi-modal feature fusion, thereby diminishing the role of event signals in optical flow (OF) estimation and warping refinement. To address this overlooked aspect, we introduce an end-to-end E-VFI learning method (referred to as EGMR) to efficiently utilize edge features from event signals for motion flow and warping enhancement. Our method incorporates an Edge Guided Attentive (EGA) module, which rectifies estimated video motion through attentive aggregation based on the local correlation of multi-modal features in a coarse-to-fine strategy. Moreover, given that event data can provide accurate visual references at scene edges between consecutive frames, we introduce a learned visibility map derived from event data to adaptively mitigate the occlusion problem in the warping refinement process. Extensive experiments on both synthetic and real datasets show the effectiveness of the proposed approach, demonstrating its potential for higher quality video frame interpolation.
https://arxiv.org/abs/2404.18156
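The EGMR abstract above describes a learned visibility map that adaptively weights the warped frames to mitigate occlusion. As a toy illustration under our own assumptions (the function name `blend_warped` and the per-pixel convex-combination form are ours, not the paper's implementation), the idea can be sketched as:

```python
def blend_warped(warp_prev, warp_next, visibility):
    """Occlusion-aware fusion of two warped frames: where the
    visibility weight is high, trust the frame warped from the
    previous keyframe; where it is low, trust the next one.
    All inputs are equally sized 2-D grids (lists of rows)."""
    out = []
    for wp_row, wn_row, v_row in zip(warp_prev, warp_next, visibility):
        out.append([v * wp + (1.0 - v) * wn
                    for wp, wn, v in zip(wp_row, wn_row, v_row)])
    return out

# A pixel occluded in the next frame (v = 1.0) keeps the previous
# frame's value; a mutually visible pixel (v = 0.5) is averaged.
prev_w = [[10.0, 20.0]]
next_w = [[30.0, 40.0]]
vis    = [[1.0, 0.5]]
print(blend_warped(prev_w, next_w, vis))  # → [[10.0, 30.0]]
```

In the paper's setting the visibility map is predicted from event data at scene edges, where the sensor provides its highest-confidence cues; here it is simply given as input.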
Generative AI (GenAI) has witnessed remarkable progress in recent years and demonstrated impressive performance on various generation tasks in domains such as computer vision and computational design. Many researchers have attempted to integrate GenAI into visualization frameworks, leveraging its superior generative capacity for different operations. Concurrently, recent major breakthroughs in GenAI, such as diffusion models and large language models, have drastically increased the potential of GenAI4VIS. From a technical perspective, this paper looks back on previous visualization studies leveraging GenAI and discusses the challenges and opportunities for future research. Specifically, we cover the applications of different types of GenAI methods, including sequence, tabular, spatial, and graph generation techniques, to visualization tasks, which we summarize into four major stages: data enhancement, visual mapping generation, stylization, and interaction. For each specific visualization sub-task, we illustrate the typical data and concrete GenAI algorithms, aiming to provide an in-depth understanding of state-of-the-art GenAI4VIS techniques and their limitations. Furthermore, based on the survey, we discuss three major aspects of challenges and research opportunities: evaluation, datasets, and the gap between end-to-end GenAI and generative algorithms. By summarizing different generation algorithms together with their current applications and limitations, this paper endeavors to provide useful insights for future GenAI4VIS research.
https://arxiv.org/abs/2404.18144
In this paper, we reveal the two sides of data augmentation: gains in closed-set recognition correlate with a significant decrease in open-set recognition. Through empirical investigation, we find that multi-sample-based augmentations reduce feature discriminability, thereby weakening the open-set criteria. Although knowledge distillation could restore the impaired features via imitation, mixed features with ambiguous semantics hinder the distillation. To this end, we propose an asymmetric distillation framework that feeds the teacher model extra raw data to enlarge the teacher's benefit. Moreover, a joint mutual-information loss and a selective relabeling strategy are employed to alleviate the influence of hard mixed samples. Our method successfully mitigates the decline in open-set recognition, outperforming state-of-the-art methods by 2%~3% AUROC on the Tiny-ImageNet dataset, and experiments on the large-scale ImageNet-21K dataset demonstrate its generalization.
https://arxiv.org/abs/2404.19527
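The asymmetric-distillation idea above can be sketched in a few lines: the teacher scores the two raw images while the student only sees their mixup, and the student is pulled toward the mixup-weighted combination of the teacher's raw-input distributions. This is our own minimal reading of the abstract, not the paper's implementation; the function names and the plain KL objective (omitting the mutual-information loss and relabeling) are assumptions:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, with optional
    temperature smoothing as is standard in distillation."""
    exps = [math.exp(x / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def asymmetric_distill_loss(teacher_logits_raw, student_logits_mixed,
                            lam, temperature=2.0):
    """The asymmetry: the teacher sees the two *raw* images (extra data),
    the student sees only their mixup with coefficient lam."""
    t_a, t_b = teacher_logits_raw
    p_a = softmax(t_a, temperature)
    p_b = softmax(t_b, temperature)
    # Mix the teacher's clean distributions with the mixup coefficient.
    target = [lam * a + (1.0 - lam) * b for a, b in zip(p_a, p_b)]
    student = softmax(student_logits_mixed, temperature)
    return kl_divergence(target, student)

# When the student already matches the mixed teacher target, the loss
# vanishes; any mismatch yields a positive penalty.
print(asymmetric_distill_loss(([2.0, 0.0], [2.0, 0.0]), [2.0, 0.0], 0.3))
```

Because the teacher's targets come from unmixed inputs, they keep sharp semantics even when the student's mixed input is ambiguous, which is the benefit the abstract attributes to feeding the teacher extra raw data.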