After pre-training on next-word prediction conditioned on previous words, a Language Model (LM) acquires the ability of In-Context Learning (ICL): it can learn a new task conditioned on the context of given in-context examples (ICEs). Similarly, visually-conditioned language modelling is used to train Vision-Language Models (VLMs) with ICL ability. However, such VLMs typically exhibit weaker classification abilities compared to contrastive learning-based models like CLIP, since the language modelling objective does not directly contrast whether an object is paired with a text. A straightforward way to improve in-context classification is to use more ICEs to provide more knowledge. However, this may greatly increase the selection time and, more importantly, the inclusion of additional in-context images tends to extend the in-context sequence beyond the processing capacity of a VLM. To alleviate these limitations, we propose to manipulate the label space of each ICE to increase its knowledge density, allowing fewer ICEs to convey as much information as a larger set would. Specifically, we propose two strategies, Label Distribution Enhancement and Visual Descriptions Enhancement, to improve in-context classification performance on diverse datasets, including the classic ImageNet and more fine-grained datasets like CUB-200. On ImageNet, our approach increases accuracy from 74.70\% in a 4-shot setting to 76.21\% with just 2 shots, surpassing CLIP by 0.67\%. On CUB-200, our method raises 1-shot accuracy from 48.86\% to 69.05\%, 12.15\% higher than CLIP. The code is available at https://anonymous.4open.science/r/MLS_ICC.
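To make the label-space manipulation concrete, here is a hypothetical sketch of how a single ICE's text could be densified with a label distribution and a visual description (the prompt template, the `enhance_ice` helper, and the CLIP-derived probabilities are illustrative assumptions, not the authors' released code):

```python
# Hypothetical sketch: densify one in-context example (ICE) by attaching a
# label distribution and a short visual description to its text label.
def enhance_ice(image_caption_label, label_probs, description):
    """image_caption_label: (image_token, ground-truth class name)
    label_probs: dict {class_name: probability}, e.g. from a CLIP zero-shot pass
    description: short visual description of the class (e.g. from an LLM)"""
    image_token, label = image_caption_label
    top = sorted(label_probs.items(), key=lambda kv: -kv[1])[:3]
    dist_text = ", ".join(f"{name} ({p:.2f})" for name, p in top)
    return (f"{image_token} This is a {label}. "
            f"Likely classes: {dist_text}. "
            f"Visual cues: {description}")

# usage with toy values
ice = ("<image>", "indigo bunting")
probs = {"indigo bunting": 0.72, "blue grosbeak": 0.18, "lazuli bunting": 0.06}
print(enhance_ice(ice, probs, "small songbird, vivid blue plumage, conical beak"))
```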
https://arxiv.org/abs/2312.00351
Stereo matching, a pivotal technique in computer vision, plays a crucial role in robotics, autonomous navigation, and augmented reality. Despite the development of numerous impressive methods in recent years, replicating their results and determining the most suitable architecture for practical application remains challenging. Addressing this gap, our paper introduces a comprehensive benchmark focusing on practical applicability rather than solely on performance enhancement. Specifically, we develop a flexible and efficient stereo matching codebase, called OpenStereo. OpenStereo includes training and inference code for more than 12 network models, making it, to our knowledge, the most complete stereo matching toolbox available. Based on OpenStereo, we conduct experiments on the SceneFlow dataset and achieve or surpass the performance metrics reported in the original papers. Additionally, we conduct an in-depth revisitation of recent developments in stereo matching through ablative experiments. These investigations inspired the creation of StereoBase, a simple yet strong baseline model. Our extensive comparative analyses of StereoBase against numerous contemporary stereo matching methods on the SceneFlow dataset demonstrate its remarkably strong performance. The source code is available at this https URL.
https://arxiv.org/abs/2312.00343
Stein's paradox holds considerable sway in high-dimensional statistics, highlighting that the sample mean, traditionally considered the de facto estimator, might not be the most efficacious in higher dimensions. To address this, the James-Stein estimator proposes an enhancement by steering the sample means toward a more centralized mean vector. In this paper, first, we establish that normalization layers in deep learning use inadmissible estimators for mean and variance. Next, we introduce a novel method to employ the James-Stein estimator to improve the estimation of mean and variance within normalization layers. We evaluate our method on different computer vision tasks: image classification, semantic segmentation, and 3D object classification. Through these evaluations, it is evident that our improved normalization layers consistently yield superior accuracy across all tasks without extra computational burden. Moreover, recognizing that a plethora of shrinkage estimators surpass the traditional estimator in performance, we study two other prominent shrinkage estimators: Ridge and LASSO. Additionally, we provide visual representations to intuitively demonstrate the impact of shrinkage on the estimated layer statistics. Finally, we study the effect of regularization and batch size on our modified batch normalization. The studies show that our method is less sensitive to batch size and regularization, improving accuracy under various setups.
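A minimal NumPy sketch of the core idea, assuming the per-channel means of a batch-normalization layer are shrunk toward their grand mean with a positive-part James-Stein rule; the paper's exact estimator, variance handling, and training integration may differ:

```python
import numpy as np

def james_stein_shrink(estimates, noise_var):
    """Positive-part James-Stein shrinkage of a d-dimensional vector of
    estimates toward its grand mean (illustrative; needs d >= 4)."""
    d = estimates.size
    grand_mean = estimates.mean()
    centered = estimates - grand_mean
    norm_sq = np.sum(centered ** 2) + 1e-12
    shrink = max(0.0, 1.0 - (d - 3) * noise_var / norm_sq)  # d-3 because the grand mean is estimated
    return grand_mean + shrink * centered

def batch_norm_js(x, eps=1e-5):
    """BatchNorm-style normalization over an NCHW array using James-Stein-shrunk
    per-channel means; a sketch under assumptions, not the paper's exact layer."""
    n, c, h, w = x.shape
    mean = x.mean(axis=(0, 2, 3))                 # naive per-channel means
    var = x.var(axis=(0, 2, 3))
    noise_var = var.mean() / (n * h * w)          # rough variance of each mean estimate
    js_mean = james_stein_shrink(mean, noise_var)
    return (x - js_mean[None, :, None, None]) / np.sqrt(var[None, :, None, None] + eps)

# toy usage
x = np.random.randn(8, 32, 16, 16).astype(np.float32)
y = batch_norm_js(x)
```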
https://arxiv.org/abs/2312.00313
Polyp segmentation plays a vital role in accurately locating polyps at an early stage, which holds significant clinical importance for the prevention of colorectal cancer. Various polyp segmentation methods have been developed using fully-supervised deep learning techniques. However, pixel-wise annotation of polyp images by physicians during diagnosis is both time-consuming and expensive. Moreover, visual foundation models such as the Segment Anything Model (SAM) have shown remarkable performance. Nevertheless, directly applying SAM to medical segmentation may not produce satisfactory results due to its inherent absence of medical knowledge. In this paper, we propose a novel SAM-guided Collaborative Learning Network (SAM-CLNet) for scribble-supervised polyp segmentation, enabling a collaborative learning process between our segmentation network and SAM to boost model performance. Specifically, we first propose a Cross-level Enhancement and Aggregation Network (CEA-Net) for weakly-supervised polyp segmentation. Within CEA-Net, we propose a Cross-level Enhancement Module (CEM) that integrates adjacent features to enhance the representation capabilities of different-resolution features. Additionally, a Feature Aggregation Module (FAM) is employed to capture richer features across multiple levels. Moreover, we present a box-augmentation strategy that combines the segmentation maps generated by CEA-Net with scribble annotations to create more precise prompts. These prompts are then fed into SAM, generating SAM-guided segmentation masks, which provide additional supervision to train CEA-Net effectively. Furthermore, we present an Image-level Filtering Mechanism to filter out unreliable SAM-guided masks. Extensive experimental results show that our SAM-CLNet outperforms state-of-the-art weakly-supervised segmentation methods.
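A rough sketch of the box-augmentation step under stated assumptions: threshold the CEA-Net prediction, union it with the scribble mask, and take the enclosing rectangle as a box prompt for SAM. The helper below is illustrative, not the released implementation:

```python
import numpy as np

def box_prompt_from_pred_and_scribble(pred_map, scribble_mask, thresh=0.5, margin=5):
    """pred_map: HxW soft segmentation map from CEA-Net (values in [0, 1]).
    scribble_mask: HxW binary scribble annotation.
    Returns an [x0, y0, x1, y1] box covering both, enlarged by `margin`; the box
    would then be passed to SAM as a prompt (e.g. SamPredictor.predict(box=...))."""
    fg = (pred_map > thresh) | (scribble_mask > 0)
    ys, xs = np.nonzero(fg)
    if ys.size == 0:
        return None                          # nothing to prompt with
    h, w = pred_map.shape
    x0, x1 = max(0, xs.min() - margin), min(w - 1, xs.max() + margin)
    y0, y1 = max(0, ys.min() - margin), min(h - 1, ys.max() + margin)
    return np.array([x0, y0, x1, y1])

# toy usage
pred = np.zeros((128, 128)); pred[40:80, 50:90] = 0.9
scribble = np.zeros((128, 128), dtype=np.uint8); scribble[60, 95:100] = 1
print(box_prompt_from_pred_and_scribble(pred, scribble))
```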
https://arxiv.org/abs/2312.00312
Currently, to further improve visual enjoyment, Ultra-High-Definition (UHD) images are attracting wide attention. Here, UHD images usually refer to images with a resolution greater than or equal to $3840 \times 2160$. However, since imaging equipment is subject to environmental noise or equipment jitter, UHD images are prone to contrast degradation, blurring, low dynamic range, etc. To address these issues, a large number of algorithms for UHD image enhancement have been proposed. In this paper, we review the current state of UHD image enhancement from two perspectives: the application fields and the techniques. In addition, we briefly explore its trends.
https://arxiv.org/abs/2312.00250
Dataset distillation aims to generate a smaller but representative subset from a large dataset, which allows a model to be trained efficiently while being evaluated on the original test data distribution to achieve decent performance. Many prior works have aimed to align with diverse aspects of the original datasets, such as matching the training weight trajectories, gradients, or feature/BatchNorm distributions. In this work, we show how to distill various large-scale datasets such as full ImageNet-1K/21K under a conventional input resolution of 224$\times$224 to achieve the best accuracy over all previous approaches, including SRe$^2$L, TESLA and MTT. To achieve this, we introduce a simple yet effective ${\bf C}$urriculum ${\bf D}$ata ${\bf A}$ugmentation ($\texttt{CDA}$) during data synthesis that reaches 63.2% accuracy on large-scale ImageNet-1K under IPC (Images Per Class) 50 and 36.1% on ImageNet-21K under IPC 20. Finally, we show that, by integrating all our enhancements together, the proposed model beats the current state-of-the-art by more than 4% Top-1 accuracy on ImageNet-1K/21K and, for the first time, reduces the gap to its full-data training counterpart to less than 15% in absolute terms. Moreover, this work represents the first success of dataset distillation on the larger-scale ImageNet-21K under the standard 224$\times$224 resolution. Our code and distilled ImageNet-21K dataset of 20 IPC, 2K recovery budget are available at this https URL.
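One plausible reading of the curriculum, sketched below under assumptions: during synthesis, the lower bound of the RandomResizedCrop scale is annealed so that early iterations see gentle crops and later iterations see aggressive ones. The schedule and hyperparameters are illustrative, not the paper's exact recipe:

```python
import torch
from torchvision import transforms

def cda_crop_transform(step, total_steps, min_scale_start=0.5, min_scale_end=0.08):
    """Curriculum Data Augmentation (assumed schedule): linearly move the minimum
    crop scale from a gentle value early in synthesis to an aggressive one later."""
    t = step / max(1, total_steps - 1)
    min_scale = (1 - t) * min_scale_start + t * min_scale_end
    return transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(min_scale, 1.0)),
        transforms.RandomHorizontalFlip(),
    ])

# usage: augmentation becomes progressively harder over the synthesis run
total_steps = 1000
imgs = torch.rand(4, 3, 256, 256)          # stand-in for images being optimized
early = torch.stack([cda_crop_transform(0, total_steps)(im) for im in imgs])
late = torch.stack([cda_crop_transform(999, total_steps)(im) for im in imgs])
```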
https://arxiv.org/abs/2311.18838
Diffusion models have achieved significant success in image and video generation. This motivates a growing interest in video editing tasks, where videos are edited according to provided text descriptions. However, most existing approaches only focus on video editing for short clips and rely on time-consuming tuning or inference. We are the first to propose Video Instruction Diffusion (VIDiff), a unified foundation model designed for a wide range of video tasks. These tasks encompass both understanding tasks (such as language-guided video object segmentation) and generative tasks (video editing and enhancement). Our model can edit and translate the desired results within seconds based on user instructions. Moreover, we design an iterative auto-regressive method to ensure consistency in editing and enhancing long videos. We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively. More examples can be found at our website this https URL.
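A toy sketch of the iterative auto-regressive scheme for long videos, assuming each clip is edited conditioned on the tail frames of the previously edited clip; `edit_clip` is a stand-in for the diffusion model, not VIDiff's actual interface:

```python
import numpy as np

def edit_clip(frames, instruction, condition_frames=None):
    """Stand-in for the diffusion-based editor (hypothetical interface).
    A real model would denoise `frames` conditioned on `instruction` and,
    when given, on `condition_frames` from the previous clip."""
    return frames  # identity here, for illustration only

def edit_long_video(video, instruction, clip_len=16, overlap=2):
    """Iterative auto-regressive editing sketch (assumed scheme): process the
    video clip by clip, conditioning each clip on the last edited frames of
    the previous clip to keep long-range consistency."""
    edited, prev_tail = [], None
    for start in range(0, len(video), clip_len):
        clip = video[start:start + clip_len]
        out = edit_clip(clip, instruction, condition_frames=prev_tail)
        edited.extend(out)
        prev_tail = out[-overlap:]
    return edited

# usage with dummy frames
video = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(40)]
result = edit_long_video(video, "make it snow")
```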
https://arxiv.org/abs/2311.18837
Underwater object detection is a crucial and challenging problem in marine engineering and aquatic robotics. The difficulty stems in part from the degradation of underwater images caused by selective light absorption and scattering. Intuitively, enhancing underwater images can benefit high-level applications like underwater object detection. However, it is still unclear whether all object detectors need underwater image enhancement as pre-processing. We therefore pose the questions "Does underwater image enhancement really improve underwater object detection?" and "How does underwater image enhancement contribute to underwater object detection?". With these two questions, we conduct extensive studies. Specifically, we use 18 state-of-the-art underwater image enhancement algorithms, covering traditional, CNN-based, and GAN-based algorithms, to pre-process underwater object detection data. Then, we retrain 7 popular deep learning-based object detectors using the corresponding results enhanced by different algorithms, obtaining 126 underwater object detection models. Coupled with 7 object detection models retrained using raw underwater images, we employ these 133 models to comprehensively analyze the effect of underwater image enhancement on underwater object detection. We expect this study to provide sufficient exploration to answer the aforementioned questions and to draw more attention from the community to the joint problem of underwater image enhancement and underwater object detection. The pre-trained models and results are publicly available and will be regularly updated. Project page: this https URL.
https://arxiv.org/abs/2311.18814
We extend JAX with the capability to automatically differentiate higher-order functions (functionals and operators). By representing functions as a generalization of arrays, we seamlessly use JAX's existing primitive system to implement higher-order functions. We present a set of primitive operators that serve as foundational building blocks for constructing several key types of functionals. For every introduced primitive operator, we derive and implement both linearization and transposition rules, aligning with JAX's internal protocols for forward- and reverse-mode automatic differentiation. This enhancement allows for functional differentiation in the same syntax traditionally used for functions. The resulting functional gradients are themselves functions ready to be invoked in Python. We showcase this tool's efficacy and simplicity through applications where functional derivatives are indispensable. The source code of this work is released at this https URL.
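For readers unfamiliar with functional derivatives, a standard textbook example (not the paper's specific API) of why a functional gradient is itself a function:

```latex
\[
  F[f] = \int L\bigl(x, f(x)\bigr)\,dx
  \quad\Longrightarrow\quad
  \frac{\delta F}{\delta f}(x) = \frac{\partial L}{\partial f}\bigg|_{f = f(x)} .
\]
% Concrete instance with L(x, f) = f(x)^2:
\[
  F[f] = \int f(x)^2\,dx
  \quad\Longrightarrow\quad
  \frac{\delta F}{\delta f}(x) = 2\,f(x),
\]
% so the functional gradient is again a function of x, which matches the claim
% that the returned gradients are functions ready to be invoked.
```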
https://arxiv.org/abs/2311.18727
Signal-dependent beamformers are advantageous over signal-independent beamformers when the acoustic scenario - be it real-world or simulated - is straightforward in terms of the number of sound sources, the ambient sound field and their dynamics. However, in the context of augmented reality audio using head-worn microphone arrays, the acoustic scenarios encountered are often far from straightforward. The design of robust, high-performance, adaptive beamformers for such scenarios is an on-going challenge. This is due to the violation of the typically required assumptions on the noise field caused by, for example, rapid variations resulting from complex acoustic environments, and/or rotations of the listener's head. This work proposes a multi-channel speech enhancement algorithm which utilises the adaptability of signal-dependent beamformers while still benefiting from the computational efficiency and robust performance of signal-independent super-directive beamformers. The algorithm has two stages. (i) The first stage is a hybrid beamformer based on a dictionary of weights corresponding to a set of noise field models. (ii) The second stage is a wide-band subspace post-filter to remove any artifacts resulting from (i). The algorithm is evaluated using both real-world recordings and simulations of a cocktail-party scenario. Noise suppression, intelligibility and speech quality results show a significant performance improvement by the proposed algorithm compared to the baseline super-directive beamformer. A data-driven implementation of the noise field dictionary is shown to provide more noise suppression, and similar speech intelligibility and quality, compared to a parametric dictionary.
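A compact NumPy sketch of the first (hybrid-beamformer) stage under assumptions: precompute MVDR-style weights for a small dictionary of noise-field models and, per frame, apply the entry whose model best matches the current noise-covariance estimate. The matching rule, the toy noise models, and the omitted subspace post-filter are illustrative simplifications:

```python
import numpy as np

def mvdr_weights(noise_cov, steering):
    """Minimum-variance distortionless-response weights for one noise model."""
    inv = np.linalg.pinv(noise_cov)
    num = inv @ steering
    return num / (steering.conj().T @ num)

def build_dictionary(noise_models, steering):
    """Precompute one set of beamformer weights per assumed noise-field model."""
    return [mvdr_weights(cov, steering) for cov in noise_models]

def hybrid_beamform(frame, dictionary, noise_models, noise_est):
    """Pick the dictionary entry whose noise model best matches the current
    noise-covariance estimate (here: smallest Frobenius distance), then apply it."""
    dists = [np.linalg.norm(noise_est - cov) for cov in noise_models]
    w = dictionary[int(np.argmin(dists))]
    return w.conj().T @ frame            # single-channel beamformed output

# toy usage: 4-mic array, two candidate noise fields (white vs. partly coherent)
m = 4
steering = np.ones((m, 1), dtype=complex)
noise_models = [np.eye(m, dtype=complex),
                0.5 * np.eye(m, dtype=complex) + 0.5 * np.ones((m, m), dtype=complex)]
dictionary = build_dictionary(noise_models, steering)
frame = np.random.randn(m, 1) + 1j * np.random.randn(m, 1)
y = hybrid_beamform(frame, dictionary, noise_models, noise_est=np.eye(m, dtype=complex))
```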
https://arxiv.org/abs/2311.18689
Influence Maximization is the task of selecting optimal nodes that maximize influence spread in social networks. This study proposes a Discretized Quantum-based Salp Swarm Algorithm (DQSSA) for optimizing influence diffusion in social networks. By discretizing meta-heuristic algorithms and infusing them with quantum-inspired enhancements, we address issues like premature convergence and low efficacy. The proposed method, guided by quantum principles, offers a promising solution for Influence Maximization. Experiments on four real-world datasets reveal DQSSA's superior performance compared to established cutting-edge algorithms.
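An assumed sketch of the discretization step (a sigmoid transfer function mapping a salp's continuous position to a k-node seed set); the paper's exact transfer function and quantum-inspired update are not reproduced here:

```python
import numpy as np

def discretize(position, k):
    """Map a salp's continuous position vector to a binary seed set of size k
    (assumed transfer-function scheme, not necessarily the paper's exact rule):
    squash each dimension with a sigmoid and keep the k highest-probability nodes."""
    probs = 1.0 / (1.0 + np.exp(-position))
    seeds = np.argsort(-probs)[:k]
    x = np.zeros_like(position, dtype=int)
    x[seeds] = 1
    return x

# toy usage: 100-node network, choose 5 seed nodes from one candidate solution
rng = np.random.default_rng(0)
position = rng.normal(size=100)
seed_mask = discretize(position, k=5)
print(np.flatnonzero(seed_mask))
```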
https://arxiv.org/abs/2311.18676
Precisely predicting the future trajectories of surrounding traffic participants is a crucial but challenging problem in autonomous driving, due to complex interactions between traffic agents, map context and traffic rules. Vector-based approaches have recently been shown to achieve among the best performance on trajectory prediction benchmarks. These methods model simple interactions between traffic agents but do not distinguish between relation types and attributes such as the agents' distance along the road. Furthermore, they represent lanes only by sequences of vectors representing center lines and ignore context information like lane dividers and other road elements. We present a novel approach for vector-based trajectory prediction that addresses these shortcomings by leveraging three crucial sources of information. First, we model interactions between traffic agents by a semantic scene graph that accounts for the nature and important features of their relations. Second, we extract agent-centric, image-based map features to model the local map context. Finally, we generate anchor paths to constrain the policy in multi-modal prediction to permitted trajectories only. Each of these three enhancements shows advantages over the baseline model HoliGraph.
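A tiny, hypothetical data-structure sketch of the semantic scene graph (relation types and attributes such as distance along the road are illustrative choices, not the paper's exact schema):

```python
from dataclasses import dataclass, field

@dataclass
class AgentNode:
    agent_id: int
    history: list           # past positions, e.g. [(x, y), ...]

@dataclass
class Relation:
    src: int
    dst: int
    rel_type: str            # e.g. "longitudinal", "lateral", "intersecting"
    dist_along_road: float   # attribute that plain vector models ignore

@dataclass
class SceneGraph:
    agents: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_relation(self, src, dst, rel_type, dist_along_road):
        self.relations.append(Relation(src, dst, rel_type, dist_along_road))

# toy usage: ego vehicle (0) follows agent 1 in the same lane, 12.5 m ahead
g = SceneGraph()
g.agents[0] = AgentNode(0, [(0.0, 0.0)])
g.agents[1] = AgentNode(1, [(12.5, 0.0)])
g.add_relation(0, 1, "longitudinal", 12.5)
```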
https://arxiv.org/abs/2311.18553
This paper explores advancements in high-fidelity personalized image generation through the utilization of pre-trained text-to-image diffusion models. While previous approaches have made significant strides in generating versatile scenes based on text descriptions and a few input images, challenges persist in maintaining the subject fidelity within the generated images. In this work, we introduce an innovative algorithm named HiFi Tuner to enhance the appearance preservation of objects during personalized image generation. Our proposed method employs a parameter-efficient fine-tuning framework, comprising a denoising process and a pivotal inversion process. Key enhancements include the utilization of mask guidance, a novel parameter regularization technique, and the incorporation of step-wise subject representations to elevate the sample fidelity. Additionally, we propose a reference-guided generation approach that leverages the pivotal inversion of a reference image to mitigate unwanted subject variations and artifacts. We further extend our method to a novel image editing task: substituting the subject in an image through textual manipulations. Experimental evaluations conducted on the DreamBooth dataset using the Stable Diffusion model showcase promising results. Fine-tuning solely on textual embeddings improves CLIP-T score by 3.6 points and improves DINO score by 9.6 points over Textual Inversion. When fine-tuning all parameters, HiFi Tuner improves CLIP-T score by 1.2 points and improves DINO score by 1.2 points over DreamBooth, establishing a new state of the art.
https://arxiv.org/abs/2312.00079
Advancements in monaural speech enhancement (SE) techniques have greatly improved the perceptual quality of speech. However, integrating these techniques into automatic speech recognition (ASR) systems has not yielded the expected performance gains, primarily due to the introduction of distortions during the SE process. In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. To address the distortions introduced by the SE frontends, we introduce layer-wise fusion modules that incorporate features extracted from both observed noisy signals and enhanced signals. During training, the SE frontend is randomly selected from a pool of models. We evaluate the performance of FAT-HuBERT on simulated noisy speech generated from LibriSpeech as well as real-world noisy speech from the CHiME-4 1-channel dataset. The experimental results demonstrate a significant relative reduction in word error rate (WER).
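A small PyTorch sketch in the spirit of the layer-wise fusion modules: a per-layer learned gate mixes the SSL features of the noisy signal with those of the enhanced signal. Dimensions and the gating form are assumptions:

```python
import torch
import torch.nn as nn

class LayerwiseFusion(nn.Module):
    """Fuse features of the observed noisy signal and the SE-enhanced signal
    at every transformer layer with a per-layer learned gate (a sketch)."""
    def __init__(self, num_layers, dim):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(num_layers)])

    def forward(self, noisy_feats, enhanced_feats):
        # noisy_feats, enhanced_feats: lists of [batch, time, dim] tensors, one per layer
        fused = []
        for gate, fn, fe in zip(self.gates, noisy_feats, enhanced_feats):
            g = torch.sigmoid(gate(torch.cat([fn, fe], dim=-1)))
            fused.append(g * fn + (1 - g) * fe)
        return fused

# toy usage: 12 layers, 768-dim features, 2-sample batch, 50 frames
layers, dim = 12, 768
fusion = LayerwiseFusion(layers, dim)
noisy = [torch.randn(2, 50, dim) for _ in range(layers)]
enhanced = [torch.randn(2, 50, dim) for _ in range(layers)]
out = fusion(noisy, enhanced)
```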
https://arxiv.org/abs/2311.17790
We introduce VISUAL EMBEDDED INSTRUCTION (VIM), a new framework designed to evaluate the visual instruction-following capability of Multimodal Large Language Models (MLLMs). As illustrated in Figure 2, VIM challenges the MLLMs by embedding the instructions into the visual scenes, demanding strong visual interpretative skills for instruction following. We adapt VIM to various benchmarks, including VQAv2, MME, MM-Vet, and the RefCOCO series, compose a VIM bench, and probe diverse MLLMs across three distinct in-context learning settings: Zero Shot, One Shot, and Pair Shot. We observe a significant performance disparity between the open-source MLLMs and GPT-4V, implying that their proficiency in visual instruction comprehension is not up to par. Our results highlight a promising direction for enhancing MLLMs' instruction-following capabilities. We aim for VIM to serve as a useful norm for advancing the state of the art and driving further progress in the field.
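A minimal sketch of what embedding an instruction into the visual scene could look like in practice (rendering the instruction text in a band appended below the image); the benchmark's actual layout, fonts, and placement may differ:

```python
from PIL import Image, ImageDraw

def embed_instruction(image_path, instruction, band_height=60):
    """Return a new image with the instruction rendered in a white band under
    the original picture, so the model must read it visually (illustrative)."""
    img = Image.open(image_path).convert("RGB")
    canvas = Image.new("RGB", (img.width, img.height + band_height), "white")
    canvas.paste(img, (0, 0))
    draw = ImageDraw.Draw(canvas)
    draw.text((10, img.height + 10), instruction, fill="black")
    return canvas

# usage (hypothetical file name)
# embed_instruction("example.jpg", "What color is the bus?").save("example_vim.jpg")
```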
https://arxiv.org/abs/2311.17647
Tiger conservation necessitates the strategic deployment of multifaceted initiatives encompassing the preservation of ecological habitats, anti-poaching measures, and community involvement for sustainable growth in the tiger population. With the advent of artificial intelligence, tiger surveillance can be automated using object detection. In this paper, an accurate illumination-invariant framework based on EnlightenGAN and YOLOv8 is proposed for tiger detection. The fine-tuned YOLOv8 model achieves a mAP score of 61% without illumination enhancement. The illumination enhancement improves the mAP by 0.7%. The approaches elevate the state-of-the-art performance on the ATRW dataset by approximately 6% to 7%.
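A sketch of the two-stage enhance-then-detect pipeline; `enlighten` stands in for an EnlightenGAN inference call and the fine-tuned weights file name is hypothetical, while the Ultralytics calls follow its standard inference API:

```python
from ultralytics import YOLO  # YOLOv8 family

def enlighten(image_bgr):
    """Placeholder for an EnlightenGAN forward pass; a real implementation
    would load the generator and brighten the low-light frame."""
    return image_bgr

def detect_tigers(image_bgr, weights="yolov8_tiger_finetuned.pt"):  # hypothetical weights
    model = YOLO(weights)
    enhanced = enlighten(image_bgr)      # illumination-invariance step
    results = model(enhanced)            # standard Ultralytics inference call
    return results[0].boxes              # detected tiger boxes

# usage (assuming an OpenCV-style BGR numpy image `frame`):
# boxes = detect_tigers(frame)
```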
https://arxiv.org/abs/2311.17552
The Manchu language, with its roots in the historical Manchurian region of Northeast China, is now facing a critical threat of extinction, as there are very few speakers left. In our efforts to safeguard the Manchu language, we introduce Mergen, the first-ever attempt at a Manchu-Korean Machine Translation (MT) model. To develop this model, we utilize valuable resources such as the Manwen Laodang (a historical book) and a Manchu-Korean dictionary. Due to the scarcity of a Manchu-Korean parallel dataset, we expand our data by employing word replacement guided by GloVe embeddings trained on both monolingual and parallel texts. Our approach is built around an encoder-decoder neural machine translation model, incorporating a bi-directional Gated Recurrent Unit (GRU) layer. The experiments have yielded promising results, showcasing a significant enhancement in Manchu-Korean translation, with a remarkable 20-30 point increase in the BLEU score.
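A toy sketch of embedding-guided word replacement for data augmentation (cosine-nearest neighbour in a small GloVe-style vector table); in the actual pipeline the embeddings are trained on the Manchu-Korean monolingual and parallel texts:

```python
import numpy as np

# toy GloVe-style vectors; in practice these are trained on the project corpora
vectors = {
    "horse": np.array([0.9, 0.1, 0.0]),
    "stallion": np.array([0.85, 0.15, 0.05]),
    "river": np.array([0.0, 0.9, 0.2]),
}

def nearest_neighbor(word, vocab=vectors):
    """Return the most cosine-similar other word in the vocabulary."""
    v = vocab[word]
    best, best_sim = None, -1.0
    for other, u in vocab.items():
        if other == word:
            continue
        sim = float(v @ u / (np.linalg.norm(v) * np.linalg.norm(u)))
        if sim > best_sim:
            best, best_sim = other, sim
    return best

def augment(sentence, replace_prob=0.3, rng=np.random.default_rng(0)):
    """Randomly swap known words for their nearest embedding neighbour to
    synthesise additional training sentences (illustrative only)."""
    out = []
    for tok in sentence.split():
        out.append(nearest_neighbor(tok) if tok in vectors and rng.random() < replace_prob else tok)
    return " ".join(out)

print(augment("the horse crossed the river"))
```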
https://arxiv.org/abs/2311.17492
Analyzing the genomic information in the Pan-Cancer database can help us understand cancer-related factors and contribute to cancer diagnosis and prognosis. However, existing computational methods and deep learning methods cannot effectively find the deep correlations between tens of thousands of genes, which leads to precision loss. In this paper, we propose a novel pretrained model called Gene-MOE to learn general feature representations of the Pan-Cancer dataset and transfer the pretrained weights to downstream tasks. Gene-MOE fully exploits mixture-of-experts (MOE) layers to learn rich feature representations of high-dimensional genes. At the same time, we build a mixture of attention experts (MOAE) model to learn the deep semantic relationships within genetic features. Finally, we propose a new self-supervised pretraining strategy, including loss function design, data enhancement, and an optimization strategy, to train Gene-MOE and further improve performance on downstream analysis. We carried out cancer classification and survival analysis experiments based on Gene-MOE. According to the survival analysis results on 14 cancer types, Gene-MOE outperformed state-of-the-art models on 12 cancer types. According to the classification results, the total accuracy of the classification model across 33 cancer types reached 95.2\%. Through detailed feature analysis, we found that the Gene-MOE model can learn rich feature representations of high-dimensional genes.
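For readers unfamiliar with mixture-of-experts layers, a compact PyTorch sketch of the generic mechanism (a softmax gate producing per-sample mixing weights over expert MLPs); the sizes and dense, top-k-free gating are illustrative, not Gene-MOE's exact configuration:

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Dense mixture-of-experts: every expert sees every sample and the gate
    produces per-sample mixing weights (a generic sketch)."""
    def __init__(self, in_dim, hidden_dim, num_experts=8):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                           nn.Linear(hidden_dim, in_dim))
             for _ in range(num_experts)])
        self.gate = nn.Linear(in_dim, num_experts)

    def forward(self, x):                                  # x: [batch, in_dim]
        weights = torch.softmax(self.gate(x), dim=-1)                  # [batch, E]
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # [batch, E, in_dim]
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)

# toy usage on a gene-feature vector (e.g. tens of thousands of genes projected to 512 dims)
layer = MoELayer(in_dim=512, hidden_dim=1024)
out = layer(torch.randn(4, 512))
```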
https://arxiv.org/abs/2311.17401
Monitoring wildfires is an essential step in minimizing their impact on the planet, given their many negative environmental, economic, and social consequences. Recent advances in remote sensing technology combined with the increasing application of artificial intelligence methods have improved real-time, high-resolution fire monitoring. This study explores two proposed approaches based on the U-Net model for automating and optimizing the burned-area mapping process. Denoted 128 and AllSizes (AS), they are trained on datasets with different class balances obtained by cropping input images to different sizes. They are then applied to Landsat imagery and time-series data from two fire-prone regions in Chile. The results obtained after enhancing model performance through hyperparameter optimization demonstrate the effectiveness of both approaches. Tests based on 195 representative images of the study area show that increasing dataset balance using the AS model yields better performance. More specifically, AS exhibited a Dice Coefficient (DC) of 0.93, an Omission Error (OE) of 0.086, and a Commission Error (CE) of 0.045, while the 128 model achieved a DC of 0.86, an OE of 0.12, and a CE of 0.12. These findings should provide a basis for the further development of scalable, automatic burned-area mapping tools.
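For reference, the three reported metrics can be computed from binary prediction and ground-truth masks as follows (standard remote-sensing definitions; the epsilon handling is a choice of this sketch):

```python
import numpy as np

def burned_area_metrics(pred, truth):
    """pred, truth: binary HxW arrays (1 = burned). Returns the Dice coefficient,
    omission error (missed burned pixels) and commission error (false alarms)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    dice = 2 * tp / (2 * tp + fp + fn + 1e-12)
    omission = fn / (tp + fn + 1e-12)      # fraction of true burned area that was missed
    commission = fp / (tp + fp + 1e-12)    # fraction of predicted burned area that is wrong
    return dice, omission, commission

# toy example
pred = np.zeros((10, 10), dtype=int);  pred[2:7, 2:7] = 1
truth = np.zeros((10, 10), dtype=int); truth[3:8, 3:8] = 1
print(burned_area_metrics(pred, truth))
```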
https://arxiv.org/abs/2311.17368
Traditional Chinese medicine (TCM) prescription is the most critical form of TCM treatment, and uncovering the complex nonlinear relationship between symptoms and TCM prescriptions is of great significance for clinical practice and for assisting physicians in diagnosis and treatment. Although there have been some studies on TCM prescription generation, these studies consider a single factor and directly model the symptom-to-prescription generation problem mainly based on symptom descriptions, lacking guidance from TCM knowledge. To this end, we propose RoKEPG, a RoBERTa and Knowledge Enhancement model for Prescription Generation of Traditional Chinese Medicine. RoKEPG is first pre-trained on our constructed TCM corpus, followed by fine-tuning of the pre-trained model, and the model is guided to generate TCM prescriptions by introducing four classes of TCM knowledge through the attention mask matrix. Experimental results on the publicly available TCM prescription dataset show that RoKEPG improves the F1 metric by about 2% over the best-performing baseline model.
https://arxiv.org/abs/2311.17307