In this work, we address two limitations of existing conditional diffusion models: their slow inference speed due to the iterative denoising process and their reliance on paired data for model fine-tuning. To tackle these issues, we introduce a general method for adapting a single-step diffusion model to new tasks and domains through adversarial learning objectives. Specifically, we consolidate various modules of the vanilla latent diffusion model into a single end-to-end generator network with small trainable weights, enhancing its ability to preserve the input image structure while reducing overfitting. We demonstrate that, for unpaired settings, our model CycleGAN-Turbo outperforms existing GAN-based and diffusion-based methods for various scene translation tasks, such as day-to-night conversion and adding/removing weather effects like fog, snow, and rain. We extend our method to paired settings, where our model pix2pix-Turbo is on par with recent works like ControlNet for Sketch2Photo and Edge2Image, but with single-step inference. This work suggests that single-step diffusion models can serve as strong backbones for a range of GAN learning objectives. Our code and models are available at this https URL.
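As a rough sketch of the core recipe — small trainable low-rank weights on a frozen one-step generator, trained with an adversarial objective — consider the following; the module names, shapes, and hinge loss are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a small trainable low-rank update: W x + B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # backbone stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

def adversarial_step(generator, discriminator, x_src, x_real, opt_g, opt_d):
    """One hinge-GAN update. `generator` is a one-step model whose linear layers
    were wrapped in LoRALinear; `discriminator` is any image critic (both are
    hypothetical stand-ins here). x_real holds target-domain images."""
    fake = generator(x_src)
    loss_d = (torch.relu(1 - discriminator(x_real)).mean()
              + torch.relu(1 + discriminator(fake.detach())).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    loss_g = -discriminator(fake).mean()
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```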
https://arxiv.org/abs/2403.12036
Open-domain 3D object synthesis has been lagging behind image synthesis due to limited data and higher computational complexity. To bridge this gap, recent works have investigated multi-view diffusion but often fall short in either 3D consistency, visual quality, or efficiency. This paper proposes MVEdit, which functions as a 3D counterpart of SDEdit, employing ancestral sampling to jointly denoise multi-view images and output high-quality textured meshes. Built on off-the-shelf 2D diffusion models, MVEdit achieves 3D consistency through a training-free 3D Adapter, which lifts the 2D views of the last timestep into a coherent 3D representation, then conditions the 2D views of the next timestep using rendered views, without compromising visual quality. With an inference time of only 2-5 minutes, this framework achieves a better trade-off between quality and speed than score distillation. MVEdit is highly versatile and extendable, with a wide range of applications including text/image-to-3D generation, 3D-to-3D editing, and high-quality texture synthesis. In particular, evaluations demonstrate state-of-the-art performance in both image-to-3D and text-guided texture generation tasks. Additionally, we introduce a method for fine-tuning 2D latent diffusion models on small 3D datasets with limited resources, enabling fast low-resolution text-to-3D initialization.
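The control flow of the training-free 3D Adapter can be sketched as an ancestral sampling loop that alternates 2D denoising with 3D fusion and re-rendering; `denoise_views`, `fuse_to_3d`, and `render_views` below are hypothetical placeholders for the diffusion step, the reconstruction, and the renderer:

```python
# Control-flow sketch of MVEdit-style ancestral sampling with a training-free
# 3D adapter; the three callables are assumed stand-ins, not the paper's API.
def mvedit_sample(views, timesteps, denoise_views, fuse_to_3d, render_views):
    cond, mesh = None, None
    for t in timesteps:                        # descending noise levels
        views = denoise_views(views, t, cond)  # one ancestral 2D denoising step
        mesh = fuse_to_3d(views)               # lift views into a coherent 3D rep
        cond = render_views(mesh)              # rendered views condition next step
    return mesh, views                         # textured mesh + final 2D views
```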
https://arxiv.org/abs/2403.12032
As the range of applications for Large Language Models (LLMs) continues to grow, the demand for effective serving solutions becomes increasingly critical. Despite the versatility of LLMs, no single model can optimally address all tasks and applications, particularly when balancing performance with cost. This limitation has led to the development of LLM routing systems, which combine the strengths of various models to overcome the constraints of individual LLMs. Yet the absence of a standardized benchmark for evaluating the performance of LLM routers hinders progress in this area. To bridge this gap, we present ROUTERBENCH, a novel evaluation framework designed to systematically assess the efficacy of LLM routing systems, along with a comprehensive dataset comprising over 405k inference outcomes from representative LLMs to support the development of routing strategies. We further propose a theoretical framework for LLM routing and deliver a comparative analysis of various routing approaches through ROUTERBENCH, highlighting their potential and limitations within our evaluation framework. This work not only formalizes and advances the development of LLM routing systems but also sets a standard for their assessment, paving the way for more accessible and economically viable LLM deployments. The code and data are available at this https URL.
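A toy illustration of the routing problem the benchmark targets: pick, per query, the model with the best predicted quality-cost trade-off. The utility form and the numbers are assumptions for illustration, not ROUTERBENCH's actual routers:

```python
import numpy as np

def route(quality, cost, lam=0.01):
    """Pick, per query, the model maximizing predicted quality minus lam * cost.

    quality: (n_queries, n_models) predicted scores; cost: (n_models,) $/query.
    Both inputs are assumed given by some upstream predictor."""
    utility = quality - lam * np.asarray(cost)[None, :]
    return utility.argmax(axis=1)

# Example: 3 queries routed between a cheap and an expensive model.
quality = np.array([[0.60, 0.90], [0.80, 0.82], [0.30, 0.95]])
cost = np.array([0.001, 0.03])
print(route(quality, cost, lam=5.0))  # -> [1 0 1]: cheap model wins when the gap is small
```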
https://arxiv.org/abs/2403.12031
The field of neural rendering has witnessed significant progress with advancements in generative models and differentiable rendering techniques. Though 2D diffusion has achieved success, a unified 3D diffusion pipeline remains unsettled. This paper introduces a novel framework called LN3Diff to address this gap and enable fast, high-quality, and generic conditional 3D generation. Our approach harnesses a 3D-aware architecture and variational autoencoder (VAE) to encode the input image into a structured, compact, and 3D latent space. The latent is decoded by a transformer-based decoder into a high-capacity 3D neural field. Through training a diffusion model on this 3D-aware latent space, our method achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation across various datasets. Moreover, it surpasses existing 3D diffusion methods in terms of inference speed, requiring no per-instance optimization. Our proposed LN3Diff presents a significant advancement in 3D generative modeling and holds promise for various applications in 3D vision and graphics tasks.
https://arxiv.org/abs/2403.12019
Diffusion models are the main driver of progress in image and video synthesis, but suffer from slow inference speed. Distillation methods, like the recently introduced adversarial diffusion distillation (ADD), aim to shift the model from many-shot to single-step inference, albeit at the cost of expensive and difficult optimization due to its reliance on a fixed pretrained DINOv2 discriminator. We introduce Latent Adversarial Diffusion Distillation (LADD), a novel distillation approach overcoming the limitations of ADD. In contrast to pixel-based ADD, LADD utilizes generative features from pretrained latent diffusion models. This approach simplifies training and enhances performance, enabling high-resolution multi-aspect-ratio image synthesis. We apply LADD to Stable Diffusion 3 (8B) to obtain SD3-Turbo, a fast model that matches the performance of state-of-the-art text-to-image generators using only four unguided sampling steps. Moreover, we systematically investigate its scaling behavior and demonstrate LADD's effectiveness in various applications such as image editing and inpainting.
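A minimal sketch of the distillation objective, with the discriminator operating on features from a frozen pretrained latent-diffusion backbone rather than on pixels; `teacher_features`, the hinge loss, and the multi-head layout are illustrative assumptions:

```python
import torch

def ladd_losses(student, teacher_features, disc_heads, z_real, noise, t):
    """Latent adversarial distillation sketch: `student` maps noise to a latent
    in one step; `teacher_features` returns a list of feature maps from a frozen
    latent-diffusion backbone; `disc_heads` is one small critic per feature map.
    All three are assumed callables, not the paper's exact components."""
    z_fake = student(noise)                       # single-step latent generation
    f_real = [f.detach() for f in teacher_features(z_real, t)]
    f_fake = teacher_features(z_fake, t)
    d_loss = sum(torch.relu(1 - h(fr)).mean() + torch.relu(1 + h(ff.detach())).mean()
                 for h, fr, ff in zip(disc_heads, f_real, f_fake))
    g_loss = sum(-h(ff).mean() for h, ff in zip(disc_heads, f_fake))
    return d_loss, g_loss
```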
https://arxiv.org/abs/2403.12015
When exploring new areas, robotic systems generally exclusively plan and execute controls over geometry that has been directly measured. When entering space that was previously obstructed from view, such as when turning corners in hallways or entering new rooms, robots often pause to plan over the newly observed space. To address this, we present SceneSense, a real-time 3D diffusion model for synthesizing 3D occupancy information from partial observations, which effectively predicts these occluded or out-of-view geometries for use in future planning and control frameworks. SceneSense uses a running occupancy map and a single RGB-D camera to generate predicted geometry around the platform at runtime, even when the geometry is occluded or out of view. Our architecture ensures that SceneSense never overwrites observed free or occupied space. By preserving the integrity of the observed map, SceneSense mitigates the risk of corrupting the observed space with generative predictions. While SceneSense is shown to operate well using a single RGB-D camera, the framework is flexible enough to extend to additional modalities. SceneSense operates as part of any system that generates a running occupancy map "out of the box", removing conditioning from the framework. Alternatively, for maximum performance in new modalities, the perception backbone can be replaced and the model retrained for inference in new applications. Unlike existing models that necessitate multiple views and offline scene synthesis, or are focused on filling gaps in observed data, our findings demonstrate that SceneSense is an effective approach to estimating unobserved local occupancy information at runtime. Local occupancy predictions from SceneSense are shown to better represent the ground-truth occupancy distribution during the test exploration trajectories than the running occupancy map.
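The never-overwrite guarantee amounts to filling only unknown cells of the running map with generated predictions; a minimal sketch, assuming a simple ternary grid encoding:

```python
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1   # assumed cell encoding, for illustration

def merge_prediction(observed, predicted):
    """Fill only UNKNOWN cells of the running occupancy map with predictions.

    Observed FREE/OCCUPIED cells are never overwritten, mirroring the paper's
    stated guarantee; the grid representation itself is an assumption."""
    fused = observed.copy()
    mask = observed == UNKNOWN
    fused[mask] = predicted[mask]
    return fused

observed = np.array([[OCCUPIED, UNKNOWN], [FREE, UNKNOWN]])
predicted = np.array([[FREE, OCCUPIED], [OCCUPIED, FREE]])
print(merge_prediction(observed, predicted))   # observed cells stay unchanged
```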
https://arxiv.org/abs/2403.11985
Prior parameter distributions provide an elegant way to represent prior expert and world knowledge for informed learning. Previous work has shown that using such informative priors to regularize probabilistic deep learning (DL) models increases their performance and data-efficiency. However, commonly used sampling-based approximations for probabilistic DL models can be computationally expensive, requiring multiple inference passes and longer training times. Promising alternatives are compute-efficient last layer kernel approximations like spectral normalized Gaussian processes (SNGPs). We propose a novel regularization-based continual learning method for SNGPs, which enables the use of informative priors that represent prior knowledge learned from previous tasks. Our proposal builds upon well-established methods and requires no rehearsal memory or parameter expansion. We apply our informed SNGP model to the trajectory prediction problem in autonomous driving by integrating prior drivability knowledge. On two public datasets, we investigate its performance under diminishing training data and across locations, and thereby demonstrate an increase in data-efficiency and robustness to location-transfers over non-informed and informed baselines.
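The regularization idea can be sketched as a task loss plus a quadratic penalty anchoring the trainable weights to the informative prior (an EWC-style objective; the paper's exact formulation for SNGP last layers may differ):

```python
import torch

def informed_loss(nll, params, prior_mean, prior_precision, lam=1.0):
    """Task negative log-likelihood plus a quadratic penalty tying each parameter
    to an informative prior mean, weighted by a per-parameter precision.

    A generic regularization-based continual-learning objective, given here as
    an assumed reading of the approach; `params`, `prior_mean`, and
    `prior_precision` are matching lists of tensors."""
    reg = sum((prec * (p - mu) ** 2).sum()
              for p, mu, prec in zip(params, prior_mean, prior_precision))
    return nll + 0.5 * lam * reg
```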
https://arxiv.org/abs/2403.11966
Unmanned Aerial Vehicles (UAVs) are gaining popularity in civil and military applications. However, uncontrolled access to restricted areas threatens privacy and security. Thus, prevention and detection of UAVs are pivotal to guarantee confidentiality and safety. Although active scanning, mainly based on radars, is one of the most accurate technologies, it can be expensive and less versatile than passive inspections, e.g., object recognition. Dynamic vision sensors (DVS) are bio-inspired, event-based vision models that leverage timestamped pixel-level brightness changes in fast-moving scenes, adapting well to low-latency object detection. This paper presents F-UAV-D (Fast Unmanned Aerial Vehicle Detector), an embedded system that enables fast-moving drone detection. In particular, we propose a setup to exploit DVS as an alternative to RGB cameras in a real-time and low-power configuration. Our approach leverages the high dynamic range (HDR) and background suppression of DVS and, when trained with various fast-moving drones, outperforms RGB input in suboptimal ambient conditions such as low illumination and fast-moving scenes. Our results show that F-UAV-D can (i) detect drones by using less than 15 W on average and (ii) perform real-time inference (i.e., <50 ms) by leveraging the CPU and GPU nodes of our edge computer.
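A sketch of one common way to feed asynchronous DVS output to a standard detector: accumulate events over a short window into a two-channel (polarity) frame. The window length and encoding are assumptions; F-UAV-D's actual event representation may differ:

```python
import numpy as np

def events_to_frame(events, height, width, dt=10e-3, t0=0.0):
    """Accumulate DVS events (t, x, y, polarity) falling in [t0, t0 + dt) into a
    2-channel count frame (one channel per polarity) for a downstream detector.
    This encoding is an assumed, generic choice for illustration."""
    frame = np.zeros((2, height, width), dtype=np.float32)
    for t, x, y, p in events:
        if t0 <= t < t0 + dt:
            frame[int(p > 0), int(y), int(x)] += 1.0
    return frame
```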
https://arxiv.org/abs/2403.11875
This paper presents a novel approach to the challenging problem of autonomous on-ramp merging, where a self-driving vehicle needs to seamlessly integrate into a flow of vehicles on a multi-lane highway. We introduce the Lane-keeping, Lane-changing with Latent-state Inference and Safety Controller (L3IS) agent, designed to perform the on-ramp merging task safely without comprehensive knowledge about surrounding vehicles' intents or driving styles. We also present an augmentation of this agent called AL3IS that accounts for observation delays, allowing the agent to make more robust decisions in real-world environments with vehicle-to-vehicle (V2V) communication delays. By modeling the unobservable aspects of the environment through latent states, such as other drivers' intents, our approach enhances the agent's ability to adapt to dynamic traffic conditions, optimize merging maneuvers, and ensure safe interactions with other vehicles. We demonstrate the effectiveness of our method through extensive simulations generated from real traffic data and compare its performance with existing approaches. L3IS shows a 99.90% success rate in a challenging on-ramp merging case generated from real US Highway 101 data. We further perform a sensitivity analysis on AL3IS to evaluate its robustness against varying observation delays, demonstrating an acceptable 93.84% success rate under a 1-second V2V communication delay.
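Latent-state inference over unobservable driver intent can be illustrated with a one-step discrete Bayes filter; the two-state intent space and the numbers below are toy assumptions, not L3IS's actual estimator:

```python
import numpy as np

def update_intent_belief(belief, likelihoods, transition):
    """One Bayes-filter step over a discrete latent state (e.g. yields vs. not).

    belief: (k,) prior over states; transition: (k, k) row-stochastic dynamics;
    likelihoods: (k,) p(observation | state). All toy stand-ins here."""
    predicted = transition.T @ belief          # propagate through dynamics
    posterior = likelihoods * predicted        # weight by observation model
    return posterior / posterior.sum()

belief = np.array([0.5, 0.5])                  # [yields, does not yield]
transition = np.array([[0.9, 0.1], [0.1, 0.9]])
obs_lik = np.array([0.8, 0.2])                 # e.g. observed deceleration
print(update_intent_belief(belief, obs_lik, transition))  # mass shifts to "yields"
```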
https://arxiv.org/abs/2403.11852
Existing parameter-efficient fine-tuning (PEFT) methods have achieved significant success in adapting vision transformers (ViTs) by improving parameter efficiency. However, enhancing inference efficiency during adaptation remains underexplored. This limits the broader application of pre-trained ViT models, especially when the model is computationally intensive. In this paper, we propose Dynamic Tuning (DyT), a novel approach to improve both parameter and inference efficiency for ViT adaptation. Specifically, besides using lightweight adapter modules, we propose a token dispatcher to distinguish informative tokens from less important ones, allowing the latter to dynamically skip the original block and thereby reducing redundant computation during inference. Additionally, we explore multiple design variants to find the best practice for DyT. Finally, inspired by the mixture-of-experts (MoE) mechanism, we introduce an enhanced adapter to further boost adaptation performance. We validate DyT across various tasks, including image/video recognition and semantic segmentation. For instance, DyT achieves comparable or even superior performance to existing PEFT methods while using only 71%-85% of their FLOPs on the VTAB-1K benchmark.
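A simplified version of the token dispatcher: score tokens, send only the top fraction through the block, and let the rest skip it. DyT's actual gating (and the trick that keeps it differentiable during training) is more involved; shapes and the top-k rule are assumptions:

```python
import torch
import torch.nn as nn

class TokenDispatcher(nn.Module):
    """Route only high-scoring tokens through a transformer block; the rest
    bypass it unchanged, saving the block's FLOPs on those tokens."""
    def __init__(self, dim, keep_ratio=0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)        # per-token informativeness score
        self.keep_ratio = keep_ratio

    def forward(self, x, block):               # x: (B, N, D); block: any module
        scores = self.scorer(x).squeeze(-1)    # (B, N)
        k = max(1, int(self.keep_ratio * x.shape[1]))
        idx = scores.topk(k, dim=1).indices    # indices of informative tokens
        gather_idx = idx.unsqueeze(-1).expand(-1, -1, x.shape[-1])
        picked = torch.gather(x, 1, gather_idx)
        out = x.clone()                        # skipped tokens pass through as-is
        out.scatter_(1, gather_idx, block(picked))
        return out
```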
https://arxiv.org/abs/2403.11808
3D reconstruction has been widely used in autonomous navigation for mobile robotics. However, prior research provides only the basic geometric structure, without the capability of open-world scene understanding, limiting advanced tasks like human interaction and visual navigation. Moreover, traditional 3D scene understanding approaches rely on expensive labeled 3D datasets to train a model for a single task with supervision. Thus, geometric reconstruction with zero-shot scene understanding, i.e., open-vocabulary 3D understanding and reconstruction, is crucial for the future development of mobile robots. In this paper, we propose OpenOcc, a novel framework unifying 3D scene reconstruction and open-vocabulary understanding with neural radiance fields. We model the geometric structure of the scene with an occupancy representation and distill the pre-trained open-vocabulary model into a 3D language field via volume rendering for zero-shot inference. Furthermore, a novel semantic-aware confidence propagation (SCP) method is proposed to relieve the degeneracy of the language-field representation caused by inconsistent measurements in distilled features. Experimental results show that our approach achieves competitive performance on 3D scene understanding tasks, especially for small and long-tail objects.
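One plausible reading of the confidence-weighted aggregation behind SCP — down-weighting inconsistent per-view measurements when fusing distilled language features — sketched under assumed shapes; the paper's semantic-aware propagation scheme is more elaborate:

```python
import torch

def fuse_features(view_feats, confidences, eps=1e-6):
    """Confidence-weighted fusion of per-view distilled language features, so
    inconsistent (low-confidence) measurements contribute less to the 3D field.

    view_feats: (V, C) features from V views; confidences: (V,) non-negative
    weights. A simplified illustration, not OpenOcc's actual SCP."""
    w = confidences / (confidences.sum() + eps)
    return (w.unsqueeze(-1) * view_feats).sum(dim=0)   # (C,) fused feature
```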
https://arxiv.org/abs/2403.11796
The existing methods for evaluating the inference abilities of Large Language Models (LLMs) have been results-centric, making it difficult to assess the inference process. We introduce a new approach using the Abstraction and Reasoning Corpus (ARC) dataset to evaluate the inference and contextual understanding abilities of large language models in a process-centric manner. ARC demands rigorous logical structures for problem-solving, making it a benchmark that facilitates comparing model inference abilities with those of humans. Experimental results confirm that while large language models exhibit some inference ability, they still lag behind humans in logical coherence, compositionality, and productivity. Our experiments highlight the reasoning capabilities of LLMs and propose development paths for achieving human-level reasoning.
https://arxiv.org/abs/2403.11793
Recently, feature relation learning has drawn widespread attention in cross-spectral image patch matching. However, existing research focuses on extracting diverse relations between image patch features and ignores sufficient intrinsic feature representations of individual image patches. We therefore propose a relational representation learning idea that, for the first time, simultaneously focuses on sufficiently mining the intrinsic features of individual image patches and the relations between image patch features. Based on this, we construct a lightweight Relational Representation Learning Network (RRL-Net). Specifically, we construct an autoencoder to fully characterize individual intrinsic features, and introduce a Feature Interaction Learning (FIL) module to extract deep-level feature relations. To further mine individual intrinsic features, a lightweight Multi-dimensional Global-to-Local Attention (MGLA) module is constructed to enhance global feature extraction from individual image patches and capture local dependencies within global features. Combining the MGLA module, we further explore the feature extraction network and construct an Attention-based Lightweight Feature Extraction (ALFE) network. In addition, we propose a Multi-Loss Post-Pruning (MLPP) optimization strategy, which greatly promotes network optimization while avoiding increases in parameters and inference time. Extensive experiments demonstrate that our RRL-Net achieves state-of-the-art (SOTA) performance on multiple public datasets. Our code will be made public later.
https://arxiv.org/abs/2403.11751
Extracting semantic information from generated text is a useful tool for applications such as automated fact checking or retrieval augmented generation. Currently, this requires either separate models during inference, which increases computational cost, or destructive fine-tuning of the language model. Instead, we propose directly embedding information extraction capabilities into pre-trained language models using probing classifiers, enabling efficient simultaneous text generation and information extraction. For this, we introduce an approach called EMBER and show that it enables named entity recognition in decoder-only language models without fine-tuning them and while incurring minimal additional computational cost at inference time. Specifically, our experiments using GPT-2 show that EMBER maintains high token generation rates during streaming text generation, with only a negligible decrease in speed of around 1% compared to a 43.64% slowdown measured for a baseline using a separate NER model. Code and data are available at this https URL.
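The mechanism can be sketched as a linear probe reading NER tags off the frozen LM's hidden states during a normal forward pass; the layer index and tag set below are illustrative assumptions, and the probe is untrained here, so the printed tags are meaningless until it is fit on labeled data:

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

# Probing sketch: a linear head over decoder hidden states yields per-token NER
# tags alongside ordinary generation, with the LM itself left frozen. EMBER's
# actual probe design and training procedure are described in the paper.
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
probe = nn.Linear(lm.config.hidden_size, 5)   # assumed tags: O, PER, ORG, LOC, MISC

inputs = tok("Ada Lovelace worked in London", return_tensors="pt")
with torch.no_grad():
    out = lm(**inputs, output_hidden_states=True)
    tags = probe(out.hidden_states[8]).argmax(-1)   # layer 8 is an arbitrary choice
print(tags)   # per-token tag ids, computed from states the LM produced anyway
```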
https://arxiv.org/abs/2403.11747
Within the area of speech enhancement, there is an ongoing interest in the creation of neural systems which explicitly aim to improve the perceptual quality of the processed audio. In concert with this is the topic of non-intrusive (i.e., without clean reference) speech quality prediction, for which neural networks are trained to predict human-assigned quality labels directly from distorted audio. When combined, these areas allow for the creation of powerful new speech enhancement systems which can leverage large real-world datasets of distorted audio, by taking the inference of a pre-trained speech quality predictor as the sole loss function of the speech enhancement system. This paper aims to identify a potential pitfall with this approach, namely hallucinations which are introduced by the enhancement system "tricking" the speech quality predictor.
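The training setup under scrutiny is easy to state: the enhancer's only loss is the (negated) score of a frozen quality predictor, with no clean reference. A minimal sketch with placeholder callables:

```python
import torch

def enhancement_loss(enhancer, quality_predictor, noisy_batch):
    """Train the enhancer purely to maximize a frozen non-intrusive quality
    predictor's score -- the setup whose failure mode (the enhancer "tricking"
    the predictor into rating hallucinated content highly) the paper examines.
    Both models are hypothetical stand-ins here."""
    enhanced = enhancer(noisy_batch)
    predicted_mos = quality_predictor(enhanced)   # no clean reference needed
    return -predicted_mos.mean()                  # higher predicted quality = lower loss
```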
https://arxiv.org/abs/2403.11732
Multi-Source Diffusion Models (MSDM) allow for compositional musical generation tasks: generating a set of coherent sources, creating accompaniments, and performing source separation. Despite their versatility, they require estimating the joint distribution over the sources, necessitating pre-separated musical data, which is rarely available, and fixing the number and type of sources at training time. This paper generalizes MSDM to arbitrary time-domain diffusion models conditioned on text embeddings. These models do not require separated data as they are trained on mixtures, can parameterize an arbitrary number of sources, and allow for rich semantic control. We propose an inference procedure enabling the coherent generation of sources and accompaniments. Additionally, we adapt the Dirac separator of MSDM to perform source separation. We experiment with diffusion models trained on Slakh2100 and MTG-Jamendo, showcasing competitive generation and separation results in a relaxed data setting.
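The mixture-consistency idea behind a Dirac-style separator can be sketched as projecting per-source estimates so they sum to the observed mixture; spreading the residual equally is a simplification, and the paper applies the constraint inside the diffusion sampler rather than as a standalone step:

```python
import torch

def project_to_mixture(sources, mixture):
    """Constrain per-source estimates to sum to the observed mixture by spreading
    the residual equally across sources. sources: (S, T) waveforms, mixture: (T,).
    A simplified illustration of mixture consistency, not the paper's separator."""
    residual = mixture - sources.sum(dim=0)
    return sources + residual / sources.shape[0]

sources = torch.randn(4, 16000)
mixture = torch.randn(16000)
fixed = project_to_mixture(sources, mixture)
assert torch.allclose(fixed.sum(dim=0), mixture, atol=1e-5)
```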
https://arxiv.org/abs/2403.11706
Visual encoding constitutes the basis of large multimodal models (LMMs) in understanding the visual world. Conventional LMMs process images in fixed sizes and limited resolutions, while recent explorations in this direction are limited in adaptivity, efficiency, and even correctness. In this work, we first take GPT-4V and LLaVA-1.5 as representative examples and expose systematic flaws rooted in their visual encoding strategy. To address the challenges, we present LLaVA-UHD, a large multimodal model that can efficiently perceive images in any aspect ratio and high resolution. LLaVA-UHD includes three key components: (1) an image modularization strategy that divides native-resolution images into smaller variable-sized slices for efficient and extensible encoding, (2) a compression module that further condenses image tokens from visual encoders, and (3) a spatial schema to organize slice tokens for LLMs. Comprehensive experiments show that LLaVA-UHD outperforms established LMMs trained with 2-3 orders of magnitude more data on 9 benchmarks. Notably, our model built on LLaVA-1.5 336x336 supports 6x larger (i.e., 672x1088) resolution images using only 94% of the inference computation, and achieves a 6.4-point accuracy improvement on TextVQA. Moreover, the model can be efficiently trained in academic settings, within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5). We make the data and code publicly available at this https URL.
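The modularization step can be sketched as choosing a slice grid whose cells stay close to the encoder's native input size while matching the image's aspect ratio; the scoring below is a simplified stand-in for the paper's actual criterion:

```python
import math

def choose_slice_grid(width, height, slice_area=336 * 336, max_slices=9):
    """Pick a (cols, rows) grid for a native-resolution image: slice count keeps
    per-slice area near the encoder's input size, and the factorization keeps
    slices close to square. A simplified, assumed version of the modularization."""
    target = max(1, round(width * height / slice_area))
    best, best_err = (1, 1), float("inf")
    for n in range(max(1, target - 1), min(max_slices, target + 1) + 1):
        for cols in range(1, n + 1):
            if n % cols:
                continue
            rows = n // cols
            err = abs(math.log((width / cols) / (height / rows)))  # slice squareness
            if err < best_err:
                best, best_err = (cols, rows), err
    return best

print(choose_slice_grid(672, 1088))   # a taller-than-wide image gets e.g. (2, 3)
```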
https://arxiv.org/abs/2403.11703
We study zero-shot instance navigation, in which the agent navigates to a specific object without using object annotations for training. Previous object navigation approaches apply the image-goal navigation (ImageNav) task (go to the location of an image) for pretraining, and transfer the agent to achieve object goals using a vision-language model. However, these approaches lead to issues of semantic neglect, where the model fails to learn meaningful semantic alignments. In this paper, we propose a Prioritized Semantic Learning (PSL) method to improve the semantic understanding ability of navigation agents. Specifically, a semantic-enhanced PSL agent is proposed and a prioritized semantic training strategy is introduced to select goal images that exhibit clear semantic supervision and relax the reward function from strict exact view matching. At inference time, a semantic expansion inference scheme is designed to preserve the same granularity level of the goal-semantic as training. Furthermore, for the popular HM3D environment, we present an Instance Navigation (InstanceNav) task that requires going to a specific object instance with detailed descriptions, as opposed to the Object Navigation (ObjectNav) task where the goal is defined merely by the object category. Our PSL agent outperforms the previous state-of-the-art by 66% on zero-shot ObjectNav in terms of success rate and is also superior on the new InstanceNav task. Code will be released at https://anonymous.4open.science/r/PSL/.
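Prioritized semantic training can be sketched as sampling goal images with probability increasing in a semantic clarity score; the softmax weighting is an assumption, and the scoring itself (part of the paper's contribution) is taken as given:

```python
import numpy as np

def sample_goal_images(semantic_scores, batch_size, temperature=1.0, rng=None):
    """Sample training goal images with probability increasing in their semantic
    clarity score, so frames with clear semantic supervision dominate training.
    The softmax weighting is an assumed, generic prioritization rule."""
    if rng is None:
        rng = np.random.default_rng()
    logits = np.asarray(semantic_scores, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(probs), size=batch_size, p=probs)
```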
https://arxiv.org/abs/2403.11650
Context Optimization (CoOp) has emerged as a simple yet effective technique for adapting CLIP-like vision-language models to downstream image recognition tasks. Nevertheless, learning compact context with satisfactory base-to-new, domain, and cross-task generalization ability while adapting to new tasks remains a challenge. To tackle this challenge, we propose a lightweight yet generalizable approach termed Compositional Kronecker Context Optimization (CK-CoOp). Technically, the prompt's context words in CK-CoOp are learnable vectors, crafted by linearly combining base vectors sourced from a dictionary. These base vectors consist of a non-learnable component obtained by quantizing the weights in the token embedding layer, and a learnable component constructed by applying the Kronecker product to several learnable tiny matrices. Intuitively, the compositional structure mitigates the risk of overfitting to the training data by retaining more pre-trained knowledge. Meanwhile, the Kronecker product breaks the non-learnable restrictions of the dictionary, thereby enhancing representation ability with minimal additional parameters. Extensive experiments confirm that CK-CoOp not only achieves state-of-the-art performance under base-to-new, domain, and cross-task generalization evaluation, but also requires fewer learnable parameters and offers efficient training and inference speed.
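A simplified sketch of the context construction: learnable combinations of dictionary base vectors, each a frozen quantized-embedding component plus a learnable Kronecker-product component. For brevity this uses Kronecker products of vectors rather than the paper's tiny matrices, and all shapes are illustrative:

```python
import torch
import torch.nn as nn

class CKContext(nn.Module):
    """Prompt context = learnable combinations of dictionary base vectors, where
    each base vector is a frozen (quantized token-embedding) component plus a
    learnable Kronecker-product component."""
    def __init__(self, frozen_basis, n_ctx=4, m=8):
        super().__init__()
        self.register_buffer("basis", frozen_basis)        # (K, D), non-learnable
        K, D = frozen_basis.shape
        assert D % m == 0
        self.u = nn.Parameter(torch.randn(K, m) * 0.02)    # tiny learnable factors
        self.v = nn.Parameter(torch.randn(K, D // m) * 0.02)
        self.coef = nn.Parameter(torch.randn(n_ctx, K) * 0.02)

    def forward(self):
        # Per-base outer product flattened == kron(u_k, v_k), length m*(D//m) = D.
        learnable = torch.einsum("km,kn->kmn", self.u, self.v).reshape_as(self.basis)
        return self.coef @ (self.basis + learnable)        # (n_ctx, D) context vectors
```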
https://arxiv.org/abs/2403.11631
Reconstructing photo-realistic drivable human avatars from multi-view image sequences has been a popular and challenging topic in the field of computer vision and graphics. While existing NeRF-based methods can achieve high-quality novel-view rendering of human models, both training and inference processes are time-consuming. Recent approaches have utilized 3D Gaussians to represent the human body, enabling faster training and rendering. However, they undermine the importance of mesh guidance and directly predict Gaussians in 3D space with coarse mesh guidance. This hinders the learning procedure of the Gaussians and tends to produce blurry textures. Therefore, we propose UV Gaussians, which models the 3D human body by jointly learning mesh deformations and 2D UV-space Gaussian textures. We utilize the embedding of the UV map to learn Gaussian textures in 2D space, leveraging the capabilities of powerful 2D networks to extract features. Additionally, through an independent mesh network, we optimize pose-dependent geometric deformations, thereby guiding Gaussian rendering and significantly enhancing rendering quality. We collect and process a new dataset of human motion, which includes multi-view images, scanned models, parametric model registration, and corresponding texture maps. Experimental results demonstrate that our method achieves state-of-the-art synthesis of novel views and novel poses. The code and data will be made available on the homepage at this https URL once the paper is accepted.
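Reading per-Gaussian parameters from a 2D UV-space texture at mesh UV coordinates can be sketched as a bilinear lookup; the channel packing and the use of grid_sample are assumptions for illustration, not the paper's exact pipeline:

```python
import torch
import torch.nn.functional as F

def gaussians_from_uv(texture, uv):
    """Read per-Gaussian parameters from a 2D UV-space texture at mesh UV sites.

    texture: (1, C, H, W) output of a 2D network over UV space; uv: (N, 2) in
    [0, 1]. C is assumed to pack e.g. offset(3) + scale(3) + rotation(4) +
    opacity(1) + color(3); bilinear sampling is an illustrative choice."""
    grid = uv.view(1, 1, -1, 2) * 2 - 1          # to grid_sample's [-1, 1] range
    samples = F.grid_sample(texture, grid, align_corners=False)  # (1, C, 1, N)
    return samples[0, :, 0].t()                  # (N, C) per-Gaussian parameters
```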
https://arxiv.org/abs/2403.11589