Street-view imagery has been widely applied as a crucial mobile mapping data source. Inpainting street-view images is a critical step in street-view image processing, not only for privacy protection but also for urban environment mapping applications. This paper presents a novel Deep Neural Network (DNN), the multi-scale semantic prior Feature guided image inpainting Network (MFN), for inpainting street-view images, which generates static street-view images without moving objects (e.g., pedestrians, vehicles). To enhance global context understanding, a semantic prior prompter is introduced to learn rich semantic priors from a large pre-trained model. We design the prompter by stacking multiple Semantic Pyramid Aggregation (SPA) modules, capturing a broad range of visual feature patterns. A semantic-enhanced image generator with a decoder is proposed that incorporates a novel cascaded Learnable Prior Transferring (LPT) module at each scale level. For each decoder block, an attention transfer mechanism is applied to capture long-term dependencies, and the semantic prior features are fused with the image features to restore plausible structure in an adaptive manner. Additionally, a background-aware data processing scheme is adopted to prevent the generation of hallucinated objects within holes. Experiments on the Apolloscapes and Cityscapes datasets demonstrate better performance than state-of-the-art methods, with MAE and LPIPS improving by about 9.5% and 41.07%, respectively. A visual comparison survey among multiple groups of participants was also conducted for performance evaluation, and the results suggest that the proposed MFN offers a promising solution for privacy protection and generates more reliable scenes for urban applications using street-view images.
https://arxiv.org/abs/2405.10504
In the facial expression recognition task, researchers often obtain low expression classification accuracy due to the small number of training samples. To address this problem, we propose a new data augmentation method named MixCut. In this method, we first interpolate two original training samples at the pixel level in a random ratio to generate new samples. Then, pixel removal is performed on random square regions of the new samples to generate the final training samples. We evaluated MixCut on Fer2013Plus and RAF-DB. With MixCut, we achieved 85.63% accuracy on eight-label classification on Fer2013Plus and 87.88% accuracy on seven-label classification on RAF-DB, effectively improving the classification accuracy of facial expression recognition. Meanwhile, on Fer2013Plus, MixCut achieved performance improvements of +0.59%, +0.36%, and +0.39% over three other data augmentation methods: CutOut, Mixup, and CutMix, respectively. On RAF-DB, MixCut improves classification accuracy over these three methods by +0.22%, +0.65%, and +0.5%, respectively.
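As a rough illustration of the two steps described above, the sketch below combines Mixup-style pixel interpolation with CutOut-style square removal. The Beta-distributed mixing ratio, the region size, and the choice to leave labels at the Mixup ratio (the abstract does not specify how the removed region affects labels) are assumptions, not the paper's exact recipe.

```python
import torch

def mixcut(images, labels, alpha=1.0, cut_size=0.3):
    """images: (B, C, H, W) float tensor; labels: (B, num_classes) one-hot floats."""
    batch = images.size(0)
    perm = torch.randperm(batch, device=images.device)

    # Step 1 (Mixup-style): interpolate two samples at the pixel level with a random ratio.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]

    # Step 2 (CutOut-style): remove (zero out) a random square region of each mixed image.
    h, w = images.shape[-2:]
    side = max(1, int(cut_size * min(h, w)))
    cy = torch.randint(0, h - side + 1, (1,)).item()
    cx = torch.randint(0, w - side + 1, (1,)).item()
    mixed[..., cy:cy + side, cx:cx + side] = 0.0
    return mixed, mixed_labels
```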
https://arxiv.org/abs/2405.10489
Fully supervised deep learning approaches have demonstrated impressive accuracy in sea ice classification, but their dependence on high-resolution labels presents a significant challenge due to the difficulty of obtaining such data. In response, our weakly supervised learning method provides a compelling alternative by utilizing lower-resolution regional labels from expert-annotated ice charts. This approach achieves exceptional pixel-level classification performance by introducing regional loss representations during training to measure the disparity between predicted and ice chart-derived sea ice type distributions. Leveraging the AI4Arctic Sea Ice Challenge Dataset, our method outperforms the fully supervised U-Net benchmark, the top solution of the AutoIce challenge, in both mapping resolution and class-wise accuracy, marking a significant advancement in automated operational sea ice mapping.
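To make the regional loss idea above concrete, here is a minimal sketch that aggregates per-pixel class probabilities over one ice-chart polygon and compares the result to the chart-derived sea ice type distribution. The function name, the KL-divergence formulation, and the tensor layout are assumptions rather than the paper's exact definition.

```python
import torch
import torch.nn.functional as F

def regional_loss(logits, region_mask, chart_distribution, eps=1e-8):
    """logits: (C, H, W) network output; region_mask: (H, W) bool mask of one chart polygon;
    chart_distribution: (C,) class fractions derived from the expert ice chart."""
    probs = F.softmax(logits, dim=0)              # per-pixel class probabilities
    region_probs = probs[:, region_mask]          # (C, N) probabilities inside the polygon
    predicted_dist = region_probs.mean(dim=1)     # predicted class distribution for the region
    # KL divergence between the chart-derived and predicted distributions.
    return torch.sum(chart_distribution * (torch.log(chart_distribution + eps)
                                           - torch.log(predicted_dist + eps)))
```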
https://arxiv.org/abs/2405.10456
Locating an object in a sequence of frames, given its appearance in the first frame of the sequence, is a hard problem that involves many stages. Usually, state-of-the-art methods focus on introducing novel ideas in the visual encoding or relational modelling phases. However, in this work, we show that bounding box regression from learned joint search and template features is of high importance as well. While previous methods relied heavily on well-learned features representing interactions between search and template, we hypothesize that the receptive field of the input convolutional bounding box network plays an important role in accurately determining the object location. To this end, we introduce two novel bounding box regression networks: inception and deformable. Experiments and ablation studies show that our inception module installed on the recent ODTrack outperforms the latter on three benchmarks: GOT-10k, UAV123, and OTB2015.
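As a rough sketch of what an inception-style bounding box regression head could look like on top of joint search-template features, the module below fuses parallel branches with different kernel sizes (enlarging the effective receptive field) before regressing per-location box offsets. Channel sizes, kernel choices, and the (l, t, r, b) output convention are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class InceptionBBoxHead(nn.Module):
    def __init__(self, in_channels=256, mid=64):
        super().__init__()
        # Parallel branches with different kernel sizes widen the receptive field.
        self.branch1 = nn.Conv2d(in_channels, mid, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, mid, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, mid, kernel_size=5, padding=2)
        self.head = nn.Sequential(
            nn.Conv2d(3 * mid, mid, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, 4, kernel_size=1),  # per-location (l, t, r, b) box offsets
        )

    def forward(self, feats):
        # feats: (B, C, H, W) joint search-template feature map from the tracker backbone.
        fused = torch.cat([self.branch1(feats), self.branch3(feats), self.branch5(feats)], dim=1)
        return self.head(fused)
```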
https://arxiv.org/abs/2405.10444
Single object tracking is a vital task of many applications in critical fields. However, it is still considered one of the most challenging vision tasks. In recent years, computer vision, especially object tracking, witnessed the introduction or adoption of many novel techniques, setting new fronts for performance. In this survey, we visit some of the cutting-edge techniques in vision, such as Sequence Models, Generative Models, Self-supervised Learning, Unsupervised Learning, Reinforcement Learning, Meta-Learning, Continual Learning, and Domain Adaptation, focusing on their application in single object tracking. We propose a novel categorization of single object tracking methods based on novel techniques and trends. Also, we conduct a comparative analysis of the performance reported by the methods presented on popular tracking benchmarks. Moreover, we analyze the pros and cons of the presented approaches and present a guide for non-traditional techniques in single object tracking. Finally, we suggest potential avenues for future research in single-object tracking.
https://arxiv.org/abs/2405.10439
This paper addresses the problem of diversity-aware sign language production: given an image (or sequence) of a signer, we aim to produce another image with the same pose but different attributes (e.g., gender, skin color). To this end, we extend the variational inference paradigm to include information about the pose and the conditioning of the attributes. This formulation improves the quality of the synthesised images. The generator framework is presented as a UNet architecture to ensure spatial preservation of the input pose, and we include the visual features from the variational inference to maintain control over appearance and style. We generate each body part with a separate decoder. This architecture allows the generator to deliver better overall results. Experiments on the SMILE II dataset show that the proposed model performs quantitatively better than state-of-the-art baselines regarding diversity, per-pixel image quality, and pose estimation. Qualitatively, it faithfully reproduces non-manual features for signers.
https://arxiv.org/abs/2405.10423
The Unmanned Aerial Vehicle (UAV) market has been growing significantly, and, considering the availability of drones at low prices, the possibility of misusing them for illegal purposes such as drug trafficking, spying, and terrorist attacks, posing high risks to national security, is rising. Therefore, detecting and tracking unauthorized drones to prevent future attacks that threaten lives, facilities, and security becomes a necessity. Drone detection can be performed using different sensors, image-based detection being one of them owing to the development of artificial intelligence techniques. However, identifying the type of an unauthorized drone remains a challenge due to the lack of drone-type datasets. To that end, in this paper, we provide a dataset of various drones as well as a comparison of recognized object detection models on the proposed dataset, including different versions of the YOLO algorithm (v3, v4, and v5) along with Detectronv2. Experimental results for the different models are provided along with a description of each method. The collected dataset can be found in this https URL
https://arxiv.org/abs/2405.10398
Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data. It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates. To facilitate the use of referent tokens in subsequent language modeling, we have curated large-scale grounded language datasets that offer finer scene-text correspondence at the phrase level by bootstrapping existing object labels. Subsequently, we introduced Contrastive LAnguage-Scene Pre-training (CLASP) to effectively leverage this data, thereby integrating 3D vision with language models. Our comprehensive evaluation covers open-ended tasks like dense captioning and 3D QA, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and the broad applicability of Grounded 3D-LLM. Code and datasets will be released on the project page: this https URL.
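For intuition on how contrastive language-scene pre-training could align scene referent features with phrase embeddings, the snippet below sketches a symmetric CLIP-style InfoNCE loss. The paper's actual phrase-level formulation may differ; the temperature, tensor shapes, and function name are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_scene_text_loss(scene_feats, phrase_feats, temperature=0.07):
    """scene_feats, phrase_feats: (N, D) matched pairs of scene-referent and phrase embeddings."""
    scene = F.normalize(scene_feats, dim=-1)
    phrase = F.normalize(phrase_feats, dim=-1)
    logits = scene @ phrase.t() / temperature                    # (N, N) similarity matrix
    targets = torch.arange(scene.size(0), device=scene.device)   # diagonal entries are positives
    # Symmetric InfoNCE: each scene feature should match its paired phrase and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```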
https://arxiv.org/abs/2405.10370
Integrating an RGB camera into a ToF imaging system has become a significant technique for perceiving the real world. The RGB guided ToF imaging system is crucial to several applications, including face anti-spoofing, saliency detection, and trajectory prediction. Depending on the distance of the working range, the implementation schemes of the RGB guided ToF imaging systems are different. Specifically, ToF sensors with a uniform field of illumination, which can output dense depth but have low resolution, are typically used for close-range measurements. In contrast, LiDARs, which emit laser pulses and can only capture sparse depth, are usually employed for long-range detection. In the two cases, depth quality improvement for RGB guided ToF imaging corresponds to two sub-tasks: guided depth super-resolution and guided depth completion. In light of the recent significant boost to the field provided by deep learning, this paper comprehensively reviews the works related to RGB guided ToF imaging, including network structures, learning strategies, evaluation metrics, benchmark datasets, and objective functions. Besides, we present quantitative comparisons of state-of-the-art methods on widely used benchmark datasets. Finally, we discuss future trends and the challenges in real applications for further research.
https://arxiv.org/abs/2405.10357
In this work, we recover the underlying 3D structure of non-geometrically consistent scenes. We focus our analysis on hand-drawn images from cartoons and anime. Many cartoons are created by artists without a 3D rendering engine, which means that any new image of a scene is hand-drawn. The hand-drawn images are usually faithful representations of the world, but only in a qualitative sense, since it is difficult for humans to draw multiple perspectives of an object or scene in a 3D-consistent way. Nevertheless, people can easily perceive 3D scenes from inconsistent inputs! In this work, we correct for 2D drawing inconsistencies to recover a plausible 3D structure such that the newly warped drawings are consistent with each other. Our pipeline consists of a user-friendly annotation tool, camera pose estimation, and image deformation to recover a dense structure. Our method warps images to obey a perspective camera model, enabling our aligned results to be plugged into novel-view synthesis reconstruction methods to experience cartoons from viewpoints never drawn before. Our project page is https://toon3d.studio/.
https://arxiv.org/abs/2405.10320
Vector graphics are widely used in digital art and highly favored by designers due to their scalability and layer-wise properties. However, the process of creating and editing vector graphics requires creativity and design expertise, making it a time-consuming task. Recent advancements in text-to-vector (T2V) generation have aimed to make this process more accessible. However, existing T2V methods directly optimize control points of vector graphics paths, often resulting in intersecting or jagged paths due to the lack of geometry constraints. To overcome these limitations, we propose a novel neural path representation by designing a dual-branch Variational Autoencoder (VAE) that learns the path latent space from both sequence and image modalities. By optimizing the combination of neural paths, we can incorporate geometric constraints while preserving expressivity in generated SVGs. Furthermore, we introduce a two-stage path optimization method to improve the visual and topological quality of generated SVGs. In the first stage, a pre-trained text-to-image diffusion model guides the initial generation of complex vector graphics through the Variational Score Distillation (VSD) process. In the second stage, we refine the graphics using a layer-wise image vectorization strategy to achieve clearer elements and structure. We demonstrate the effectiveness of our method through extensive experiments and showcase various applications. The project page is this https URL.
https://arxiv.org/abs/2405.10317
Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.
https://arxiv.org/abs/2405.10316
Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real-time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation. See our project page for results and interactive demos at this https URL .
https://arxiv.org/abs/2405.10314
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at this https URL
https://arxiv.org/abs/2405.10300
In this work, our goals are two fold: large-vocabulary continuous sign language recognition (CSLR), and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that is able to ingest a signing sequence and output in a joint embedding space between signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new dataset annotations that have been manually collected. These provide continuous sign-level annotations for six hours of test videos, and will be made publicly available. We demonstrate that by a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance -- retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels, and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.
https://arxiv.org/abs/2405.10266
Deep learning models, particularly Convolutional Neural Networks (CNNs), have demonstrated exceptional performance in diagnosing skin diseases, often outperforming dermatologists. However, they have also unveiled biases linked to specific demographic traits, notably concerning diverse skin tones or gender, prompting concerns regarding fairness and limiting their widespread deployment. Researchers are actively working to ensure fairness in AI-based solutions, but existing methods incur an accuracy loss when striving for fairness. To solve this issue, we propose a `two-biased teachers' (i.e., biased on different sensitive attributes) based approach to transfer fair knowledge into the student network. Our approach mitigates biases present in the student network without harming its predictive accuracy. In fact, in most cases, our approach improves the accuracy of the baseline model. To achieve this goal, we developed a weighted loss function comprising biasing and debiasing loss terms. We surpassed available state-of-the-art approaches to attain fairness and also improved the accuracy at the same time. The proposed approach has been evaluated and validated on two dermatology datasets using standard accuracy and fairness evaluation measures. We will make source code publicly available to foster reproducibility and future research.
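A minimal sketch of distilling from two teachers that are biased on different sensitive attributes into a single student via a weighted loss is given below. The temperature-scaled KL terms, the loss weights, and the equal averaging of the two teachers are assumptions; the paper's biasing and debiasing terms may be defined differently.

```python
import torch
import torch.nn.functional as F

def two_teacher_loss(student_logits, teacher_a_logits, teacher_b_logits,
                     targets, temperature=2.0, w_task=1.0, w_distill=0.5):
    # Standard supervised term so predictive accuracy is preserved.
    task_loss = F.cross_entropy(student_logits, targets)
    t = temperature
    student_log_probs = F.log_softmax(student_logits / t, dim=1)
    # Distill from both biased teachers so their attribute-specific biases offset each other.
    kd_a = F.kl_div(student_log_probs, F.softmax(teacher_a_logits / t, dim=1), reduction="batchmean")
    kd_b = F.kl_div(student_log_probs, F.softmax(teacher_b_logits / t, dim=1), reduction="batchmean")
    return w_task * task_loss + w_distill * (t * t) * (kd_a + kd_b) / 2
```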
https://arxiv.org/abs/2405.10256
As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: this https URL.
https://arxiv.org/abs/2405.10255
Foundation models in computational pathology promise to unlock the development of new clinical decision support systems and models for precision medicine. However, there is a mismatch between most clinical analysis, which is defined at the level of one or more whole slide images, and foundation models to date, which process the thousands of image tiles contained in a whole slide image separately. The requirement to train a network to aggregate information across a large number of tiles in multiple whole slide images limits these models' impact. In this work, we present a slide-level foundation model for H&E-stained histopathology, PRISM, that builds on Virchow tile embeddings and leverages clinical report text for pre-training. Using the tile embeddings, PRISM produces slide-level embeddings with the ability to generate clinical reports, resulting in several modes of use. Using text prompts, PRISM achieves zero-shot cancer detection and sub-typing performance approaching and surpassing that of a supervised aggregator model. Using the slide embeddings with linear classifiers, PRISM surpasses supervised aggregator models. Furthermore, we demonstrate that fine-tuning of the PRISM slide encoder yields label-efficient training for biomarker prediction, a task that typically suffers from low availability of training data; an aggregator initialized with PRISM and trained on as little as 10% of the training data can outperform a supervised baseline that uses all of the data.
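As an illustration of the linear-classifier usage mentioned above, the snippet below fits a linear probe on frozen slide-level embeddings. The embedding dimension, the binary label, and the use of scikit-learn are placeholders, not details from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
slide_embeddings = rng.normal(size=(200, 1280))   # stand-in for slide-level embeddings
labels = rng.integers(0, 2, size=200)             # e.g., a binary slide-level label

probe = LogisticRegression(max_iter=1000)
probe.fit(slide_embeddings[:150], labels[:150])   # train a linear classifier on frozen embeddings
print("held-out accuracy:", probe.score(slide_embeddings[150:], labels[150:]))
```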
https://arxiv.org/abs/2405.10254
Brain lesion segmentation plays an essential role in neurological research and diagnosis. As brain lesions can be caused by various pathological alterations, different types of brain lesions tend to manifest with different characteristics on different imaging modalities. Due to this complexity, brain lesion segmentation methods are often developed in a task-specific manner. A specific segmentation model is developed for a particular lesion type and imaging modality. However, the use of task-specific models requires predetermination of the lesion type and imaging modality, which complicates their deployment in real-world scenarios. In this work, we propose a universal foundation model for 3D brain lesion segmentation, which can automatically segment different types of brain lesions for input data of various imaging modalities. We formulate a novel Mixture of Modality Experts (MoME) framework with multiple expert networks attending to different imaging modalities. A hierarchical gating network combines the expert predictions and fosters expertise collaboration. Furthermore, we introduce a curriculum learning strategy during training to avoid the degeneration of each expert network and preserve their specialization. We evaluated the proposed method on nine brain lesion datasets, encompassing five imaging modalities and eight lesion types. The results show that our model outperforms state-of-the-art universal models and provides promising generalization to unseen datasets.
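A simplified sketch of gating several modality experts into one prediction, in the spirit of the MoME framework described above, follows. The flat (non-hierarchical) gate, the softmax weighting, and the expert/gate interfaces are assumptions, and the curriculum learning strategy is omitted.

```python
import torch
import torch.nn as nn

class SimpleMoME(nn.Module):
    def __init__(self, experts, gate):
        super().__init__()
        self.experts = nn.ModuleList(experts)  # one segmentation network per imaging modality
        self.gate = gate                       # maps the input volume to per-expert weights (B, E)

    def forward(self, volume):
        # Each expert predicts segmentation logits for the same input volume.
        expert_outputs = torch.stack([e(volume) for e in self.experts], dim=1)  # (B, E, C, ...)
        weights = torch.softmax(self.gate(volume), dim=1)                        # (B, E)
        # Broadcast gate weights over the spatial/class dimensions and combine experts.
        weights = weights.view(*weights.shape, *([1] * (expert_outputs.dim() - 2)))
        return (weights * expert_outputs).sum(dim=1)
```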
https://arxiv.org/abs/2405.10246
We identify an issue in multi-task learnable compression, in which a representation learned for one task does not positively contribute to the rate-distortion performance of a different task as much as expected, given the estimated amount of information available in it. We interpret this issue using the predictive $\mathcal{V}$-information framework. In learnable scalable coding, previous work increased the utilization of side-information for input reconstruction by also rewarding input reconstruction when learning this shared representation. We evaluate the impact of this idea in the context of input reconstruction more rigorously and extend it to other computer vision tasks. We perform experiments using representations trained for object detection on COCO 2017 and depth estimation on the Cityscapes dataset, and use them to assist in image reconstruction and semantic segmentation tasks. The results show considerable improvements in the rate-distortion performance of the assisted tasks. Moreover, using the proposed representations, the performance of the base tasks is also improved. The results suggest that the proposed method induces simpler representations that are more compatible with downstream processes.
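A hedged sketch of the kind of objective implied above: the shared base-task representation is trained with an additional input-reconstruction reward plus a rate term. The loss weights, the MSE reconstruction term, and the variable names are assumptions rather than the paper's exact formulation.

```python
import torch

def shared_representation_loss(rate_bits, base_task_loss, recon, inputs,
                               lambda_task=1.0, lambda_recon=0.1, lambda_rate=0.01):
    # Reward input reconstruction from the shared code so it stays useful for assisted tasks.
    recon_loss = torch.mean((recon - inputs) ** 2)
    return lambda_rate * rate_bits + lambda_task * base_task_loss + lambda_recon * recon_loss
```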
https://arxiv.org/abs/2405.10244