Pre-training on Internet data has proven to be a key ingredient for broad generalization in many modern ML systems. What would it take to enable such capabilities in robotic reinforcement learning (RL)? Offline RL methods, which learn from datasets of robot experience, offer one way to incorporate prior data into the robotic learning pipeline. However, these methods have a "type mismatch" with video data (such as Ego4D), the largest prior datasets available for robotics, since video offers observation-only experience without the action or reward annotations needed for RL methods. In this paper, we develop a system for leveraging large-scale human video datasets in robotic offline RL, based entirely on learning value functions via temporal-difference learning. We show that value learning on video datasets learns representations that are more conducive to downstream robotic offline RL than other approaches for learning from video data. Our system, called V-PTR, combines the benefits of pre-training on video data with robotic offline RL approaches that train on diverse robot data, resulting in value functions and policies for manipulation tasks that perform better, act robustly, and generalize broadly. On several manipulation tasks on a real WidowX robot, our framework produces policies that greatly improve over prior methods. Our video and additional details can be found at this https URL.
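As a rough illustration of the pre-training objective described above — a minimal sketch, not the authors' implementation; the value network, the frozen-encoder frame embeddings, and the goal-style surrogate reward are assumptions:

```python
import torch
import torch.nn as nn

# Minimal sketch: TD(0) value learning on observation-only video.
# VideoValueNet and the surrogate reward below are hypothetical stand-ins.

class VideoValueNet(nn.Module):
    def __init__(self, obs_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, obs_embedding):
        return self.net(obs_embedding).squeeze(-1)

def td_loss(value_net, obs_t, obs_tp1, reward, gamma=0.99):
    """One-step temporal-difference error on consecutive video frames."""
    with torch.no_grad():
        target = reward + gamma * value_net(obs_tp1)   # bootstrapped target
    return ((value_net(obs_t) - target) ** 2).mean()

# Toy usage: frame embeddings from any frozen visual encoder.
obs_t, obs_tp1 = torch.randn(32, 512), torch.randn(32, 512)
reward = torch.zeros(32)        # e.g. 0 everywhere, 1 only at goal frames
net = VideoValueNet()
td_loss(net, obs_t, obs_tp1, reward).backward()
```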
https://arxiv.org/abs/2309.13041
Neural radiance fields (NeRF) have revolutionized the field of image-based view synthesis. However, NeRF uses straight rays and fails to deal with complicated light path changes caused by refraction and reflection. This prevents NeRF from successfully synthesizing transparent or specular objects, which are ubiquitous in real-world robotics and AR/VR applications. In this paper, we introduce the refractive-reflective field. Taking the object silhouette as input, we first utilize marching tetrahedra with a progressive encoding to reconstruct the geometry of non-Lambertian objects and then model refraction and reflection effects of the object in a unified framework using Fresnel terms. Meanwhile, to achieve efficient and effective anti-aliasing, we propose a virtual cone supersampling technique. We benchmark our method on different shapes, backgrounds and Fresnel terms on both real-world and synthetic datasets. We also qualitatively and quantitatively benchmark the rendering results of various editing applications, including material editing, object replacement/insertion, and environment illumination estimation. Codes and data are publicly available at this https URL.
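For intuition, the per-intersection computation such a refractive-reflective field relies on can be sketched with Snell's law and Schlick's Fresnel approximation — illustrative physics only, not the paper's code:

```python
import numpy as np

# Bend a ray at a dielectric interface (Snell's law) and weight reflection
# vs. refraction with Schlick's approximation of the Fresnel term.

def refract(d, n, eta):
    """Refract unit direction d at unit normal n, with eta = n1/n2."""
    cos_i = -np.dot(d, n)
    sin2_t = eta**2 * (1.0 - cos_i**2)
    if sin2_t > 1.0:                       # total internal reflection
        return None
    cos_t = np.sqrt(1.0 - sin2_t)
    return eta * d + (eta * cos_i - cos_t) * n

def fresnel_schlick(cos_i, n1=1.0, n2=1.5):
    """Schlick approximation of the Fresnel reflectance."""
    r0 = ((n1 - n2) / (n1 + n2)) ** 2
    return r0 + (1.0 - r0) * (1.0 - cos_i) ** 5

d = np.array([0.0, -1.0, 0.0])             # incoming ray (unit)
n = np.array([0.0, 1.0, 0.0])              # surface normal (unit)
t = refract(d, n, eta=1.0 / 1.5)           # air -> glass
kr = fresnel_schlick(-np.dot(d, n))        # fraction of light reflected
print(t, kr)
```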
https://arxiv.org/abs/2309.13039
Hand-crafted image quality metrics, such as PSNR and SSIM, are commonly used to evaluate model privacy risk under reconstruction attacks. Under these metrics, reconstructed images that are determined to resemble the original one generally indicate more privacy leakage. Images determined as overall dissimilar, on the other hand, indicate higher robustness against attack. However, there is no guarantee that these metrics well reflect human opinions, which, as a judgement for model privacy leakage, are more trustworthy. In this paper, we comprehensively study the faithfulness of these hand-crafted metrics to human perception of privacy information from the reconstructed images. On 5 datasets ranging from natural images, faces, to fine-grained classes, we use 4 existing attack methods to reconstruct images from many different classification models and, for each reconstructed image, we ask multiple human annotators to assess whether this image is recognizable. Our studies reveal that the hand-crafted metrics only have a weak correlation with the human evaluation of privacy leakage and that even these metrics themselves often contradict each other. These observations suggest risks of current metrics in the community. To address this potential risk, we propose a learning-based measure called SemSim to evaluate the Semantic Similarity between the original and reconstructed images. SemSim is trained with a standard triplet loss, using an original image as an anchor, one of its recognizable reconstructed images as a positive sample, and an unrecognizable one as a negative. By training on human annotations, SemSim exhibits a greater reflection of privacy leakage on the semantic level. We show that SemSim has a significantly higher correlation with human judgment compared with existing metrics. Moreover, this strong correlation generalizes to unseen datasets, models and attack methods.
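A minimal sketch of the training signal described above — a standard triplet loss with the original image as anchor, a human-judged recognizable reconstruction as positive, and an unrecognizable one as negative; the ResNet-18 backbone here is an assumption, not necessarily the paper's encoder:

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Triplet training signal for a SemSim-style semantic-similarity measure.
encoder = nn.Sequential(*list(models.resnet18(weights=None).children())[:-1],
                        nn.Flatten())                 # 512-d embedding
triplet = nn.TripletMarginLoss(margin=1.0)

anchor   = torch.randn(8, 3, 224, 224)   # original images
positive = torch.randn(8, 3, 224, 224)   # recognizable reconstructions
negative = torch.randn(8, 3, 224, 224)   # unrecognizable reconstructions

loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```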
https://arxiv.org/abs/2309.13038
Medical imaging plays a crucial role in modern healthcare by providing non-invasive visualisation of internal structures and abnormalities, enabling early disease detection, accurate diagnosis, and treatment planning. This study aims to explore the application of deep learning models, particularly focusing on the UNet architecture and its variants, in medical image segmentation. We seek to evaluate the performance of these models across various challenging medical image segmentation tasks, addressing issues such as image normalization, resizing, architecture choices, loss function design, and hyperparameter tuning. The findings reveal that the standard UNet, when extended with a deep network layer, is a proficient medical image segmentation model, while the Res-UNet and Attention Res-UNet architectures demonstrate smoother convergence and superior performance, particularly when handling fine image details. The study also addresses the challenge of high class imbalance through careful preprocessing and loss function definitions. We anticipate that the results of this study will provide useful insights for researchers seeking to apply these models to new medical imaging problems and offer guidance and best practices for their implementation.
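One common way to counter the class imbalance mentioned above is to combine class-weighted cross-entropy with a soft Dice term; the sketch below illustrates this pattern and is not necessarily the exact loss used in the study:

```python
import torch
import torch.nn.functional as F

# Soft Dice combined with class-weighted cross-entropy for imbalanced segmentation.
def dice_loss(logits, target, eps=1e-6):
    probs = torch.softmax(logits, dim=1)
    one_hot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (denom + eps)).mean()

def combined_loss(logits, target, class_weights):
    ce = F.cross_entropy(logits, target, weight=class_weights)
    return ce + dice_loss(logits, target)

logits = torch.randn(2, 3, 64, 64, requires_grad=True)   # 3-class toy output
target = torch.randint(0, 3, (2, 64, 64))
w = torch.tensor([0.2, 1.0, 1.0])                        # down-weight background
combined_loss(logits, target, w).backward()
```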
https://arxiv.org/abs/2309.13013
The rapid development of AR/VR brings tremendous demands for 3D content. While the widely-used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of computer-human interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators' ideas. Precise drawing from multiple views or strategic step-by-step drawings is often required to tackle the challenge but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only a single free-hand sketch without inputting multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient inference in real-time and a structural-aware adversarial training approach with a Stroke Enhancement Module (SEM) to capture the structural information to facilitate learning of the realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrated the effectiveness of our approach with the state-of-the-art (SOTA) performance on both synthetic and real datasets.
https://arxiv.org/abs/2309.13006
This paper introduces the Point Cloud Network (PCN) architecture, a novel implementation of linear layers in deep learning networks, and provides empirical evidence to advocate for its preference over the Multilayer Perceptron (MLP) in linear layers. We train several models, including the original AlexNet, using both MLP and PCN architectures for direct comparison of linear layers (Krizhevsky et al., 2012). The key results collected are model parameter count and top-1 test accuracy over the CIFAR-10 and CIFAR-100 datasets (Krizhevsky, 2009). AlexNet-PCN16, our PCN equivalent to AlexNet, achieves comparable efficacy (test accuracy) to the original architecture with a 99.5% reduction of parameters in its linear layers. All training is done on cloud RTX 4090 GPUs, leveraging pytorch for model construction and training. Code is provided for anyone to reproduce the trials from this paper.
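To put the reported 99.5% reduction in context, the snippet below counts how many parameters sit in the linear (MLP) classifier head of the standard torchvision AlexNet — the portion a PCN-style layer would replace. It does not reproduce the PCN layer itself, whose details are in the paper:

```python
import torch.nn as nn
import torchvision.models as models

# How much of AlexNet's parameter budget lives in its linear layers?
alexnet = models.alexnet(weights=None)
linear_params = sum(p.numel() for m in alexnet.classifier
                    if isinstance(m, nn.Linear) for p in m.parameters())
total_params = sum(p.numel() for p in alexnet.parameters())
print(f"linear-layer parameters: {linear_params:,} "
      f"({100 * linear_params / total_params:.1f}% of {total_params:,} total)")
```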
https://arxiv.org/abs/2309.12996
The detection and recognition of text in images and videos captured by cameras remains a highly challenging problem. Despite certain advancements achieving high accuracy, current methods still require substantial improvements to be applicable in practical scenarios. Diverging from text detection in images/videos, this paper addresses the issue of text detection within license plates by amalgamating multiple frames of distinct perspectives. For each viewpoint, the proposed method extracts descriptive features characterizing the text components of the license plate, specifically corner points and area. Concretely, we present three viewpoints (view-1, view-2, and view-3) to identify the nearest neighboring components, facilitating the restoration of text components from the same license plate line based on estimated similarity levels and distance metrics. Subsequently, we employ the CnOCR method for text recognition within license plates. Experimental results on the self-collected dataset (PTITPlates), comprising pairs of images in various scenarios, and the publicly available Stanford Cars Dataset demonstrate the superiority of the proposed method over existing approaches.
https://arxiv.org/abs/2309.12972
The detailed images produced by Magnetic Resonance Imaging (MRI) provide life-critical information for the diagnosis and treatment of prostate cancer. To provide standardized acquisition, interpretation and usage of the complex MRI images, the PI-RADS v2 guideline was proposed. An automated segmentation following the guideline facilitates consistent and precise lesion detection, staging and treatment. The guideline recommends a division of the prostate into four zones, PZ (peripheral zone), TZ (transition zone), DPU (distal prostatic urethra) and AFS (anterior fibromuscular stroma). Not every zone shares a boundary with the others, nor is every zone present in every slice. Further, the representations captured by a single model might not suffice for all zones. This motivated us to design a dual-branch convolutional neural network (CNN), where each branch captures the representations of the connected zones separately. Further, the representations from different branches act complementary to each other at the second stage of training, where they are fine-tuned through an unsupervised loss. The loss penalises the difference in predictions from the two branches for the same class. We also incorporate multi-task learning in our framework to further improve the segmentation accuracy. The proposed approach improves the segmentation accuracy of the baseline (mean absolute symmetric distance) by 7.56%, 11.00%, 58.43% and 19.67% for PZ, TZ, DPU and AFS zones respectively.
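A minimal version of the second-stage unsupervised objective described above: penalizing disagreement between the two branches' probability maps for a class they both predict. The class partition and the mean-squared form of the penalty are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

# Consistency penalty between two segmentation branches on shared classes.
def branch_consistency_loss(logits_a, logits_b, shared_classes):
    pa = torch.softmax(logits_a, dim=1)[:, shared_classes]
    pb = torch.softmax(logits_b, dim=1)[:, shared_classes]
    return F.mse_loss(pa, pb)

logits_a = torch.randn(1, 3, 96, 96, requires_grad=True)  # branch A (e.g. PZ/TZ + background)
logits_b = torch.randn(1, 3, 96, 96, requires_grad=True)  # branch B (e.g. DPU/AFS + background)
# assume class index 0 (background) is predicted by both branches
branch_consistency_loss(logits_a, logits_b, shared_classes=[0]).backward()
```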
https://arxiv.org/abs/2309.12970
Open-set object detection aims at detecting arbitrary categories beyond those seen during training. Most recent advancements have adopted the open-vocabulary paradigm, utilizing vision-language backbones to represent categories with language. In this paper, we introduce DE-ViT, an open-set object detector that employs vision-only DINOv2 backbones and learns new categories through example images instead of language. To improve general detection ability, we transform multi-classification tasks into binary classification tasks while bypassing per-class inference, and propose a novel region propagation technique for localization. We evaluate DE-ViT on open-vocabulary, few-shot, and one-shot object detection benchmark with COCO and LVIS. For COCO, DE-ViT outperforms the open-vocabulary SoTA by 6.9 AP50 and achieves 50 AP50 in novel classes. DE-ViT surpasses the few-shot SoTA by 15 mAP on 10-shot and 7.2 mAP on 30-shot and one-shot SoTA by 2.8 AP50. For LVIS, DE-ViT outperforms the open-vocabulary SoTA by 2.2 mask AP and reaches 34.3 mask APr. Code is available at this https URL.
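The example-based (vision-only) recognition idea can be sketched as scoring region features against per-class prototypes built from example-image embeddings, with each class treated as an independent binary decision; DE-ViT's actual heads and region propagation technique are more involved than this:

```python
import torch
import torch.nn.functional as F

# Prototype-based, language-free category scoring (illustrative only).
def build_prototypes(example_feats):           # dict: class -> (k, d) embeddings
    return {c: F.normalize(f.mean(0), dim=0) for c, f in example_feats.items()}

def score_regions(region_feats, prototypes):   # (n, d) region embeddings
    region_feats = F.normalize(region_feats, dim=1)
    return {c: torch.sigmoid(region_feats @ p) for c, p in prototypes.items()}

examples = {"mug": torch.randn(5, 768), "stapler": torch.randn(5, 768)}
protos = build_prototypes(examples)
scores = score_regions(torch.randn(10, 768), protos)   # per-class binary scores
print({c: s.shape for c, s in scores.items()})
```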
https://arxiv.org/abs/2309.12969
Collaborative perception, which greatly enhances the sensing capability of connected and autonomous vehicles (CAVs) by incorporating data from external resources, also brings forth potential security risks. CAVs' driving decisions rely on remote untrusted data, making them susceptible to attacks carried out by malicious participants in the collaborative perception system. However, security analysis and countermeasures for such threats are absent. To understand the impact of the vulnerability, we break the ground by proposing various real-time data fabrication attacks in which the attacker delivers crafted malicious data to victims in order to perturb their perception results, leading to hard brakes or increased collision risks. Our attacks demonstrate a high success rate of over 86% on high-fidelity simulated scenarios and are realizable in real-world experiments. To mitigate the vulnerability, we present a systematic anomaly detection approach that enables benign vehicles to jointly reveal malicious fabrication. It detects 91.5% of attacks with a false positive rate of 3% in simulated scenarios and significantly mitigates attack impacts in real-world scenarios.
https://arxiv.org/abs/2309.12955
The reconstruction kernel in computed tomography (CT) generation determines the texture of the image. Consistency in reconstruction kernels is important as the underlying CT texture can impact measurements during quantitative image analysis. Harmonization (i.e., kernel conversion) minimizes differences in measurements due to inconsistent reconstruction kernels. Existing methods investigate harmonization of CT scans within a single manufacturer or across multiple manufacturers. However, these methods require paired scans of hard and soft reconstruction kernels that are spatially and anatomically aligned. Additionally, a large number of models need to be trained across different kernel pairs within manufacturers. In this study, we adopt an unpaired image translation approach to investigate harmonization between and across reconstruction kernels from different manufacturers by constructing a multipath cycle generative adversarial network (GAN). We use hard and soft reconstruction kernels from the Siemens and GE vendors from the National Lung Screening Trial dataset. We use 50 scans from each reconstruction kernel and train a multipath cycle GAN. To evaluate the effect of harmonization on the reconstruction kernels, we harmonize 50 scans each from Siemens hard kernel, GE soft kernel and GE hard kernel to a reference Siemens soft kernel (B30f) and evaluate percent emphysema. We fit a linear model by considering the age, smoking status, sex and vendor and perform an analysis of variance (ANOVA) on the emphysema scores. Our approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status and vendor on emphysema quantification.
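Percent emphysema, the endpoint used to judge harmonization, is commonly quantified as the fraction of lung voxels below a low-attenuation threshold (typically -950 HU); a toy computation under the assumption that an HU volume and a lung mask are available:

```python
import numpy as np

# Percent emphysema (low-attenuation area) from a CT volume in Hounsfield units.
def percent_emphysema(hu_volume, lung_mask, threshold=-950):
    lung_voxels = hu_volume[lung_mask]
    return 100.0 * np.mean(lung_voxels < threshold)

hu = np.random.normal(-800, 120, size=(64, 128, 128))   # toy HU volume
mask = np.ones_like(hu, dtype=bool)                      # toy lung mask
print(f"{percent_emphysema(hu, mask):.2f}% emphysema")
```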
https://arxiv.org/abs/2309.12953
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels. Recently, a new paradigm has emerged by generating a foreground prediction map (FPM) to achieve pixel-level localization. While existing FPM-based methods use cross-entropy to evaluate the foreground prediction map and to guide the learning of the generator, this paper presents two astonishing experimental observations on the object localization learning process: For a trained network, as the foreground mask expands, 1) the cross-entropy converges to zero when the foreground mask covers only part of the object region. 2) The activation value continuously increases until the foreground mask expands to the object boundary. Therefore, to achieve a more effective localization performance, we argue for the usage of activation value to learn more object regions. In this paper, we propose a Background Activation Suppression (BAS) method. Specifically, an Activation Map Constraint (AMC) module is designed to facilitate the learning of generator by suppressing the background activation value. Meanwhile, by using foreground region guidance and area constraint, BAS can learn the whole region of the object. In the inference phase, we consider the prediction maps of different categories together to obtain the final localization results. Extensive experiments show that BAS achieves significant and consistent improvement over the baseline methods on the CUB-200-2011 and ILSVRC datasets. In addition, our method also achieves state-of-the-art weakly supervised semantic segmentation performance on the PASCAL VOC 2012 and MS COCO 2014 datasets. Code and models are available at this https URL.
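Schematically, the training signal amounts to suppressing activation under the background of the predicted mask while constraining the mask's area; the exact BAS/AMC formulation in the paper may differ from this sketch:

```python
import torch

# Background-activation suppression with an area constraint (schematic).
def bas_style_loss(activation_map, fg_mask, area_weight=1.0):
    bg_activation = (activation_map * (1.0 - fg_mask)).sum() / (
        (1.0 - fg_mask).sum() + 1e-6)        # suppress background activation
    area = fg_mask.mean()                     # area constraint term
    return bg_activation + area_weight * area

act = torch.rand(1, 1, 14, 14)                          # activation map from the classifier
mask = torch.rand(1, 1, 14, 14, requires_grad=True)     # generator's foreground mask
bas_style_loss(act, mask).backward()
```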
https://arxiv.org/abs/2309.12943
Recently, some researchers have started exploring the use of vision transformers (ViTs) for hyperspectral image (HSI) classification and achieved remarkable results. However, the training of ViT models requires a considerable number of training samples, while hyperspectral data, due to its high annotation costs, typically has a relatively small number of training samples. This contradiction has not been effectively addressed. In this paper, aiming to solve this problem, we propose the single-direction tuning (SDT) strategy, which serves as a bridge, allowing us to leverage existing labeled HSI datasets and even RGB datasets to enhance the performance on new HSI datasets with limited samples. The proposed SDT inherits the idea of prompt tuning, aiming to reuse pre-trained models with minimal modifications for adaptation to new tasks. But unlike prompt tuning, SDT is custom-designed to accommodate the characteristics of HSIs. The proposed SDT utilizes a parallel architecture, an asynchronous cold-hot gradient update strategy, and unidirectional interaction. It aims to fully harness the potent representation learning capabilities derived from training on heterologous, even cross-modal datasets. In addition, we also introduce a novel Triplet-structured transformer (Tri-Former), where spectral attention and spatial attention modules are merged in parallel to construct the token mixing component for reducing computation cost and a 3D convolution-based channel mixer module is integrated to enhance stability and keep structure information. Comparison experiments conducted on three representative HSI datasets captured by different sensors demonstrate that the proposed Tri-Former achieves better performance compared to several state-of-the-art methods. Homologous, heterologous and cross-modal tuning experiments verified the effectiveness of the proposed SDT.
https://arxiv.org/abs/2309.12865
Emerging from the monolithic pairwise attention mechanism in conventional Transformer models, there is a growing interest in leveraging sparse interactions that align more closely with biological principles. Approaches including the Set Transformer and the Perceiver employ cross-attention consolidated with a latent space that forms an attention bottleneck with limited capacity. Building upon recent neuroscience studies of Global Workspace Theory and associative memory, we propose the Associative Transformer (AiT). AiT induces low-rank explicit memory that serves as both priors to guide bottleneck attention in the shared workspace and attractors within associative memory of a Hopfield network. Through joint end-to-end training, these priors naturally develop module specialization, each contributing a distinct inductive bias to form attention bottlenecks. A bottleneck can foster competition among inputs for writing information into the memory. We show that AiT is a sparse representation learner, learning distinct priors through the bottlenecks that are complexity-invariant to input quantities and dimensions. AiT demonstrates its superiority over methods such as the Set Transformer, Vision Transformer, and Coordination in various vision tasks.
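The attention bottleneck the paper builds on can be illustrated as a small set of learned latent priors cross-attending to a much larger set of input tokens, so inputs compete to write into the low-capacity workspace; the Hopfield-style associative retrieval step is omitted from this sketch:

```python
import torch
import torch.nn as nn

# A Perceiver/Set-Transformer-style attention bottleneck with learned priors.
class BottleneckAttention(nn.Module):
    def __init__(self, dim=256, n_priors=8, n_heads=4):
        super().__init__()
        self.priors = nn.Parameter(torch.randn(n_priors, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens):                       # tokens: (B, N, dim)
        q = self.priors.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out, _ = self.attn(q, tokens, tokens)        # (B, n_priors, dim)
        return out

tokens = torch.randn(2, 196, 256)                    # e.g. patch embeddings
print(BottleneckAttention()(tokens).shape)           # torch.Size([2, 8, 256])
```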
https://arxiv.org/abs/2309.12862
With the rapid advances in high-throughput sequencing technologies, the focus of survival analysis has shifted from examining clinical indicators to incorporating genomic profiles with pathological images. However, existing methods either directly adopt a straightforward fusion of pathological features and genomic profiles for survival prediction, or take genomic profiles as guidance to integrate the features of pathological images. The former would overlook intrinsic cross-modal correlations. The latter would discard pathological information irrelevant to gene expression. To address these issues, we present a Cross-Modal Translation and Alignment (CMTA) framework to explore the intrinsic cross-modal correlations and transfer potential complementary information. Specifically, we construct two parallel encoder-decoder structures for multi-modal data to integrate intra-modal information and generate cross-modal representation. Taking the generated cross-modal representation to enhance and recalibrate intra-modal representation can significantly improve its discrimination for comprehensive survival analysis. To explore the intrinsic crossmodal correlations, we further design a cross-modal attention module as the information bridge between different modalities to perform cross-modal interactions and transfer complementary information. Our extensive experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods.
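A compact stand-in for the cross-modal attention bridge described above: pathology tokens query genomic tokens and vice versa, so each modality absorbs complementary information before survival prediction. Dimensions and token counts below are placeholders:

```python
import torch
import torch.nn as nn

# Bidirectional cross-modal attention between pathology and genomic tokens.
attn_p2g = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
attn_g2p = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

patho = torch.randn(2, 100, 256)     # patch-level pathology embeddings
gene  = torch.randn(2, 30, 256)      # genomic group embeddings

patho_enhanced, _ = attn_p2g(patho, gene, gene)   # pathology queries genomics
gene_enhanced,  _ = attn_g2p(gene, patho, patho)  # genomics queries pathology
print(patho_enhanced.shape, gene_enhanced.shape)
```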
https://arxiv.org/abs/2309.12855
Monocular depth estimation is a crucial task to measure distance relative to a camera, which is important for applications, such as robot navigation and self-driving. Traditional frame-based methods suffer from performance drops due to the limited dynamic range and motion blur. Therefore, recent works leverage novel event cameras to complement or guide the frame modality via frame-event feature fusion. However, event streams exhibit spatial sparsity, leaving some areas unperceived, especially in regions with marginal light changes. Therefore, direct fusion methods, e.g., RAMNet, often ignore the contribution of the most confident regions of each modality. This leads to structural ambiguity in the modality fusion process, thus degrading the depth estimation performance. In this paper, we propose a novel Spatial Reliability-oriented Fusion Network (SRFNet), that can estimate depth with fine-grained structure at both daytime and nighttime. Our method consists of two key technical components. Firstly, we propose an attention-based interactive fusion (AIF) module that applies spatial priors of events and frames as the initial masks and learns the consensus regions to guide the inter-modal feature fusion. The fused features are then fed back to enhance the frame and event feature learning. Meanwhile, it utilizes an output head to generate a fused mask, which is iteratively updated for learning consensual spatial priors. Secondly, we propose the Reliability-oriented Depth Refinement (RDR) module to estimate dense depth with the fine-grained structure based on the fused features and masks. We evaluate the effectiveness of our method on the synthetic and real-world datasets, which shows that, even without pretraining, our method outperforms the prior methods, e.g., RAMNet, especially in night scenes. Our project homepage: this https URL.
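A simplified view of reliability-guided fusion — the layer and variable names are hypothetical, not SRFNet's: per-pixel weights gate how much each modality contributes before the fused feature is fed back to both branches:

```python
import torch
import torch.nn as nn

# Confidence-weighted fusion of frame and event features (illustrative).
class MaskedFusion(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.mask_head = nn.Conv2d(2 * ch, 2, kernel_size=1)  # frame vs. event weight

    def forward(self, frame_feat, event_feat):
        w = torch.softmax(self.mask_head(torch.cat([frame_feat, event_feat], 1)), dim=1)
        fused = w[:, :1] * frame_feat + w[:, 1:] * event_feat
        return fused, w                                        # fused feature + masks

frame = torch.randn(1, 64, 60, 80)
event = torch.randn(1, 64, 60, 80)
fused, masks = MaskedFusion()(frame, event)
print(fused.shape, masks.shape)
```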
https://arxiv.org/abs/2309.12842
Few-shot learning has made impressive strides in addressing the crucial challenges of recognizing unknown samples from novel classes in target query sets and managing visual shifts between domains. However, existing techniques fall short when it comes to identifying target outliers under domain shifts by learning to reject pseudo-outliers from the source domain, resulting in an incomplete solution to both problems. To address these challenges comprehensively, we propose a novel approach called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) and introduce a meta-learning-based architecture named DAFOS-NET. During training, our model learns a shared and discriminative embedding space while creating a pseudo open-space decision boundary, given a fully-supervised source domain and a label-disjoint few-shot target domain. To enhance data density, we use a pair of conditional adversarial networks with tunable noise variances to augment both domains' closed and pseudo-open spaces. Furthermore, we propose a domain-specific batch-normalized class prototypes alignment strategy to align both domains globally while ensuring class-discriminativeness through novel metric objectives. Our training approach ensures that DAFOS-NET can generalize well to new scenarios in the target domain. We present three benchmarks for DA-FSOS based on the Office-Home, mini-ImageNet/CUB, and DomainNet datasets and demonstrate the efficacy of DAFOS-NET through extensive experimentation.
https://arxiv.org/abs/2309.12814
Background: View planning for the acquisition of cardiac magnetic resonance (CMR) imaging remains a demanding task in clinical practice. Purpose: Existing approaches to its automation relied either on an additional volumetric image not typically acquired in clinical routine, or on laborious manual annotations of cardiac structural landmarks. This work presents a clinic-compatible, annotation-free system for automatic CMR view planning. Methods: The system mines the spatial relationship, more specifically, locates the intersecting lines, between the target planes and source views, and trains deep networks to regress heatmaps defined by distances from the intersecting lines. The intersecting lines are the prescription lines prescribed by the technologists at the time of image acquisition using cardiac landmarks, and are retrospectively identified from the spatial relationship. As the spatial relationship is self-contained in properly stored data, the need for additional manual annotation is eliminated. In addition, the interplay of multiple target planes predicted in a source view is utilized in a stacked hourglass architecture to gradually improve the regression. Then, a multi-view planning strategy is proposed to aggregate information from the predicted heatmaps for all the source views of a target plane, for a globally optimal prescription, mimicking the similar strategy practiced by skilled human prescribers. Results: The experiments include 181 CMR exams. Our system yields a mean angular difference and point-to-plane distance of 5.68 degrees and 3.12 mm, respectively. It not only achieves superior accuracy to existing approaches, including conventional atlas-based and newer deep-learning-based ones, in prescribing the four standard CMR planes, but also demonstrates prescription of the first cardiac-anatomy-oriented plane(s) from the body-oriented scout.
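The regression target described above, in miniature: a heatmap over a source view whose value decays with perpendicular distance from the intersection (prescription) line; the Gaussian decay and its parameters are assumptions for illustration:

```python
import numpy as np

# Distance-to-line heatmap for a line through `point` along `direction`.
def line_distance_heatmap(h, w, point, direction, sigma=10.0):
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.stack([xs - point[0], ys - point[1]], axis=-1).astype(float)
    n = np.array([-direction[1], direction[0]], dtype=float)
    n /= np.linalg.norm(n)
    dist = np.abs(d @ n)                     # perpendicular distance to the line
    return np.exp(-dist**2 / (2 * sigma**2))

heatmap = line_distance_heatmap(256, 256, point=(128, 128), direction=(1, 0.5))
print(heatmap.shape, heatmap.max())
```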
https://arxiv.org/abs/2309.12805
Coral reefs are among the most diverse ecosystems on our planet, and are depended on by hundreds of millions of people. Unfortunately, most coral reefs are existentially threatened by global climate change and local anthropogenic pressures. To better understand the dynamics underlying deterioration of reefs, monitoring at high spatial and temporal resolution is key. However, conventional monitoring methods for quantifying coral cover and species abundance are limited in scale due to the extensive manual labor required. Although computer vision tools have been employed to aid in this process, in particular SfM photogrammetry for 3D mapping and deep neural networks for image segmentation, analysis of the data products creates a bottleneck, effectively limiting their scalability. This paper presents a new paradigm for mapping underwater environments from ego-motion video, unifying 3D mapping systems that use machine learning to adapt to challenging conditions under water, combined with a modern approach for semantic segmentation of images. The method is exemplified on coral reefs in the northern Gulf of Aqaba, Red Sea, demonstrating high-precision 3D semantic mapping at unprecedented scale with significantly reduced required labor costs: a 100 m video transect acquired within 5 minutes of diving with a cheap consumer-grade camera can be fully automatically analyzed within 5 minutes. Our approach significantly scales up coral reef monitoring by taking a leap towards fully automatic analysis of video transects. The method democratizes coral reef transects by reducing the labor, equipment, logistics, and computing cost. This can help to inform conservation policies more efficiently. The underlying computational method of learning-based Structure-from-Motion has broad implications for fast low-cost mapping of underwater environments other than coral reefs.
https://arxiv.org/abs/2309.12804
With the development of neural fields, reconstructing the 3D model of a target object from multi-view inputs has recently attracted increasing attention from the community. Existing methods normally learn a neural field for the whole scene, while how to reconstruct a specific object indicated by users on-the-fly remains under-explored. Considering that the Segment Anything Model (SAM) has shown effectiveness in segmenting any 2D images, in this paper, we propose Neural Object Cloning (NOC), a novel high-quality 3D object reconstruction method, which leverages the benefits of both the neural field and SAM from two aspects. Firstly, to separate the target object from the scene, we propose a novel strategy to lift the multi-view 2D segmentation masks of SAM into a unified 3D variation field. The 3D variation field is then projected into 2D space to generate new prompts for SAM. This process iterates until convergence, separating the target object from the scene. Then, apart from 2D masks, we further lift the 2D features of the SAM encoder into a 3D SAM field in order to improve the reconstruction quality of the target object. NOC lifts the 2D masks and features of SAM into the 3D neural field for high-quality target object reconstruction. We conduct detailed experiments on several benchmark datasets to demonstrate the advantages of our method. The code will be released.
https://arxiv.org/abs/2309.12790