Modeling the shape of garments has received much attention, but most existing approaches assume the garments to be worn by someone, which constrains the range of shapes they can assume. In this work, we address shape recovery when garments are being manipulated instead of worn, which gives rise to an even larger range of possible shapes. To this end, we leverage the implicit sewing patterns (ISP) model for garment modeling and extend it with a diffusion-based deformation prior that can represent these shapes. To recover 3D garment shapes from incomplete 3D point clouds acquired while the garment is folded, we map the points to UV space, in which our priors are learned, to produce partial UV maps, and then fit the priors to recover complete UV maps and 2D-to-3D mappings. Experimental results demonstrate the superior reconstruction accuracy of our method compared to previous ones, especially when dealing with the large non-rigid deformations that arise during manipulation.
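As an illustration of the fitting stage, the sketch below completes a partial UV map with a generic diffusion prior via inpainting-style guided sampling; `denoiser`, the DDIM-style schedule, and all names are stand-ins under stated assumptions, not the paper's actual components.

```python
# Hypothetical sketch: completing a partial UV map with a diffusion prior,
# in the spirit of inpainting-style guided sampling.
import torch

def complete_uv(partial_uv, known_mask, denoiser, alphas_cumprod, steps=50):
    """partial_uv: (C,H,W) observed UV-space values; known_mask: (1,H,W) in {0,1}."""
    x = torch.randn_like(partial_uv)                          # start from noise
    ts = torch.linspace(len(alphas_cumprod) - 1, 0, steps).long()
    for i in range(len(ts) - 1):
        t, t_prev = ts[i], ts[i + 1]
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
        eps = denoiser(x, t)                                  # predicted noise
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # clean-map estimate
        # keep the observed UV pixels, let the prior fill the missing region
        x0_hat = known_mask * partial_uv + (1 - known_mask) * x0_hat
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * eps  # DDIM-style step
    return x
```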
https://arxiv.org/abs/2405.10934
In recent years, various large foundation models have been proposed for image segmentation. These models are often trained on large amounts of data corresponding to general computer vision tasks. Hence, they do not perform well on medical data. There have been some attempts in the literature to perform parameter-efficient fine-tuning of such foundation models for medical image segmentation. However, these approaches assume that all the parameters of the model are available for adaptation. In many cases, these models are released as APIs or blackboxes, with no or limited access to the model parameters and data. In addition, fine-tuning methods also require a significant amount of compute, which may not be available for the downstream task. At the same time, medical data cannot be shared with third-party agents for fine-tuning due to privacy reasons. To tackle these challenges, we pioneer a blackbox adaptation technique for prompted medical image segmentation, called BAPS. BAPS has two components - (i) an Image-Prompt decoder (IP decoder) module that generates visual prompts given an image and a prompt, and (ii) a zero-order optimization (ZOO) method, called SPSA-GC, that is used to update the IP decoder without the need for backpropagating through the foundation model. Thus, our method does not require any knowledge about the foundation model's weights or gradients. We test BAPS on four different modalities and show that our method can improve the original model's performance by around 4%.
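SPSA itself is a classical zeroth-order scheme, so the kind of update BAPS relies on can be sketched as below; SPSA-GC additionally applies a momentum-style gradient correction, which is omitted here, and `loss_fn` stands for an end-to-end query of the blackbox segmentation model.

```python
# A minimal SPSA sketch (the ZOO family BAPS builds on).
import numpy as np

def spsa_step(theta, loss_fn, a=0.01, c=0.01, rng=np.random.default_rng()):
    """One zeroth-order update: estimate the gradient from two loss queries."""
    delta = rng.choice([-1.0, 1.0], size=theta.shape)   # Rademacher perturbation
    # with +-1 entries, elementwise 1/delta equals delta, so multiplying works
    g_hat = (loss_fn(theta + c * delta) - loss_fn(theta - c * delta)) / (2 * c) * delta
    return theta - a * g_hat                             # descend the estimate

# Usage: theta parameterizes the IP decoder; loss_fn queries the blackbox model
# end-to-end, so no gradients from the foundation model are ever needed.
```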
https://arxiv.org/abs/2405.10913
Most existing methods rely on complex models to predict scene depth with high accuracy, resulting in slow inference that is not conducive to deployment. To better balance precision and speed, we first design SmallDepth based on sparsity. Second, to enhance the feature representation ability of SmallDepth during training while keeping its complexity unchanged during inference, we propose an equivalent transformation module (ETM). Third, to improve the ability of each layer, for a fixed SmallDepth, to perceive different contextual information and to improve the robustness of SmallDepth to left-right and illumination changes, we propose a pyramid loss. Fourth, to further improve the accuracy of SmallDepth, we use the proposed function approximation loss (APX) to transfer knowledge to SmallDepth from a pretrained HQDecv2, which was obtained by optimizing the previous HQDec to address grid artifacts in some regions. Extensive experiments demonstrate that each proposed component improves the precision of SmallDepth without changing its complexity during inference, and that the developed approach achieves state-of-the-art results on KITTI at an inference speed of more than 500 frames per second with approximately 2M parameters. The code and models will be publicly available at this https URL.
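The abstract does not spell out the exact form of APX, so the sketch below stands in a plain output-matching distillation objective to illustrate the teacher-student transfer from the frozen HQDecv2 to SmallDepth; all names are placeholders.

```python
# Hedged sketch of knowledge transfer from a frozen teacher (HQDecv2) to
# SmallDepth; a plain output-matching objective stands in for the APX loss.
import torch
import torch.nn.functional as F

def apx_like_loss(student_depth, teacher_depth):
    """Match the student's depth prediction to the frozen teacher's."""
    return F.l1_loss(student_depth, teacher_depth.detach())

def distill_step(student, teacher, images, optimizer):
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)              # pretrained teacher prediction
    loss = apx_like_loss(student(images), target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```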
https://arxiv.org/abs/2405.10885
The goal of image registration is to establish spatial correspondence between two or more images, traditionally through dense displacement fields (DDFs) or parametric transformations (e.g., rigid, affine, and splines). Rethinking the existing paradigms of achieving alignment via spatial transformations, we uncover an alternative but more intuitive correspondence representation: a set of corresponding regions-of-interest (ROI) pairs, which we demonstrate to have representational capability comparable to other correspondence representations. Further, it is neither necessary nor sufficient for these ROIs to hold specific anatomical or semantic significance. In turn, we formulate image registration as searching for the same set of corresponding ROIs from both moving and fixed images - in other words, two multi-class segmentation tasks on a pair of images. For a general-purpose and practical implementation, we integrate the segment anything model (SAM) into our proposed algorithms, resulting in a SAM-enabled registration (SAMReg) that does not require any training data, gradient-based fine-tuning or engineered prompts. We experimentally show that the proposed SAMReg is capable of segmenting and matching multiple ROI pairs, which establish sufficiently accurate correspondences, in three clinical applications of registering prostate MR, cardiac MR and abdominal CT images. Based on metrics including Dice and target registration errors on anatomical structures, the proposed registration outperforms both intensity-based iterative algorithms and DDF-predicting learning-based networks, even yielding competitive performance with weakly-supervised registration, which requires fully-segmented training data.
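To make the ROI-pair idea concrete, the sketch below pairs SAM-style masks from the moving and fixed images using a simple centroid-and-scale descriptor with greedy nearest matching; the paper's actual pairing criterion is not specified in the abstract and may well differ.

```python
# Illustrative sketch of registration-as-ROI-matching: pair masks from the two
# images by a simple shape/position descriptor, then read off correspondences.
import numpy as np

def descriptor(mask):
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean(), mask.sum() ** 0.5])  # centroid + scale

def match_rois(moving_masks, fixed_masks):
    pairs, used = [], set()
    for i, m in enumerate(moving_masks):
        d = [np.linalg.norm(descriptor(m) - descriptor(f)) if j not in used
             else np.inf for j, f in enumerate(fixed_masks)]
        j = int(np.argmin(d))                # greedy nearest unmatched fixed ROI
        used.add(j)
        pairs.append((i, j))                 # each pair gives one region constraint
    return pairs
```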
https://arxiv.org/abs/2405.10879
Glioblastoma is the most common primary adult brain tumor, with a grim prognosis - median survival of 12-18 months following treatment, and 4 months otherwise. Glioblastoma is widely infiltrative in the cerebral hemispheres and well-defined by heterogeneous molecular and micro-environmental histopathologic profiles, which pose a major obstacle in treatment. Correctly diagnosing these tumors and assessing their heterogeneity is crucial for choosing the precise treatment and potentially enhancing patient survival rates. In the gold-standard histopathology-based approach to tumor diagnosis, detecting various morpho-pathological features of distinct histology throughout digitized tissue sections is crucial. Such "features" include the presence of cellular tumor, geographic necrosis, pseudopalisading necrosis, areas abundant in microvascular proliferation, infiltration into the cortex, wide extension in subcortical white matter, leptomeningeal infiltration, regions dense with macrophages, and the presence of perivascular or scattered lymphocytes. With these features in mind and building upon the main aim of the BraTS Cluster of Challenges (this https URL), the goal of the BraTS-Path challenge is to provide a systematically prepared comprehensive dataset and a benchmarking environment to develop and fairly compare deep-learning models capable of identifying tumor sub-regions of distinct histologic profile. These models aim to further our understanding of the disease and assist in the diagnosis and grading of conditions in a consistent manner.
https://arxiv.org/abs/2405.10871
Objectives: This work aims to explore the impact of multicenter data heterogeneity on deep learning brain metastases (BM) autosegmentation performance, and to assess the efficacy of an incremental transfer learning technique, namely learning without forgetting (LWF), to improve model generalizability without sharing raw data. Materials and methods: A total of six BM datasets from University Hospital Erlangen (UKER), University Hospital Zurich (USZ), Stanford, UCSF, NYU and the BraTS Challenge 2023 on BM segmentation were used for this evaluation. First, the multicenter performance of a convolutional neural network (DeepMedic) for BM autosegmentation was established for exclusive single-center training and for training on pooled data, respectively. Subsequently, bilateral collaboration was evaluated, where a UKER-pretrained model is shared with another center for further training using transfer learning (TL), either with or without LWF. Results: For single-center training, average F1 scores of BM detection range from 0.625 (NYU) to 0.876 (UKER) on respective single-center test data. Mixed multicenter training notably improves F1 scores at Stanford and NYU, with negligible improvement at other centers. When the UKER-pretrained model is applied to USZ, LWF achieves a higher average F1 score (0.839) than naive TL (0.570) and single-center training (0.688) on combined UKER and USZ test data. Naive TL improves sensitivity and contouring accuracy, but compromises precision. Conversely, LWF demonstrates commendable sensitivity, precision and contouring accuracy. When applied to Stanford, similar performance was observed. Conclusion: Data heterogeneity results in varying performance in BM autosegmentation, posing challenges to model generalizability. LWF is a promising approach to peer-to-peer privacy-preserving model training.
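A minimal sketch of the LWF objective in this bilateral setting is given below, assuming a PyTorch segmentation model; the distillation form and the weight `lam` are assumptions, not the paper's exact choices.

```python
# Minimal learning-without-forgetting sketch: the new-site loss is augmented
# with a distillation term that keeps the model close to its own pre-shared
# (e.g., UKER-trained) predictions, so no raw data leaves either center.
import torch
import torch.nn.functional as F

def lwf_loss(model, old_model, x, y, lam=1.0):
    logits = model(x)                                   # (N, C, ...) voxel logits
    with torch.no_grad():
        old_logits = old_model(x)                       # frozen pretrained copy
    task = F.cross_entropy(logits, y)                   # learn the new center's labels
    retain = F.kl_div(F.log_softmax(logits, dim=1),
                      F.softmax(old_logits, dim=1),
                      reduction="batchmean")            # do not forget the old center
    return task + lam * retain
```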
https://arxiv.org/abs/2405.10870
This paper presents a novel approach to the digital signing of electronic documents through the use of a camera-based interaction system, single-finger tracking for signature recognition, and hand gestures that execute multiple commands. The proposed solution, referred to as "Air Signature," involves writing the signature in front of the camera, rather than relying on traditional methods such as mouse drawing or physically signing on paper and showing it to a web camera. The goal is to develop a state-of-the-art method for detecting and tracking gestures and objects in real time. The proposed methods include applying existing gesture recognition and object tracking systems, improving accuracy through smoothing and line drawing, and maintaining continuity during fast finger movements. An evaluation of the fingertip detection, sketching, and overall signing process is performed to assess the effectiveness of the proposed solution. The secondary objective of this research is to develop a model that can effectively recognize the unique signature of a user. This type of signature can be verified by neural cores that analyze the movement, speed, and stroke pixels of the signing in real time. The neural cores use machine learning algorithms to match air signatures to the individual's stored signatures, providing a secure and efficient method of verification. Our proposed system does not require sensors or any hardware other than the camera.
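As one plausible instance of the smoothing step, the sketch below applies an exponential moving average to the fingertip trajectory to keep strokes continuous during fast movements; the paper's exact filter is not specified in the abstract.

```python
# A plausible smoothing step for the fingertip trajectory: an exponential
# moving average keeps drawn strokes continuous during fast finger movements.
def smooth_track(points, alpha=0.3):
    """points: list of (x, y) fingertip detections; returns a smoothed stroke."""
    if not points:
        return []
    smoothed = [points[0]]
    for x, y in points[1:]:
        px, py = smoothed[-1]                      # previous smoothed position
        smoothed.append((alpha * x + (1 - alpha) * px,
                         alpha * y + (1 - alpha) * py))
    return smoothed
```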
https://arxiv.org/abs/2405.10868
Recent advancements in text-to-image generation using diffusion models have significantly improved the quality of generated images and expanded the ability to depict a wide range of objects. However, ensuring that these models adhere closely to the text prompts remains a considerable challenge. This issue is particularly pronounced when trying to generate photorealistic images of humans. Without significant prompt engineering efforts, models often produce unrealistic images and typically fail to incorporate the full extent of the prompt information. This limitation can be largely attributed to the nature of the captions accompanying the images used in training large-scale diffusion models, which typically prioritize contextual information over details related to the person's appearance. In this paper, we address this issue by introducing a training-free pipeline designed to generate accurate appearance descriptions from images of people. We apply this method to create approximately 250,000 captions for publicly available face datasets. We then use these synthetic captions to fine-tune a text-to-image diffusion model. Our results demonstrate that this approach significantly improves the model's ability to generate high-quality, realistic human faces and enhances adherence to the given prompts, compared to the baseline model. We share our synthetic captions, pretrained checkpoints and training code.
https://arxiv.org/abs/2405.10864
Increasing demands on medical imaging departments are taking a toll on the radiologist's ability to deliver timely and accurate reports. Recent technological advances in artificial intelligence have demonstrated great potential for automatic radiology report generation (ARRG), sparking an explosion of research. This survey paper conducts a methodological review of contemporary ARRG approaches by way of (i) assessing datasets based on characteristics, such as availability, size, and adoption rate, (ii) examining deep learning training methods, such as contrastive learning and reinforcement learning, (iii) exploring state-of-the-art model architectures, including variations of CNN and transformer models, (iv) outlining techniques integrating clinical knowledge through multimodal inputs and knowledge graphs, and (v) scrutinising current model evaluation techniques, including commonly applied NLP metrics and qualitative clinical reviews. Furthermore, the quantitative results of the reviewed models are analysed, where the top performing models are examined to seek further insights. Finally, potential new directions are highlighted, with the adoption of additional datasets from other radiological modalities and improved evaluation methods predicted as important areas of future development.
https://arxiv.org/abs/2405.10842
Background and purpose: Deep Learning (DL) has been widely explored for Organs at Risk (OARs) segmentation; however, most studies have focused on a single modality, either CT or MRI, not both simultaneously. This study presents a high-performing DL pipeline for segmentation of 30 OARs from MRI and CT scans of Head and Neck (H&N) cancer patients. Materials and methods: Paired CT and MRI-T1 images from 42 H&N cancer patients, alongside annotations for 30 OARs from the H&N OAR CT & MR segmentation challenge dataset, were used to develop a segmentation pipeline. After cropping irrelevant regions, rigid followed by non-rigid registration of CT and MRI volumes was performed. Two versions of the CT volume, representing soft tissues and bone anatomy, were stacked with the MRI volume and used as input to an nnU-Net pipeline. Modality dropout was used during training to force the model to learn from the different modalities. Segmentation masks were predicted with the trained model for an independent set of 14 new patients. The mean Dice Score (DS) and Hausdorff Distance (HD) were calculated for each OAR across these patients to evaluate the pipeline. Results: This resulted in an overall mean DS and HD of 0.777 ± 0.118 and 3.455 ± 1.679, respectively, establishing the state-of-the-art (SOTA) for this challenge at the time of submission. Conclusion: The proposed pipeline achieved the best DS and HD among all participants of the H&N OAR CT and MR segmentation challenge and sets a new SOTA for automated segmentation of H&N OARs.
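Modality dropout, as described, can be sketched as randomly zeroing either the CT-derived channels or the MRI channel of each training sample; the channel layout follows the stacking above, while the dropout probabilities are assumed.

```python
# Sketch of modality dropout: with some probability, zero out either the CT
# channels or the MRI channel so the network cannot rely on one modality.
import torch

def modality_dropout(x, p=0.3):
    """x: (N, 3, D, H, W) = [CT soft tissue, CT bone, MRI-T1] stacked volumes."""
    for n in range(x.shape[0]):
        r = torch.rand(1).item()
        if r < p / 2:
            x[n, 0:2] = 0.0      # drop both CT-derived channels
        elif r < p:
            x[n, 2:3] = 0.0      # drop the MRI channel
    return x
```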
https://arxiv.org/abs/2405.10833
Spatio-temporal action detection (STAD) is an important fine-grained video understanding task. Current methods require box and label supervision for all action classes in advance. However, in real-world applications, it is very likely to come across new action classes not seen in training, because the action category space is large and hard to enumerate. Also, the cost of data annotation and model training for new classes is extremely high for traditional methods, as we need to perform detailed box annotations and re-train the whole network from scratch. In this paper, we propose a new challenging setting by performing open-vocabulary STAD to better mimic the situation of action detection in an open world. Open-vocabulary spatio-temporal action detection (OV-STAD) requires training a model on a limited set of base classes with box and label supervision, which is expected to yield good generalization performance on novel action classes. For OV-STAD, we build two benchmarks based on the existing STAD datasets and propose a simple but effective method based on pretrained video-language models (VLM). To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on localized video region-text pairs. This customized fine-tuning endows the VLM with better motion understanding, thus contributing to a more accurate alignment between video regions and texts. Fusing the local region feature with the global video feature before alignment is adopted to further improve action detection performance by providing global context. Our method achieves promising performance on novel classes.
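The fusion-then-alignment step can be sketched as below: each region feature is mixed with the global video feature for context, then scored against the VLM's class-text embeddings by cosine similarity; the mixing weight `alpha` is an assumed hyperparameter.

```python
# Sketch of fusion-then-alignment: region features gain global context before
# being matched to text embeddings; all shapes and the weight are assumptions.
import torch
import torch.nn.functional as F

def score_regions(region_feats, global_feat, text_embeds, alpha=0.5):
    """region_feats: (R,D); global_feat: (D,); text_embeds: (C,D) from the VLM."""
    fused = alpha * region_feats + (1 - alpha) * global_feat   # add global context
    fused = F.normalize(fused, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    return fused @ text.t()   # (R, C) similarities; argmax over C labels each region
```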
https://arxiv.org/abs/2405.10832
Earlier diagnosis of leukemia can save thousands of lives annually. The prognosis of leukemia is challenging without the morphological information of White Blood Cells (WBC) and relies on the accessibility of expensive microscopes and the availability of hematologists to analyze Peripheral Blood Samples (PBS). Deep-learning-based methods can be employed to assist hematologists. However, these algorithms require a large amount of labeled data, which is not readily available. To overcome this limitation, we have acquired a realistic, generalized, and large dataset. To collect this comprehensive dataset for real-world applications, two microscopes from two different cost spectra (high-cost HCM and low-cost LCM) are used for dataset capturing at three magnifications (100x, 40x, 10x) through different sensors (a high-end camera for the HCM, a middle-level camera for the LCM, and a mobile-phone camera for both). The high-end camera is 47 times more expensive than the middle-level camera, and the HCM is 17 times more expensive than the LCM. In this collection, using the HCM at high resolution (100x), experienced hematologists annotated 10.3k WBCs (14 types) and artifacts, with 55k morphological labels (Cell Size, Nuclear Chromatin, Nuclear Shape, etc.) from 2.4k images of several PBS leukemia patients. These annotations are then transferred to the other two HCM magnifications, the three LCM magnifications, and the images captured by each camera. Along with the LeukemiaAttri dataset, we provide baselines over multiple object detectors and Unsupervised Domain Adaptation (UDA) strategies, along with morphological-information-based attribute prediction. The dataset will be publicly available after publication to facilitate research in this direction.
https://arxiv.org/abs/2405.10803
Convolutional neural networks (CNNs) are among the most widely used machine learning models for computer vision tasks, such as image classification. To improve the efficiency of CNNs, many CNN compression approaches have been developed. Low-rank methods approximate the original convolutional kernel with a sequence of smaller convolutional kernels, which leads to reduced storage and time complexities. In this study, we propose a novel low-rank CNN compression method based on reduced storage direct tensor ring decomposition (RSDTR). The proposed method offers higher circular mode permutation flexibility and is characterized by large parameter and FLOPS compression rates, while preserving a good classification accuracy of the compressed network. Experiments performed on the CIFAR-10 and ImageNet datasets clearly demonstrate the efficiency of RSDTR in comparison to other state-of-the-art CNN compression approaches.
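The sketch below shows the simplest member of the low-rank family the paper builds on, replacing one convolution with a chain of smaller ones; RSDTR itself uses a tensor-ring factorization with a circular mode permutation, which this 1x1 -> kxk -> 1x1 chain only approximates in spirit.

```python
# Illustration of the general low-rank idea: replace one convolution with a
# sequence of smaller ones. This is the simplest instance of the family, not
# the tensor-ring structure RSDTR actually uses.
import torch.nn as nn

def low_rank_conv(c_in, c_out, k, rank):
    return nn.Sequential(
        nn.Conv2d(c_in, rank, kernel_size=1, bias=False),           # compress channels
        nn.Conv2d(rank, rank, kernel_size=k, padding=k // 2, bias=False),
        nn.Conv2d(rank, c_out, kernel_size=1, bias=False),          # expand back
    )

# Parameter count drops from c_in*c_out*k*k to rank*(c_in + c_out + rank*k*k).
```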
https://arxiv.org/abs/2405.10802
In this paper, we aim to reconstruct an n-dimensional real vector from m phaseless measurements corrupted by additive noise. We extend the noiseless framework developed in [15], based on mirror descent (or Bregman gradient descent), to deal with noisy measurements, and prove that the procedure is stable to (small enough) additive noise. In the deterministic case, we show that mirror descent converges to a critical point of the phase retrieval problem, and if the algorithm is well initialized and the noise is small enough, the critical point is near the true vector up to a global sign change. When the measurements are i.i.d. Gaussian and the signal-to-noise ratio is large enough, we provide global convergence guarantees ensuring that, with high probability, mirror descent converges to a global minimizer near the true vector (up to a global sign change) as soon as the number of measurements m is large enough. The sample complexity bound can be improved if a spectral method is used to provide a good initial guess. We complement our theoretical study with several numerical results showing that mirror descent is both a computationally and statistically efficient scheme to solve the phase retrieval problem.
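For reference, the measurement model and the mirror-descent iteration can be written as follows, with the entropy commonly used in this line of work; the paper's exact constants may differ.

```latex
% Measurement model, least-squares objective, and Bregman/mirror iteration.
\[
  y_i = \langle a_i, x \rangle^2 + \varepsilon_i, \quad i = 1,\dots,m,
  \qquad
  f(z) = \frac{1}{4m}\sum_{i=1}^m \left( \langle a_i, z\rangle^2 - y_i \right)^2 ,
\]
\[
  h(z) = \frac{1}{4}\|z\|_2^4 + \frac{1}{2}\|z\|_2^2,
  \qquad
  z_{k+1} = \nabla h^{*}\!\left( \nabla h(z_k) - \gamma \nabla f(z_k) \right),
\]
% h generates the Bregman geometry, gamma is the step size, and stability is
% with respect to the additive noise eps.
```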
https://arxiv.org/abs/2405.10754
Diffusion models have become a successful approach for solving various image inverse problems by providing a powerful diffusion prior. Many studies have tried to incorporate the measurement into the diffusion process through score function replacement, matrix decomposition, or optimization algorithms, but it is hard to balance data consistency and realness. The slow sampling speed is also a main obstacle to wide application. To address these challenges, we propose Deep Data Consistency (DDC), which updates the data consistency step with a deep learning model when solving inverse problems with diffusion models. By analyzing existing methods, a variational-bound training objective is used to maximize the conditional posterior and reduce its impact on the diffusion process. In comparison with state-of-the-art methods on linear and non-linear tasks, DDC demonstrates outstanding performance on both similarity and realness metrics, generating high-quality solutions with only 5 inference steps in 0.77 seconds on average. In addition, the robustness of DDC is well illustrated in experiments across datasets, under large noise, and in its capacity to solve multiple tasks with a single pre-trained model.
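Schematically, DDC's role inside a short sampling loop can be sketched as below, where `dc_net` is the learned data-consistency model and `forward_op` a hypothetical measurement operator; the update rule shown is a generic DDIM-style skeleton, not the paper's exact procedure.

```python
# Schematic: swap the usual optimization-based data-consistency step for a
# learned one (dc_net) inside a few-step sampling loop.
import torch

@torch.no_grad()
def ddc_sample(y, forward_op, denoiser, dc_net, alphas_cumprod, steps=5):
    x = torch.randn_like(forward_op.adjoint(y))          # init in image space
    ts = torch.linspace(len(alphas_cumprod) - 1, 0, steps).long()
    for t in ts:
        a_t = alphas_cumprod[t]
        eps = denoiser(x, t)
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()   # clean estimate
        x0 = dc_net(x0, y)                               # learned data consistency
        x = a_t.sqrt() * x0 + (1 - a_t).sqrt() * eps     # back onto the trajectory
    return x
```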
https://arxiv.org/abs/2405.10748
In recent years, people have increasingly used AI to help them with their problems by asking questions on different topics. One such topic is software-related and programming questions. In this work, we focus on questions that require understanding images in addition to the question itself. We introduce the StackOverflowVQA dataset, which includes questions from StackOverflow that have one or more accompanying images. This is the first VQA dataset that focuses on software-related questions and contains multiple human-generated full-sentence answers. Additionally, we provide a baseline for answering the questions with respect to the images in the introduced dataset using the GIT model. All versions of the dataset are available at this https URL.
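A hedged sketch of running a GIT baseline on one dataset sample via HuggingFace Transformers is shown below; the public VQA-tuned checkpoint and the file name are assumptions, not necessarily what the paper used.

```python
# Running a GIT-style VQA baseline on one image-question pair.
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-textvqa")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-textvqa")

image = Image.open("so_question_screenshot.png")   # image attached to the question
question = "Why does this Python snippet raise a KeyError?"

pixel_values = processor(images=image, return_tensors="pt").pixel_values
input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = torch.tensor([[processor.tokenizer.cls_token_id] + input_ids])

out = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```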
https://arxiv.org/abs/2405.10736
Modern diffusion MRI sequences commonly acquire a large number of volumes with diffusion sensitization gradients of differing strengths or directions. Such sequences rely on echo-planar imaging (EPI) to achieve reasonable scan duration. However, EPI is vulnerable to off-resonance effects, leading to tissue-susceptibility and eddy-current-induced distortions. The latter is particularly problematic because it causes misalignment between volumes, disrupting downstream modelling and analysis. The essential correction of eddy distortions is typically done post-acquisition, with image registration. However, this is non-trivial because correspondence between volumes can be severely disrupted due to volume-specific signal attenuations induced by varying directions and strengths of the applied gradients. This challenge has been successfully addressed by the popular FSL Eddy tool, but at considerable computational cost. We propose an alternative approach, leveraging recent advances in image processing enabled by deep learning (DL). It consists of two convolutional neural networks: 1) an image translator to restore correspondence between images; 2) a registration model to align the translated images. Results demonstrate distortion estimates comparable to FSL Eddy, while requiring only modest training sample sizes. This work, to the best of our knowledge, is the first to tackle this problem with deep learning. Together with recently developed DL-based susceptibility correction techniques, it paves the way for real-time preprocessing of diffusion MRI, facilitating its wider uptake in the clinic.
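The two-network design can be sketched as the following composition, with the translator, registration network, and warping module all left as placeholders.

```python
# Schematic of the two-network pipeline: a translator first removes contrast
# differences between diffusion-weighted volumes, then a registration network
# predicts the eddy-distortion field; both modules are placeholders.
import torch.nn as nn

class EddyCorrector(nn.Module):
    def __init__(self, translator: nn.Module, reg_net: nn.Module, warp):
        super().__init__()
        self.translator = translator   # restores correspondence between volumes
        self.reg_net = reg_net         # predicts the displacement field
        self.warp = warp               # spatial transformer applying the field

    def forward(self, moving, reference):
        moving_t = self.translator(moving)          # contrast-matched volume
        field = self.reg_net(moving_t, reference)   # eddy distortion estimate
        return self.warp(moving, field), field      # corrected original volume
```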
https://arxiv.org/abs/2405.10723
In this paper, we introduce the first comprehensive multilingual sign language dataset, named Prompt2Sign, which is built from public data covering American Sign Language (ASL) and seven other sign languages. Our dataset transforms a vast array of videos into a streamlined, model-friendly format, optimized for training with translation models like seq2seq and text2text. Building on this new dataset, we propose SignLLM, the first multilingual Sign Language Production (SLP) model, which includes two novel multilingual SLP modes that allow for the generation of sign language gestures from input text or prompts. Both modes can use a new loss and a module based on reinforcement learning, which accelerates training by enhancing the model's capability to autonomously sample high-quality data. We present benchmark results for SignLLM, demonstrating that our model achieves state-of-the-art performance on SLP tasks across eight sign languages.
https://arxiv.org/abs/2405.10718
Referring image segmentation (RIS) aims to locate the particular region corresponding to a language expression. Existing methods incorporate features from different modalities in a bottom-up manner. This design may admit some unnecessary image-text pairs, which leads to inaccurate segmentation masks. In this paper, we propose a referring image segmentation method called HARIS, which introduces a Human-Like Attention mechanism and uses a parameter-efficient fine-tuning (PEFT) framework. To be specific, the Human-Like Attention receives a feedback signal from multi-modal features, which makes the network center on the specific objects and discard the irrelevant image-text pairs. Besides, we introduce the PEFT framework to preserve the zero-shot ability of the pre-trained encoders. Extensive experiments on three widely used RIS benchmarks and the PhraseCut dataset demonstrate that our method achieves state-of-the-art performance and a strong zero-shot ability.
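On the PEFT side, one standard way to realize such a framework is LoRA adapters over a frozen pretrained encoder, sketched below with the `peft` library; HARIS's exact adapter placement is not specified in the abstract, and a HuggingFace-style encoder (e.g., CLIP) is assumed.

```python
# Hedged illustration of the PEFT idea: LoRA adapters are trained while the
# base encoder weights stay frozen, preserving its zero-shot ability.
from peft import LoraConfig, get_peft_model

def wrap_encoder(encoder, rank=8):
    cfg = LoraConfig(r=rank, lora_alpha=16,
                     target_modules=["q_proj", "v_proj"],  # assumed attention layers
                     lora_dropout=0.05)
    return get_peft_model(encoder, cfg)   # only the small adapters receive gradients
```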
https://arxiv.org/abs/2405.10707
Digital Subtraction Angiography (DSA) is one of the gold standards in vascular disease diagnosis. With the help of contrast agent, time-resolved 2D DSA images deliver comprehensive insights into blood flow information and can be utilized to reconstruct 3D vessel structures. Current commercial DSA systems typically demand hundreds of scanning views to perform reconstruction, resulting in substantial radiation exposure. However, sparse-view DSA reconstruction, aimed at reducing radiation dosage, is still underexplored in the research community. The dynamic blood flow and the insufficient input of sparse-view DSA images present significant challenges to the 3D vessel reconstruction task. In this study, we propose to use a time-agnostic vessel probability field to solve this problem effectively. Our approach, termed vessel probability guided attenuation learning, represents DSA imaging as a complementary weighted combination of static and dynamic attenuation fields, with the weights derived from the vessel probability field. Functioning as a dynamic mask, the vessel probability provides proper gradients for both static and dynamic fields, adaptive to different scene types. This mechanism facilitates a self-supervised decomposition between static backgrounds and dynamic contrast agent flow, and significantly improves reconstruction quality. Our model is trained by minimizing the disparity between synthesized projections and real captured DSA images. We further employ two training strategies to improve reconstruction quality: (1) coarse-to-fine progressive training to achieve better geometry, and (2) a temporally perturbed rendering loss to enforce temporal consistency. Experimental results demonstrate superior quality in both 3D vessel reconstruction and 2D view synthesis.
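Written out, the complementary weighting described above takes the following form; the symbols are ours, with $p$ the time-agnostic vessel probability and $\mu_s$, $\mu_d$ the static and dynamic attenuation fields.

```latex
% Complementary weighted combination of static and dynamic attenuation at
% point x and time t, gated by the vessel probability p(x).
\[
  \mu(\mathbf{x}, t) \;=\; p(\mathbf{x})\, \mu_{d}(\mathbf{x}, t)
  \;+\; \bigl(1 - p(\mathbf{x})\bigr)\, \mu_{s}(\mathbf{x}).
\]
% Gradients reaching mu_d are scaled by p and those reaching mu_s by 1-p,
% which drives the self-supervised split between static background and
% flowing contrast agent.
```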
https://arxiv.org/abs/2405.10705