Encoder-decoder networks have become a popular choice for various medical image segmentation tasks. When trained with a standard loss function, these networks are not explicitly enforced to preserve the shape integrity of an object in an image. However, this ability is important for obtaining more accurate results, especially when there is low contrast between the object and its surroundings. In response to this issue, this work introduces a new shape-aware loss function, which we name FourierLoss. This loss function relies on quantifying the shape dissimilarity between the ground truth and the predicted segmentation maps through the Fourier descriptors calculated on their objects, and penalizing this dissimilarity in network training. Unlike previous studies, FourierLoss offers an adaptive loss function with trainable hyperparameters that control how much importance is given to the level of shape detail the network is enforced to learn during training. This control is achieved by the proposed adaptive loss update mechanism, which learns the hyperparameters end-to-end, simultaneously with the network weights, by backpropagation. As a result of using this mechanism, the network can dynamically shift its attention from learning the general outline of an object to learning the details of its contour points, or vice versa, in different training epochs. Working on 2879 computed tomography images of 93 subjects, our experiments revealed that the proposed adaptive shape-aware loss function led to statistically significantly better results for liver segmentation, compared to its counterparts.
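As a rough sketch of the core idea (our illustration, not the authors' implementation), a contour can be mapped to a complex sequence whose FFT coefficients serve as Fourier descriptors; a shape term then penalizes the weighted distance between the descriptors of the predicted and ground-truth objects, with the weights `w` standing in for the trainable hyperparameters:

```python
import numpy as np

def fourier_descriptors(contour, k=16):
    """contour: (N, 2) ordered boundary points, resampled to a fixed N."""
    z = contour[:, 0] + 1j * contour[:, 1]        # complex contour representation
    coeffs = np.fft.fft(z)
    coeffs[0] = 0.0                               # drop DC term: translation invariance
    coeffs = coeffs / (np.abs(coeffs[1]) + 1e-8)  # normalize by first harmonic: scale invariance
    return np.abs(coeffs[1:k + 1])                # magnitudes: rotation/start-point invariance

def fourier_shape_loss(pred_contour, gt_contour, weights=None, k=16):
    """Weighted L2 distance between descriptor vectors; `weights` plays the
    role of the trainable per-harmonic hyperparameters described above."""
    d_pred = fourier_descriptors(pred_contour, k)
    d_gt = fourier_descriptors(gt_contour, k)
    w = np.ones(k) if weights is None else weights
    return np.sum(w * (d_pred - d_gt) ** 2)
```

Small k emphasizes the general outline; larger k (or larger weights on high harmonics) emphasizes fine contour detail, which is the trade-off the adaptive mechanism learns to balance.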
https://arxiv.org/abs/2309.12106
Unsupervised Video Object Segmentation (VOS) aims at identifying the contours of primary foreground objects in videos without any prior knowledge. However, previous methods do not fully exploit spatial-temporal context and fail to tackle this challenging task in real time. This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for the unsupervised VOS task from a holistic view. Specifically, LSTA consists of two dominant modules, i.e., Long Temporal Memory and Short Temporal Attention. The former captures the long-term global pixel relations of the past frames and the current frame, modeling constantly present objects by encoding appearance patterns. Meanwhile, the latter reveals the short-term local pixel relations of one nearby frame and the current frame, modeling moving objects by encoding motion patterns. To speed up inference, the efficient projection and the locality-based sliding window are adopted to achieve nearly linear time complexity for the two lightweight modules, respectively. Extensive empirical studies on several benchmarks have demonstrated the promising performance and high efficiency of the proposed method.
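For intuition, a minimal 1D sketch of locality-based sliding-window attention (our simplification; the actual module operates on 2D pixel grids across frames): each query attends only to keys within a fixed window, so the cost grows nearly linearly with the number of positions rather than quadratically.

```python
import numpy as np

def sliding_window_attention(q, k, v, w=4):
    """q, k, v: (N, d) flattened features of the current / nearby frame.
    Each query attends only to keys within +/- w positions: O(N * w) cost."""
    N, d = q.shape
    out = np.zeros_like(v)
    for i in range(N):
        lo, hi = max(0, i - w), min(N, i + w + 1)
        scores = q[i] @ k[lo:hi].T / np.sqrt(d)
        weights = np.exp(scores - scores.max())   # numerically stable softmax
        weights /= weights.sum()
        out[i] = weights @ v[lo:hi]
    return out
```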
https://arxiv.org/abs/2309.11707
Image segmentation is a complex mathematical problem, especially for images that contain intensity inhomogeneity and tightly packed objects with missing boundaries in between. For instance, Magnetic Resonance (MR) muscle images often contain both of these issues, making muscle segmentation especially difficult. In this paper, we propose a novel intensity correction and a semi-automatic active-contour-based segmentation approach. The approach uses a geometric flow that incorporates a reproducing kernel Hilbert space (RKHS) edge detector and a geodesic distance penalty term from a set of markers and anti-markers. We test the proposed scheme on MR muscle segmentation and compare it with some state-of-the-art methods. To help deal with the intensity inhomogeneity in this particular kind of image, a new approach to estimating the bias field using a fat fraction image, called Prior Bias-Corrected Fuzzy C-means (PBCFCM), is introduced. Numerical experiments show that the proposed scheme leads to significantly better results than the compared ones. The average Dice values of the proposed method are 92.5%, 85.3%, and 85.3% for the quadriceps, hamstrings, and other muscle groups, while other approaches are at least 10% worse.
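For context, PBCFCM builds on fuzzy c-means. A minimal sketch of the standard FCM updates on flattened intensities follows (the paper's variant additionally incorporates a bias field prior estimated from the fat fraction image, which is not reproduced here):

```python
import numpy as np

def fuzzy_c_means(x, c=3, m=2.0, iters=50, eps=1e-8):
    """Standard FCM on flattened intensities x of shape (N,).
    Returns (c,) cluster centers and (c, N) soft memberships."""
    rng = np.random.default_rng(0)
    u = rng.random((c, x.size))
    u /= u.sum(axis=0)                                   # memberships sum to 1 per pixel
    for _ in range(iters):
        um = u ** m
        centers = um @ x / (um.sum(axis=1) + eps)        # fuzzily weighted centers
        d = np.abs(x[None, :] - centers[:, None]) + eps  # (c, N) distances to centers
        ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=1)                      # standard FCM membership update
    return centers, u
```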
https://arxiv.org/abs/2309.10935
Marine debris poses a significant threat to the survival of marine wildlife, often leading to entanglement and starvation, ultimately resulting in death. Therefore, removing debris from the ocean is crucial to restore the natural balance and allow marine life to thrive. Instance segmentation is an advanced form of object detection that identifies objects and precisely locates and separates them, making it an essential tool for autonomous underwater vehicles (AUVs) to navigate and interact with their underwater environment effectively. AUVs use image segmentation to analyze images captured by their cameras to navigate underwater environments. In this paper, we use instance segmentation to calculate the area of individual objects within an image. We use YOLOv7 in Roboflow to generate a set of bounding boxes for each object in the image, with a class label and a confidence score for every detection. A segmentation mask is then created for each object by applying a binary mask to the object's bounding box. The masks are generated by applying a binary threshold to the output of a convolutional neural network trained to segment objects from the background. Finally, the segmentation mask for each object is refined by applying post-processing techniques, such as morphological operations and contour detection, to improve its accuracy and quality. Estimating the area of the instance segmentation involves calculating the area of each segmented instance separately and then summing the areas of all instances to obtain the total area. The calculation is carried out using standard formulas based on the shape of the object, such as rectangles and circles. In cases where the object is complex, the Monte Carlo method is used to estimate the area. This method provides a higher degree of accuracy than traditional methods, especially when using a large number of samples.
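A hedged sketch of the Monte Carlo area estimate for one complex instance (illustrative, not the paper's exact pipeline): sample uniform points in the instance's bounding box and scale the hit fraction by the box area.

```python
import numpy as np

def monte_carlo_area(mask, n_samples=100_000, seed=0):
    """mask: 2D boolean array for one segmented instance.
    Returns the estimated instance area in pixels."""
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
    box_area = (y1 - y0) * (x1 - x0)
    rng = np.random.default_rng(seed)
    ry = rng.integers(y0, y1, n_samples)              # uniform samples in the box
    rx = rng.integers(x0, x1, n_samples)
    hit_fraction = mask[ry, rx].mean()                # fraction of samples inside the instance
    return hit_fraction * box_area

# Total area = sum of per-instance estimates over all detected objects.
```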
https://arxiv.org/abs/2309.10617
Current scene text image super-resolution approaches primarily focus on extracting robust features, acquiring text information, and designing complex training strategies to generate super-resolution images. However, the upsampling module, which is crucial in the process of converting low-resolution images to high-resolution ones, has received little attention in existing works. To address this issue, we propose the Pixel Adapter Module (PAM), based on graph attention, to address pixel distortion caused by upsampling. The PAM effectively captures local structural information by allowing each pixel to interact with its neighbors and update features. Unlike previous graph attention mechanisms, our approach achieves a 2-3 orders of magnitude improvement in efficiency and memory utilization by eliminating the dependency on sparse adjacency matrices and introducing a sliding-window approach for efficient parallel computation. Additionally, we introduce the MLP-based Sequential Residual Block (MSRB) for robust feature extraction from text images, and a Local Contour Awareness loss ($\mathcal{L}_{lca}$) to enhance the model's perception of details. Comprehensive experiments on TextZoom demonstrate that our proposed method generates high-quality super-resolution images, surpassing existing methods in recognition accuracy. For single-stage and multi-stage strategies, we achieved improvements of 0.7% and 2.6%, respectively, increasing the performance from 52.6% and 53.7% to 53.3% and 56.3%. The code is available at this https URL.
https://arxiv.org/abs/2309.08919
Computed Tomography (CT) is a medical imaging modality that can generate more informative 3D images than 2D X-rays. However, this advantage comes at the expense of more radiation exposure, higher costs, and longer acquisition time. Hence, the reconstruction of 3D CT images using a limited number of 2D X-rays has gained significant importance as an economical alternative. Nevertheless, existing methods primarily prioritize minimizing pixel/voxel-level intensity discrepancies, often neglecting the preservation of textural details in the synthesized images. This oversight directly impacts the quality of the reconstructed images and thus affects the clinical diagnosis. To address these deficits, this paper presents a new self-driven generative adversarial network model (SdCT-GAN), which is motivated to pay more attention to image details by introducing a novel auto-encoder structure in the discriminator. In addition, a Sobel Gradient Guider (SGG) idea is applied throughout the model, whereby the edge information from the 2D X-ray image at the input can be integrated. Moreover, the LPIPS (Learned Perceptual Image Patch Similarity) evaluation metric is adopted, which can quantitatively evaluate the fine contours and textures of reconstructed images better than existing metrics. Finally, the qualitative and quantitative results of the empirical studies justify the power of the proposed model compared to mainstream state-of-the-art baselines.
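For reference, the Sobel edge information that the SGG integrates can be extracted from the input X-ray with a standard pair of gradient filters; a minimal SciPy sketch (not the model's internal implementation):

```python
import numpy as np
from scipy import ndimage

def sobel_edge_map(img):
    """img: 2D float array (e.g., a normalized X-ray). Returns gradient magnitude."""
    gx = ndimage.sobel(img, axis=1)   # horizontal gradient
    gy = ndimage.sobel(img, axis=0)   # vertical gradient
    return np.hypot(gx, gy)
```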
https://arxiv.org/abs/2309.04960
To date, most instance segmentation approaches are based on supervised learning that requires a considerable amount of annotated object contours as training ground truth. Here, we propose a framework that searches for the target object based on a shape prior. The shape prior model is learned with a variational autoencoder that requires only a very limited amount of training data: in our experiments, a few dozen object shape patches from the target dataset, as well as purely synthetic shapes, were sufficient to achieve results on par with supervised methods with full access to training data on two out of three cell segmentation datasets. Our method with a synthetic shape prior was superior to pre-trained supervised models with access to limited domain-specific training data on all three datasets. Since the learning of prior models requires shape patches, whether from real or synthetic data, we call this framework semi-supervised learning.
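A minimal PyTorch sketch of a shape-prior VAE over flattened binary shape patches (the layer sizes and the 64x64 patch size are our assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class ShapeVAE(nn.Module):
    """VAE over flattened binary shape patches (e.g., 64x64)."""
    def __init__(self, n_pixels=64 * 64, latent=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_pixels, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent)
        self.logvar = nn.Linear(256, latent)
        self.dec = nn.Sequential(nn.Linear(latent, 256), nn.ReLU(),
                                 nn.Linear(256, n_pixels), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    """Reconstruction term + KL divergence to the standard normal prior."""
    bce = nn.functional.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return bce + kld
```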
https://arxiv.org/abs/2309.04888
Geodesic models are known as an efficient tool for solving various image segmentation problems. Most existing approaches only exploit local pointwise image features to track geodesic paths for delineating the object boundaries. However, such a segmentation strategy cannot take into account the connectivity of the image edge features, increasing the risk of the shortcut problem, especially in complicated scenarios. In this work, we introduce a new image segmentation model based on the minimal geodesic framework in conjunction with an adaptive cut-based circular optimal path computation scheme and a graph-based boundary proposal grouping scheme. Specifically, the adaptive cut can disconnect the image domain such that the target contours are imposed to pass through this cut only once. The boundary proposals are comprised of precomputed image edge segments, providing the connectivity information for our segmentation model. These boundary proposals are then incorporated into the proposed image segmentation model, such that the target segmentation contours are made up of a set of selected boundary proposals and the corresponding geodesic paths linking them. Experimental results show that the proposed model indeed outperforms state-of-the-art minimal-path-based image segmentation approaches.
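For intuition, a single minimal geodesic path between two boundary points over an edge-based cost map can be computed with scikit-image; this is a simplification that omits the paper's adaptive cut and boundary-proposal grouping:

```python
import numpy as np
from skimage.graph import route_through_array

def minimal_geodesic_path(edge_map, start, end, eps=1e-6):
    """edge_map: 2D array, large where edges are strong; start/end: (row, col).
    The minimal path prefers low-cost pixels, i.e., strong edges."""
    cost = 1.0 / (edge_map + eps)
    path, total_cost = route_through_array(cost, start, end,
                                           fully_connected=True, geometric=True)
    return np.array(path), total_cost
```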
https://arxiv.org/abs/2309.04169
Not all camouflages are equally effective, as even a partially visible contour or a slight color difference can make the animal stand out and break its camouflage. In this paper, we address the question of what makes a camouflage successful by proposing three scores for automatically assessing its effectiveness. In particular, we show that camouflage can be measured by the similarity between background and foreground features and by boundary visibility. We use these camouflage scores to assess and compare all available camouflage datasets. We also incorporate the proposed camouflage score into a generative model as an auxiliary loss and show that effective camouflage images or videos can be synthesised in a scalable manner. The generated synthetic dataset is used to train a transformer-based model for segmenting camouflaged animals in videos. Experimentally, we demonstrate state-of-the-art camouflage-breaking performance on the public MoCA-Mask benchmark.
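One plausible instantiation of such scores (our sketch, not the paper's exact definitions): cosine similarity between the mean foreground and background features, plus boundary visibility measured as the image edge strength along the mask contour.

```python
import numpy as np
from scipy import ndimage

def camouflage_scores(features, mask):
    """features: (H, W, C) feature map; mask: (H, W) boolean foreground mask
    (assumed non-trivial, with both foreground and background pixels)."""
    fg = features[mask].mean(axis=0)              # mean foreground feature
    bg = features[~mask].mean(axis=0)             # mean background feature
    sim = fg @ bg / (np.linalg.norm(fg) * np.linalg.norm(bg) + 1e-8)

    grad = np.hypot(ndimage.sobel(mask.astype(float), 0),
                    ndimage.sobel(mask.astype(float), 1))
    boundary = grad > 0                           # pixels on the mask contour
    edge_strength = np.hypot(*np.gradient(features.mean(axis=2)))
    visibility = edge_strength[boundary].mean()   # image edge strength along the contour
    return sim, visibility                        # high sim + low visibility = better camouflage
```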
https://arxiv.org/abs/2309.03899
In recent years, many mammographic image analysis methods have been introduced to improve cancer classification tasks. Two major issues in mammogram classification are leveraging multi-view mammographic information and handling class imbalance. For the first problem, many multi-view methods have been proposed that concatenate the features of two or more views for the training and inference stages. However, most existing multi-view methods are not explainable in terms of feature fusion and treat all views equally for diagnosis. Our work proposes a simple but novel method for enhancing the examined view (main view) by leveraging low-level feature information from the auxiliary view (ipsilateral view) before learning the high-level features that contain the cancerous characteristics. For the second issue, we also propose a simple but novel malignant mammogram synthesis framework for upsampling minority-class samples. Our easy-to-implement, training-free framework eliminates the current limitations of the CutMix algorithm: unreliable synthesized images due to randomly pasted patches, hard-contour problems, and domain shift problems. Our results on the VinDr-Mammo and CMMD datasets show the effectiveness of our two new frameworks for both multi-view training and synthesizing mammographic images, outperforming previous conventional methods in our experimental settings.
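For context, a minimal sketch of the standard CutMix operation, whose randomly placed rectangular pastes cause the hard-contour and reliability issues the proposed synthesis framework avoids:

```python
import numpy as np

def cutmix(img_a, img_b, alpha=1.0, seed=None):
    """Paste a random rectangle from img_b into img_a.
    Returns the mixed image and the area-adjusted mixing ratio."""
    rng = np.random.default_rng(seed)
    lam = rng.beta(alpha, alpha)                         # mixing ratio
    h, w = img_a.shape[:2]
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(0, h), rng.integers(0, w)      # random patch center
    y0, y1 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x0, x1 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    mixed = img_a.copy()
    mixed[y0:y1, x0:x1] = img_b[y0:y1, x0:x1]            # hard-contour paste
    lam_adj = 1 - (y1 - y0) * (x1 - x0) / (h * w)
    return mixed, lam_adj
```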
https://arxiv.org/abs/2309.03506
Deep learning-based automatic segmentation methods have become state-of-the-art. However, they are often not robust enough for direct clinical application, as domain shifts between training and testing data affect their performance. Failure in automatic segmentation can cause sub-optimal results that require correction. To address these problems, we propose a novel 3D extension of an interactive segmentation framework that represents a segmentation from a convolutional neural network (CNN) as a B-spline explicit active surface (BEAS). BEAS ensures segmentations are smooth in 3D space, increasing anatomical plausibility, while allowing the user to precisely edit the 3D surface. We apply this framework to the task of 3D segmentation of the anal sphincter complex (AS) from transperineal ultrasound (TPUS) images, and compare it to the clinical tool used in the pelvic floor disorder clinic (4D View VOCAL, GE Healthcare; Zipf, Austria). Experimental results show that: 1) the proposed framework gives the user explicit control of the surface contour; 2) the perceived workload calculated via the NASA-TLX index was reduced by 30% compared to VOCAL; and 3) it required 70% (170 seconds) less user time than VOCAL (p < 0.00001).
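As a 2D analogue of the BEAS idea, a smooth periodic B-spline can be fit to noisy contour points with SciPy; this is a sketch under our own assumptions (the clinical framework operates on 3D surfaces):

```python
import numpy as np
from scipy.interpolate import splprep, splev

def smooth_contour(points, smoothing=5.0, n_samples=200):
    """points: (N, 2) noisy closed-contour points.
    Returns a smooth, resampled contour from a periodic B-spline fit."""
    tck, _ = splprep([points[:, 0], points[:, 1]], s=smoothing, per=True)
    u = np.linspace(0, 1, n_samples)
    x, y = splev(u, tck)
    return np.stack([x, y], axis=1)
```

The spline's smoothing parameter plays the same role as the B-spline regularity of BEAS: the representation itself keeps the boundary smooth while the control points remain directly editable.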
https://arxiv.org/abs/2309.02335
The increase in security concerns due to technological advancements has led to the popularity of biometric approaches that utilize physiological or behavioral characteristics for enhanced recognition. Face recognition systems (FRSs) have become prevalent, but they are still vulnerable to image manipulation techniques such as face morphing attacks. This study investigates the impact of the alignment settings of input images on deep learning face morphing detection performance. We analyze the interconnections between the face contour and image context and suggest optimal alignment conditions for face morphing detection.
https://arxiv.org/abs/2309.00549
In light of the expanding population, an automated disease detection framework can assist doctors in the diagnosis of ocular diseases, yield accurate, stable, and rapid outcomes, and improve the success rate of early detection. The work first enhances the quality of fundus images by employing contrast-limited adaptive histogram equalization (CLAHE) and gamma correction. In this preprocessing stage, CLAHE elevates the local contrast of the fundus image and gamma correction increases the intensity of relevant features. This study then applies AMDNet23, a hybrid deep learning system that combines convolutional neural networks (CNN) with long short-term memory (LSTM) networks, to automatically detect age-related macular degeneration (AMD) from fundus images. In this mechanism, the CNN is utilized for extracting features and the LSTM for classifying the extracted features. The dataset of this research was collected from multiple sources and then subjected to quality assessment techniques; the 2000 experimental fundus images are distributed equitably across four distinct classes. The proposed hybrid deep AMDNet23 model demonstrates effective detection of AMD ocular disease, and the experiments achieved an accuracy of 96.50%, specificity of 99.32%, sensitivity of 96.5%, and F1-score of 96.49%. The system achieves state-of-the-art results on fundus imagery datasets for diagnosing AMD ocular disease, and the findings effectively demonstrate the potential of our method.
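A minimal OpenCV sketch of the described preprocessing (the clip limit, tile grid, and gamma value are illustrative choices, not the paper's settings):

```python
import cv2
import numpy as np

def preprocess_fundus(img_gray, gamma=1.5):
    """img_gray: 8-bit grayscale fundus image."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(img_gray)              # local contrast enhancement
    lut = np.array([((i / 255.0) ** (1.0 / gamma)) * 255
                    for i in range(256)], dtype=np.uint8)
    return cv2.LUT(enhanced, lut)                 # gamma correction via lookup table
```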
https://arxiv.org/abs/2308.15822
We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, the Polynomial Band (PB). The representation uses four polynomial curves to fit a text's top, bottom, left, and right sides, so it can capture a text with a complex shape by varying the polynomial coefficients. PB has appealing features compared with conventional representations: 1) it can model different curvatures with a fixed number of parameters, while polygon-points-based methods need to utilize a different number of points; 2) it can distinguish adjacent or overlapping texts, as they have clearly different curve coefficients, while segmentation-based or points-based methods suffer from adhesive spatial positions. PBFormer combines the PB with the transformer, which can directly generate smooth text contours sampled from predicted curves without interpolation. A parameter-free cross-scale pixel attention (CPA) module is employed to highlight the feature map of a suitable scale while suppressing the other feature maps. This simple operation helps detect small-scale texts and is compatible with the one-stage DETR framework, where no NMS post-processing is needed. Furthermore, PBFormer is trained with a shape-contained loss, which not only enforces the piecewise alignment between the ground truth and the predicted curves but also makes the curves' positions and shapes consistent with each other. Without bells and whistles such as text pre-training, our method is superior to the previous state-of-the-art text detectors on arbitrary-shaped text datasets.
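To illustrate the PB representation, fitting one side of a text region with a fixed-degree polynomial is a single least-squares solve; note that PBFormer predicts such coefficients directly rather than fitting them (the degree of 3 is our assumption):

```python
import numpy as np

def fit_side(xs, ys, degree=3):
    """Fit y = p(x) to boundary samples of one text side;
    returns degree+1 coefficients regardless of how many samples are given."""
    return np.polyfit(xs, ys, degree)

def sample_side(coeffs, x0, x1, n=50):
    """Sample a smooth curve from the fitted polynomial, no interpolation needed."""
    x = np.linspace(x0, x1, n)
    return x, np.polyval(coeffs, x)
```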
https://arxiv.org/abs/2308.15004
This paper presents a new database consisting of concurrent articulatory and acoustic speech data. The articulatory data correspond to ultrasound videos of the vocal tract dynamics, which allow the visualization of the upper contour of the tongue during the speech production process. The acoustic data are composed of 30 short sentences acquired with a directional cardioid microphone. This database includes data from 17 young subjects (8 male and 9 female) from the Santander region in Colombia, who reported not having any speech pathology.
https://arxiv.org/abs/2308.13941
Ultrasound (US) image segmentation is an active research area that requires real-time and highly accurate analysis in many scenarios. Detect-to-segment (DTS) frameworks have recently been proposed to balance accuracy and efficiency. However, existing approaches may suffer from inadequate contour encoding or fail to effectively leverage the encoded results. In this paper, we introduce a novel Fourier-anchor-based DTS framework called the Fourier Feature Pyramid Network (FFPN) to address the aforementioned issues. The contributions of this paper are twofold. First, the FFPN utilizes Fourier descriptors to adequately encode contours. Specifically, it maps Fourier series with similar amplitudes and frequencies into the same layer of the feature map, thereby effectively utilizing the encoded Fourier information. Second, we propose a Contour Sampling Refinement (CSR) module based on the contour proposals and refined features produced by the FFPN. This module extracts rich features around the predicted contours to further capture detailed information and refine the contours. Extensive experimental results on three large and challenging datasets demonstrate that our method outperforms other DTS methods in terms of accuracy and efficiency. Furthermore, our framework generalizes well to other detection and segmentation tasks.
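To see why a few Fourier coefficients can adequately encode a contour, here is a sketch of truncated Fourier reconstruction (our illustration, not the FFPN head): keeping only the lowest-frequency terms already recovers the overall shape, and a larger k restores detail.

```python
import numpy as np

def reconstruct_contour(contour, k=8):
    """contour: (N, 2) ordered boundary points, with N > 2k + 1.
    Keeps the DC term plus the k lowest positive and negative frequencies."""
    z = contour[:, 0] + 1j * contour[:, 1]
    coeffs = np.fft.fft(z)
    keep = np.zeros_like(coeffs)
    keep[:k + 1] = coeffs[:k + 1]                # DC term and positive frequencies
    keep[-k:] = coeffs[-k:]                      # negative frequencies
    z_rec = np.fft.ifft(keep)
    return np.stack([z_rec.real, z_rec.imag], axis=1)
```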
https://arxiv.org/abs/2308.13790
Dynamic contrast-enhanced (DCE) cardiac magnetic resonance imaging (CMRI) is a widely used modality for diagnosing myocardial blood flow (perfusion) abnormalities. During a typical free-breathing DCE-CMRI scan, close to 300 time-resolved images of myocardial perfusion are acquired at various contrast "wash in/out" phases. Manual segmentation of myocardial contours in each time-frame of a DCE image series can be tedious and time-consuming, particularly when non-rigid motion correction has failed or is unavailable. While deep neural networks (DNNs) have shown promise for analyzing DCE-CMRI datasets, a "dynamic quality control" (dQC) technique for reliably detecting failed segmentations is lacking. Here we propose a new space-time uncertainty metric as a dQC tool for DNN-based segmentation of free-breathing DCE-CMRI datasets, validating the proposed metric on an external dataset and establishing a human-in-the-loop framework to improve the segmentation results. In the proposed approach, we referred the top 10% most uncertain segmentations, as detected by our dQC tool, to a human expert for refinement. This approach resulted in a significant increase in the Dice score (p<0.001) and a notable decrease in the number of images with failed segmentation (16.2% to 11.3%), whereas the alternative approach of randomly selecting the same number of segmentations for human referral did not achieve any significant improvement. Our results suggest that the proposed dQC framework has the potential to accurately identify poor-quality segmentations and may enable efficient DNN-based analysis of DCE-CMRI in a human-in-the-loop pipeline for clinical interpretation and reporting of dynamic CMRI datasets.
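A hedged sketch of one space-time uncertainty measure in this spirit (our simplification, not the paper's exact metric): average the per-pixel predictive entropy of the softmax output over space and time for each series, then refer the top 10% most uncertain series to the expert.

```python
import numpy as np

def series_uncertainty(probs, eps=1e-8):
    """probs: (T, C, H, W) softmax outputs for one DCE time series."""
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)   # (T, H, W) pixel entropy
    return entropy.mean()                                    # average over space and time

def refer_to_expert(all_probs, frac=0.10):
    """all_probs: list of (T, C, H, W) arrays, one per series.
    Returns indices of the most uncertain fraction for human refinement."""
    scores = np.array([series_uncertainty(p) for p in all_probs])
    n_refer = max(1, int(len(scores) * frac))
    return np.argsort(scores)[::-1][:n_refer]
```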
https://arxiv.org/abs/2308.13488
The portrait matting task aims to extract an alpha matte with complete semantics and finely detailed contours. In comparison to CNN-based approaches, transformers with self-attention allow a larger receptive field, enabling them to better capture long-range dependencies and low-frequency semantic information of a portrait. However, recent research shows that the self-attention mechanism struggles with modeling high-frequency information and capturing fine contour details, which can lead to bias when predicting the portrait's contours. To address this problem, we propose EFormer to enhance the model's attention towards semantic and contour features, especially the latter, which are surrounded by a large amount of high-frequency detail. We build a semantic and contour detector (SCD) to accurately capture the distribution of semantic and contour features. We further design a contour-edge extraction branch and a semantic extraction branch for refining contour features and completing semantic information. Finally, we fuse the two kinds of features and leverage the segmentation head to generate the predicted portrait matte. Remarkably, EFormer is an end-to-end trimap-free method with a simple structure. Experiments conducted on the VideoMatte240K-JPEGSD and AIM datasets demonstrate that EFormer outperforms previous portrait matting methods.
https://arxiv.org/abs/2308.12831
Text-based audio generation models have limitations as they cannot encompass all the information in audio, leading to restricted controllability when relying solely on text. To address this issue, we propose a novel model that enhances the controllability of existing pre-trained text-to-audio models by incorporating additional conditions including content (timestamp) and style (pitch contour and energy contour) as supplements to the text. This approach achieves fine-grained control over the temporal order, pitch, and energy of generated audio. To preserve the diversity of generation, we employ a trainable control condition encoder that is enhanced by a large language model and a trainable Fusion-Net to encode and fuse the additional conditions while keeping the weights of the pre-trained text-to-audio model frozen. Due to the lack of suitable datasets and evaluation metrics, we consolidate existing datasets into a new dataset comprising the audio and corresponding conditions and use a series of evaluation metrics to evaluate the controllability performance. Experimental results demonstrate that our model successfully achieves fine-grained control to accomplish controllable audio generation. Audio samples and our dataset are publicly available at this https URL
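For instance, an energy contour condition can be as simple as a frame-wise RMS curve over the waveform (a sketch with illustrative frame sizes; a pitch contour would instead come from a pitch tracker such as pYIN):

```python
import numpy as np

def energy_contour(wav, frame_len=1024, hop=256):
    """wav: 1D float waveform. Returns one RMS energy value per frame."""
    n_frames = 1 + max(0, (len(wav) - frame_len)) // hop
    frames = np.stack([wav[i * hop: i * hop + frame_len] for i in range(n_frames)])
    return np.sqrt((frames ** 2).mean(axis=1))
```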
https://arxiv.org/abs/2308.11940
Detecting small scene text instances in the wild is particularly challenging, as irregular positions and non-ideal lighting often lead to detection errors. We present MixNet, a hybrid architecture that combines the strengths of CNNs and Transformers, capable of accurately detecting small text in challenging natural scenes regardless of orientation, style, and lighting conditions. MixNet incorporates two key modules: (1) the Feature Shuffle Network (FSNet), which serves as the backbone, and (2) the Central Transformer Block (CTBlock), which exploits the 1D manifold constraint of scene text. We first introduce a novel feature shuffling strategy in FSNet to facilitate the exchange of features across multiple scales, generating high-resolution features superior to those of the popular ResNet and HRNet. The FSNet backbone achieves significant improvements over many existing text detection methods, including PAN, DB, and FAST. We then design a complementary CTBlock that leverages center-line-based features, similar to the medial axis of text regions, and show that it can outperform contour-based approaches in challenging cases where small scene texts appear close together. Extensive experimental results show that MixNet, which mixes FSNet with CTBlock, achieves state-of-the-art results on multiple scene text detection datasets.
https://arxiv.org/abs/2308.12817