Crowd counting is a key aspect of crowd analysis and has typically been accomplished by estimating a crowd-density map and summing over the density values. However, this approach suffers from background noise accumulation and loss of density due to the use of broad Gaussian kernels to create the ground truth density maps. This issue can be overcome by narrowing the Gaussian kernel. However, existing approaches perform poorly when trained with such ground truth density maps. To overcome this limitation, we propose using conditional diffusion models to predict density maps, as diffusion models are known to model complex distributions well and show high fidelity to training data during crowd-density map generation. Furthermore, as the intermediate time steps of the diffusion process are noisy, we incorporate a regression branch for direct crowd estimation only during training to improve the feature learning. In addition, owing to the stochastic nature of the diffusion model, we generate multiple density maps to improve counting performance, in contrast to existing crowd-counting pipelines. We also depart from density summation and introduce contour detection followed by summation as the counting operation, which is more immune to background noise. We conduct extensive experiments on public datasets to validate the effectiveness of our method. Specifically, our novel crowd-counting pipeline reduces the crowd-counting error by up to $6\%$ on JHU-CROWD++ and up to $7\%$ on UCF-QNRF.
人群计数是人群分析的关键方面,通常通过估计人群密度图并累加密度值来实现。然而,这种方法受到背景噪声的积累和密度损失的影响,因为使用广泛的高斯曲率Kernel来创建基准密度图。解决这个问题可以通过减小高斯曲率Kernel来实现。然而,现有的方法在训练时表现并不理想,与使用基准密度图训练的情况相比。为了克服这一限制,我们建议使用条件扩散模型来预测密度图,因为扩散模型已知能够很好地模拟复杂的分布,并在生成人群密度图时表现出与训练数据高度相似的特点。此外,由于扩散过程的中间时间步骤是噪声的,仅在训练期间才引入直接人群估计的回归分支,以改善特征学习。此外,由于扩散模型的随机性质,我们引入了生成多个密度图来改善计数性能,与现有的人群计数管道相反。此外,我们还与密度累加不同,引入了轮廓检测和累加作为计数操作,更能够抵御背景噪声。我们在公共数据集上进行广泛的实验来验证我们方法的有效性。具体来说,我们的新型人群计数管道在JHU-CROWD++数据集上提高了人群计数误差高达6%,在UCF-QNRF数据集上高达7%。
https://arxiv.org/abs/2303.12790
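The crowd-counting abstract above contrasts summing a density map with counting detected contours. A minimal sketch of that difference follows, assuming a toy density map and using plain connected-component labeling as a stand-in for contour detection; the function names, the threshold, and the 8-connectivity choice are illustrative assumptions and not the authors' implementation.

```python
from collections import deque


def count_blobs(density, threshold=0.5):
    """Count 8-connected regions whose density exceeds `threshold`."""
    rows, cols = len(density), len(density[0])
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if density[r][c] <= threshold or seen[r][c]:
                continue
            # New blob found: flood-fill it so it is counted only once.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and not seen[ny][nx]
                                and density[ny][nx] > threshold):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def density_sum(density):
    """Classic counting operation: integrate the whole density map."""
    return sum(sum(row) for row in density)


# Toy map (assumed data): two sharp peaks on a low-level noisy background.
toy_map = [
    [0.05, 0.05, 0.05, 0.05, 0.05],
    [0.05, 1.00, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05],
    [0.05, 0.05, 0.05, 1.00, 0.05],
    [0.05, 0.05, 0.05, 0.05, 0.05],
]
```

On this toy map the background noise inflates the summed count above two, while blob counting ignores it, which is the kind of robustness to background noise the abstract refers to.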
Organ at risk (OAR) segmentation in computed tomography (CT) imagery is a difficult task for automated segmentation methods and can be crucial for downstream radiation treatment planning. U-net has become a de-facto standard for medical image segmentation and is frequently used as a common baseline in medical image segmentation tasks. In this paper, we develop a multiple decoder U-net architecture where a noisy auxiliary decoder is used to generate noisy segmentation. The segmentation from the main branch and the noisy segmentation from the auxiliary branch are used together to estimate the attention. Our contribution is the development of a new attention module which derives the attention from the softmax probabilities of two decoder branches. The union and intersection of the two segmentation masks from the two branches carry information about where the decoders agree and disagree. The softmax probabilities from regions of agreement and disagreement are the indicators of low and high uncertainty. Thus, the probabilities of those selected regions are used as attention in the bottleneck layer of the encoder and are passed only through the main decoder for segmentation. For accurate contour segmentation, we also developed a CT intensity integrated regularization loss. We tested our model on two publicly available OAR challenge datasets, Segthor and LCTSC. We trained 12 models on each dataset with and without the proposed attention model and regularization loss to check the effectiveness of the attention module and the regularization loss. The experiments demonstrate a clear accuracy improvement (2\% to 5\% Dice) on both datasets. Code for the experiments will be made available upon acceptance for publication.
在计算机断层扫描(CT)图像中的器官受创(OAR)分割是一项自动化分割方法难以完成的任务,对于后续辐射治疗规划来说可能是关键的。U-net已经成为医学图像分割的事实上的标准,经常用作医学图像分割任务的共同点基线。在本文中,我们开发了一种多个解码器U-net架构,其中一个噪声辅助解码器用于产生噪声分割。主分支和辅助分支的噪声分割一起用于估计注意力。我们的贡献是开发了一个新的注意力模块,从两个解码分支的softmax概率中获取注意力。两个分支中的分割掩码的交集和并集携带了 both decoders 同意和不同意的信息。同意和不同意区域的softmax概率是低和高不确定性的 indicators。因此,选择的那些区域的概率被用于编码器的瓶颈层的注意力,仅通过主解码器进行分割。对于准确的轮廓分割,我们还开发了一种CT强度集成的 Regularization Loss。我们测试了我们的模型在两个公开的OAR挑战数据集上,分别是Segthor 和 LCTSC。在每只数据集上,我们训练了12只模型,同时使用和不使用提出的注意力模型和 Regularization Loss,以检查注意力模块和 Regularization Loss的有效性。实验表明,在每只数据集上,明显提高了精度(2\%至5\%的Dice)。代码将在接受出版后提供。
https://arxiv.org/abs/2303.10796
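The OAR abstract above derives agreement and disagreement regions from the union and intersection of two decoder masks. The set operations can be sketched as below, assuming per-pixel foreground probabilities for a single class and a 0.5 threshold; `agreement_maps` is a hypothetical helper, and how the paper converts these regions into attention weights is not stated in the abstract, so that step is not shown.

```python
def agreement_maps(p_main, p_aux, threshold=0.5):
    """Return (intersection, union, disagreement) masks for two probability maps.

    p_main, p_aux: 2D lists of foreground probabilities from the main and
    auxiliary decoders (assumed inputs of identical shape).
    """
    inter, union, disagree = [], [], []
    for row_m, row_a in zip(p_main, p_aux):
        m = [pm > threshold for pm in row_m]   # main-branch binary mask
        a = [pa > threshold for pa in row_a]   # auxiliary-branch binary mask
        inter.append([int(x and y) for x, y in zip(m, a)])      # both say foreground
        union.append([int(x or y) for x, y in zip(m, a)])       # either says foreground
        disagree.append([int(x != y) for x, y in zip(m, a)])    # union minus intersection
    return inter, union, disagree


# Assumed toy probabilities: the decoders agree on pixels 0 and 2, disagree on pixel 1.
inter, union, disagree = agreement_maps([[0.9, 0.8, 0.2]], [[0.7, 0.3, 0.1]])
```

Pixels in the intersection are where both branches agree on foreground (low uncertainty in the abstract's terms), while the disagreement mask marks the high-uncertainty regions.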
Infrared small target detection (ISTD) has a wide range of applications in early warning, rescue, and guidance. However, CNN-based deep learning methods are not effective at segmenting infrared small targets (IRST), which lack clear contour and texture features, and transformer-based methods also struggle to achieve significant results due to the absence of convolutional inductive bias. To address these issues, we propose a new model called attention with bilinear correlation (ABC), which is based on the transformer architecture and includes a convolution linear fusion transformer (CLFT) module with a novel attention mechanism for feature extraction and fusion, which effectively enhances target features and suppresses noise. Additionally, our model includes a u-shaped convolution-dilated convolution (UCDC) module located in the deeper layers of the network, which takes advantage of the smaller resolution of deeper features to obtain finer semantic information. Experimental results on public datasets demonstrate that our approach achieves state-of-the-art performance. Code is available at this https URL
红外小目标检测(ISTD)在预警、救援和 guidance 等领域具有广泛的应用。然而,基于卷积神经网络的深度学习方法在分割红外小目标(IRST)方面效果不佳,因为它缺乏清晰的轮廓和纹理特征。Transformer 方法也由于卷积诱导偏差而无法取得显著结果。为了解决这些问题,我们提出了一种新模型,称为注意力具有双端相关性(ABC),它基于Transformer 架构,包括一个卷积线性融合Transformer(CLFT)模块,并配备了一种独特的注意力机制,用于特征提取和融合,有效地增强了目标特征,抑制了噪声。此外,我们的模型还包括一个U形卷积膨胀卷积(UCDC)模块,位于网络较深层的神经元中,利用较深特征的较小分辨率获取更精细的语义信息。公开数据集的实验结果表明,我们的 approach 取得了最先进的性能。代码可在 this https URL 中找到。
https://arxiv.org/abs/2303.10321
For semantic segmentation in urban scene understanding, RGB cameras alone often fail to capture a clear holistic topology, especially in challenging lighting conditions. The thermal signal is an informative additional channel that can bring to light the contour and fine-grained texture of blurred regions in low-quality RGB images. Aiming at RGB-T (thermal) segmentation, existing methods either use simple passive channel/spatial-wise fusion for cross-modal interaction, or rely on heavy labeling of ambiguous boundaries for fine-grained supervision. We propose a Spatial-aware Demand-guided Recursive Meshing (SpiderMesh) framework that: 1) proactively compensates for inadequate contextual semantics in optically-impaired regions via a demand-guided target masking algorithm; 2) refines multimodal semantic features with recursive meshing to improve pixel-level semantic analysis performance. We further introduce an asymmetric data augmentation technique, M-CutOut, and enable semi-supervised learning to fully utilize RGB-T labels that are only sparsely available in practical use. Extensive experiments on MFNet and PST900 datasets demonstrate that SpiderMesh achieves new state-of-the-art performance on standard RGB-T segmentation benchmarks.
在城市场景理解中,仅使用RGB相机往往无法捕捉清晰的整体拓扑,特别是在挑战性的照明条件下。热信号是一种有用的额外通道,可以在低质量RGB图像中的模糊区域中揭示轮廓和精细的纹理。为了处理RGB-T(热)分割,现有方法要么使用简单的被动通道/空间wise fusion来进行跨模态交互,要么依赖严重的标记模糊边界进行精细监督。我们提出了一个空间aware的需求引导Recursive Meshing框架(SpiderMesh),该框架: 1)通过需求引导的目标掩膜算法预补偿光学受损区域的不足上下文语义; 2)通过Recursive Meshing优化多模态语义特征,以提高像素级语义分析性能。我们还介绍了一种非对称数据增强技术M-CutOut,并实现了半监督学习,以便充分利用在实际应用中罕见的RGB-T标签。MFNet和PST900数据集的广泛实验表明,SpiderMesh在标准RGB-T分割基准测试中实现了新的先进性能。
https://arxiv.org/abs/2303.08692
Analyzing the dynamic changes of cellular morphology is important for understanding the various functions and characteristics of live cells, including stem cells and metastatic cancer cells. To this end, we need to track all points on the highly deformable cellular contour in every frame of live cell video. Local shapes and textures on the contour are not evident, and their motions are complex, often with expansion and contraction of local contour features. Prior art in optical flow and deep point set tracking is unsuited due to the fluidity of cells, and previous deep contour tracking does not consider point correspondence. We propose the first deep learning-based tracking of cellular (or, more generally, viscoelastic material) contours with point correspondence by fusing dense representation between two contours with cross attention. Since it is impractical to manually label dense tracking points on the contour, unsupervised learning comprising mechanical and cyclical consistency losses is proposed to train our contour tracker. The mechanical loss, which forces the points to move perpendicular to the contour, proves effective. For quantitative evaluation, we labeled sparse tracking points along the contour of live cells from two live cell datasets taken with phase contrast and confocal fluorescence microscopes. Our contour tracker quantitatively outperforms the compared methods and produces qualitatively more favorable results. Our code and data are publicly available at this https URL
分析细胞形态的动态变化对于理解细胞的各种功能以及特性,包括干细胞和转移癌症细胞的重要性很重要。为此,我们需要在每个帧的 live cell 视频中跟踪高度可变形的细胞轮廓的所有点。轮廓上的局部形状和纹理并不显著,它们的运动是复杂的,常常伴随着局部轮廓特征的扩张和收缩。由于光学流或深度点跟踪的先前技术由于细胞的流动性而无法适用,而先前的深度轮廓跟踪也没有考虑点对应关系。我们提出了一种基于深度学习的点对应细胞(或更一般的黏性材料)轮廓跟踪方法,通过将两个轮廓的密集表示相结合并交叉关注来 fusion。由于在轮廓上手动标注密集跟踪点是不可能的,我们建议将机械和周期性一致性损失组成 unsupervised 学习来训练我们的轮廓跟踪器。机械损失迫使点在与轮廓垂直的方向移动,有效地帮助了。为了进行量化评估,我们从两个使用相位 contrast 和单光子共轭荧光显微镜拍摄的 live cell 数据集中标记了稀疏跟踪点沿着细胞轮廓。我们的轮廓跟踪器在数值上超越了比较方法,并产生了更好的定性结果。我们的代码和数据在这个 https URL 上公开可用。
https://arxiv.org/abs/2303.08364
Unified visual grounding pursues a simple and generic technical route to leverage multi-task data with less task-specific design. The most advanced methods typically present boxes and masks as vertex sequences to model referring detection and segmentation as an autoregressive sequential vertex generation paradigm. However, generating high-dimensional vertex sequences sequentially is error-prone because the upstream of the sequence remains static and cannot be refined based on downstream vertex information, even if there is a significant location gap. Besides, with limited vertices, the inferior fitting of objects with complex contours restricts the performance upper bound. To deal with this dilemma, we propose a parallel vertex generation paradigm for superior high-dimension scalability with a diffusion model by simply modifying the noise dimension. An intuitive materialization of our paradigm is Parallel Vertex Diffusion (PVD), which directly sets vertex coordinates as the generation target and uses a diffusion model to train and infer. We claim that it has two flaws: (1) unnormalized coordinates cause high variance in the loss value; (2) the original training objective of PVD only considers point consistency but ignores geometry consistency. To solve the first flaw, Center Anchor Mechanism (CAM) is designed to convert coordinates into normalized offset values to stabilize the training loss value. For the second flaw, Angle summation loss (ASL) is designed to constrain the geometry difference between prediction and ground truth vertices for geometry-level consistency. Empirical results show that our PVD achieves state-of-the-art performance in both referring detection and segmentation, and our paradigm is more scalable and efficient than sequential vertex generation with high-dimension data.
统一的视觉接地追求一种简单且通用的技术路线,以利用较少任务特定的设计来处理具有多个任务的数据和任务。最先进的方法通常会将盒子和掩膜作为顶点序列来建模参考检测和分割,将其作为自回归顶点生成范式来处理。然而,生成高维顶点序列逐个进行是易犯错误的,因为前一个序列序列一直保持静态,并且无法基于后一个顶点信息进行 refined,即使存在显著的位置差距。此外,由于顶点数量有限,具有复杂形态的物体的不佳适配性限制了性能的上界。为了解决这一困境,我们提出了一种并行顶点生成范式,以使用扩散模型来扩展高维 scalability,而只需要修改噪声维度。我们的范式的直观实现是并行顶点扩散(PVD),直接设置顶点坐标作为生成目标,并使用扩散模型进行训练和推断。我们声称它有两个缺陷:(1)未标准化坐标导致了高方差的损失值;(2) PVD 的训练目标最初仅考虑点一致性,而忽视了几何一致性。为了解决第一个缺陷,中心锚机制(CAM)被设计为将坐标转换为标准化偏移值,以稳定训练损失值。对于第二个缺陷,角度累积损失(ASL)被设计为限制预测和实际顶点几何差异,以维持几何级一致性。经验数据表明,我们的 PVD 在参考检测和分割方面实现了最先进的性能,而我们的范式比使用高维数据Sequential 顶点生成更具有扩展性和效率。
https://arxiv.org/abs/2303.07216
This paper addresses the collision avoidance problem of UAV swarms in three-dimensional (3D) space. The key challenges are energy efficiency and cooperation of swarm members. We propose to combine Artificial Potential Field (APF) with Particle Swarm Planning (PSO). APF provides environmental awareness and implicit coordination to UAVs. PSO searches for the optimal trajectories for each UAV in terms of safety and energy efficiency by minimizing a fitness function. The fitness function exploits the advantages of the Active Contour Model in image processing for trajectory planning. Lastly, vehicle-to-vehicle collisions are detected in advance based on trajectory prediction and are resolved by cooperatively adjusting the altitude of UAVs. Simulation results demonstrate that our method can save up to 80\% of energy compared to state-of-the-art schemes.
本 paper 讨论了无人机群在三维空间中的避免碰撞问题。关键挑战是能源效率和群组成员的合作。我们提议将Artificial Potential Field(APF)与粒子群规划(PSO)相结合。APF为无人机提供了环境意识和隐含的协调。PSO通过最小化一个适应函数,为每个无人机寻找安全和能源效率最佳的路径。适应函数利用图像处理中的主动轮廓模型在路径规划方面的优势。最后,基于路径预测,提前检测到车对车的碰撞,并通过合作调整无人机的高度来解决。模拟结果显示,与我们现有的算法相比,这种方法可以节省高达80%的能源。
https://arxiv.org/abs/2303.06510
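The UAV abstract above relies on an Artificial Potential Field for environmental awareness. A minimal sketch of the classic attractive and repulsive potentials follows, assuming point obstacles in 3D and the textbook quadratic forms; the gains `k_att`, `k_rep` and the influence radius are illustrative assumptions, and the paper's actual fitness function (which also involves an Active Contour Model term) is not reproduced here.

```python
import math


def attractive_potential(pos, goal, k_att=1.0):
    """Quadratic well pulling the UAV toward the goal."""
    return 0.5 * k_att * math.dist(pos, goal) ** 2


def repulsive_potential(pos, obstacle, influence=2.0, k_rep=1.0):
    """Barrier that is zero outside the obstacle's influence radius."""
    d = math.dist(pos, obstacle)
    if d >= influence:
        return 0.0
    d = max(d, 1e-9)  # avoid division by zero when exactly on the obstacle
    return 0.5 * k_rep * (1.0 / d - 1.0 / influence) ** 2


def total_potential(pos, goal, obstacles):
    """Sum of the attractive term and all repulsive terms at `pos`."""
    return attractive_potential(pos, goal) + sum(
        repulsive_potential(pos, o) for o in obstacles
    )
```

A planner such as PSO could evaluate candidate waypoints with a cost built from a function like `total_potential`, so that trajectories close to obstacles are penalized; that coupling is an assumption about usage rather than a description of the paper's method.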
Hand pose estimation (HPE) is a task that predicts and describes the hand poses from images or video frames. When HPE models estimate hand poses captured in a laboratory or under controlled environments, they normally deliver good performance. However, the real-world environment is complex, and various uncertainties may arise, which could degrade the performance of HPE models. For example, the hands could be occluded, the visibility of hands could be reduced by imperfect exposure rate, and the contour of hands is prone to blurring during fast hand movements. In this work, we adopt metamorphic testing to evaluate the robustness of HPE models and provide suggestions on the choice of HPE models for different applications. The robustness evaluation was conducted on four state-of-the-art models, namely MediaPipe hands, OpenPose, BodyHands, and NSRM hand. We found that on average more than 80\% of the hands could not be identified by BodyHands, and at least 50\% of hands could not be identified by MediaPipe hands when diagonal motion blur is introduced, while an average of more than 50\% of strongly underexposed hands could not be correctly estimated by NSRM hand. Similarly, applying occlusions on only four hand joints also largely degrades the performance of these models. The experimental results show that occlusions, illumination variations, and motion blur are the main obstacles to the performance of existing HPE models. These findings may pave the way for researchers to improve the performance and robustness of hand pose estimation models and their applications.
手姿态估计(HPE)是一个从图像或视频帧中预测和描述手姿态的任务。当HPE模型估计在实验室或受控环境中捕获的手姿态时,通常能够表现出色。然而,现实世界是复杂的,各种不确定性可能会发生,这可能会削弱HPE模型的性能。例如, hands 可能会被 occlusion, hands 的可见性可能因为不完美的曝光率而降低,而且手的轮廓在快速手移动时可能会变得模糊。在本文中,我们采用变形测试来评估HPE模型的鲁棒性,并为不同应用程序选择 HPE 模型的建议。鲁棒性评估是针对四个最先进的模型进行的,包括 MediaPipe hands、OpenPose、BodyHands 和 NSRM hand。我们发现,平均来说,超过 80% 的手无法通过 BodyHands 识别,当对角运动模糊引入时,超过 50% 的手无法通过 MediaPipe hands 正确识别,而 strongly under exposed hands 的平均超过 50% 无法通过 NSRM hand 准确地估计。类似地,仅应用 occlusions 在每个手关节上也会极大地削弱这些模型的性能。实验结果显示, occlusion、照明变化和运动模糊是现有 HPE 模型性能的主要障碍。这些发现可能为研究人员改进手姿态估计模型及其应用的性能铺平了道路。
https://arxiv.org/abs/2303.04566
We consider the motion planning problem for stochastic nonlinear systems in uncertain environments. More precisely, in this problem the robot has stochastic nonlinear dynamics and uncertain initial locations, and the environment contains multiple dynamic uncertain obstacles. Obstacles can be of arbitrary shape, can deform, and can move. The uncertainties do not necessarily have Gaussian distributions. This general setting has been considered and solved in [1]. In addition to the assumptions above, in this paper, we consider long-term tasks, where the planning method in [1] would fail, as the uncertainty of the system states grows too large over a long time horizon. Unlike [1], we present a real-time online motion planning algorithm. We build discrete-time motion primitives and their corresponding continuous-time tubes offline, so that almost all system states of each motion primitive are guaranteed to stay inside the corresponding tube. We convert probabilistic safety constraints into a set of deterministic constraints called risk contours. During online execution, we verify the safety of the tubes against deterministic risk contours using sum-of-squares (SOS) programming. The provided SOS-based method verifies the safety of the tube in the presence of uncertain obstacles in real time, without the need for uncertainty samples or time discretization. By bounding the probability of the system states staying inside the tube and bounding the probability of the tube colliding with obstacles, our approach guarantees a bounded probability of system states colliding with obstacles. We demonstrate our approach on several long-term robotics tasks.
我们考虑在不确定环境中处理随机非线性系统的运动规划问题。更具体地说,在这个问题上,机器人具有随机非线性动力学和不确定初始位置,环境中包含多个动态不确定障碍物。障碍物可以具有任意形状,可以变形,可以移动。所有不确定性不一定具有高斯分布。这个问题的一般情况已经在[1]中考虑并解决了。除了以上假设,在本文中,我们考虑长期任务,在这些任务中,[1]中的规划方法可能会失败,因为系统状态的不确定性在长时间内 horizon 内变得越来越大。与[1]不同,我们提出了实时在线运动规划算法。我们 offline 构建离散时间运动基本单元及其对应的连续时间管道,以确保每个运动基本单元系统的大部分状态都必定留在相应的管道内。我们将其概率安全性限制转换为称为风险轮廓的一组确定性限制。在在线执行期间,我们使用平方根编程方法验证管道的安全性,对抗确定性风险轮廓。提供的基于平方根的方法在存在不确定障碍物的情况下验证管道的安全性,而无需实时的不确定性样本和时间离散化。通过限制系统状态留在管道内的概率以及限制管道与障碍物碰撞的概率,我们的方法保证了系统状态碰撞的概率限制。我们展示了我们的方法和几个长期机器人任务。
https://arxiv.org/abs/2303.01631
Why do Recurrent State Space Models such as PlaNet fail at cloth manipulation tasks? Recent work has attributed this to the blurry reconstruction of the observation, which makes it difficult to plan directly in the latent space. This paper explores the reasons behind this by applying PlaNet in the pick-and-place cloth-flattening domain. We find that the sharp discontinuity of the transition function on the contour of the article makes it difficult to learn an accurate latent dynamic model. By adopting KL balancing and latent overshooting in the training loss and adjusting the planned picking position to the closest part of the cloth, we show that the updated PlaNet-Pick model can achieve state-of-the-art performance using latent MPC algorithms in simulation.
为什么循环状态空间模型(如 PlaNet)在衣物操作任务中失败?最近的研究表明,这可能是由于观察的模糊重构导致的,这使得在潜在空间中直接计划变得困难。本论文通过在挑选和放置衣物平移领域的 PlaNet 应用来探索这个问题的原因。我们发现,文章轮廓上的导数函数的尖锐中断使学习准确的潜在动态模型变得困难。通过在训练损失中采用KL平衡和潜在过度估计,并将计划选取位置调整至衣物最接近的部分,我们表明,更新的 PlaNet-挑选模型可以使用潜在 MPC 算法在模拟中实现最先进的性能。
https://arxiv.org/abs/2303.01345
Formula-driven supervised learning (FDSL) has been shown to be an effective method for pre-training vision transformers, where ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k. These studies also indicate that contours mattered more than textures when pre-training vision transformers. However, the lack of a systematic investigation as to why these contour-oriented synthetic datasets can achieve the same accuracy as real datasets leaves much room for skepticism. In the present work, we develop a novel methodology based on circular harmonics for systematically investigating the design space of contour-oriented synthetic datasets. This allows us to efficiently search the optimal range of FDSL parameters and maximize the variety of synthetic images in the dataset, which we found to be a critical factor. When the resulting new dataset VisualAtom-21k is used for pre-training ViT-Base, the top-1 accuracy reached 83.7% when fine-tuning on ImageNet-1k. This is close to the top-1 accuracy (84.2%) achieved by JFT-300M pre-training, while the number of images is 1/14. Unlike JFT-300M which is a static dataset, the quality of synthetic datasets will continue to improve, and the current work is a testament to this possibility. FDSL is also free of the common issues associated with real images, e.g. privacy/copyright issues, labeling costs/errors, and ethical biases.
公式驱动的监督学习(FDSL)已被证明是训练视觉转换器的有效方法,其中ExFractalDB-21k比ImageNet-21k的前训练效果更加突出。这些研究还表明,在训练视觉转换器之前,轮廓的重要性比纹理更加重要。然而,缺乏对为什么这些轮廓导向的模拟数据集可以达到与真实数据集相同的精度的系统化研究,使得许多怀疑论者仍然存在。在目前的工作中,我们基于循环谐波提出了一种新的方法,系统性地研究轮廓导向模拟数据集的设计空间。这使我们能够高效地搜索FDSL参数的最佳范围,并最大限度地增加dataset中的合成图像的多样性,我们发现这是一个重要的因素。当结果的新datasetVisualAtom-21k用于前训练ViT-Base时,在ImageNet-1k上进行微调时,top-1准确率达到83.7%。这与JFT-300M前训练时(84.2%)的top-1准确率相当,而图像数量仅为1/14。与静态数据集JFT-300M不同,合成数据集的质量将继续提高,当前工作是对这种可能性的证明。FDSL也摆脱了与真实图像相关的常见问题,例如隐私/版权问题、标签费用/错误和伦理偏见。
https://arxiv.org/abs/2303.01112
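The FDSL abstract above builds contour-oriented synthetic images from circular harmonics. One way to picture this is a closed contour whose radius is a base circle perturbed by sinusoids in the polar angle. The sketch below assumes that parameterization; `harmonic_contour` and the `(frequency, amplitude, phase)` tuples are hypothetical names, and the exact VisualAtom generation formula is not given in the abstract.

```python
import math


def harmonic_contour(base_radius, harmonics, num_points=360):
    """Sample a closed contour r(theta) = base_radius + sum(a * sin(k*theta + phi)).

    harmonics: list of (frequency k, amplitude a, phase phi) tuples (assumed format).
    Returns a list of (x, y) points ordered by increasing angle.
    """
    points = []
    for i in range(num_points):
        theta = 2.0 * math.pi * i / num_points
        r = base_radius + sum(
            a * math.sin(k * theta + phi) for k, a, phi in harmonics
        )
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points


# With no harmonics the contour is a plain circle; adding one term gives a lobed shape.
circle = harmonic_contour(1.0, [], num_points=8)
trefoil = harmonic_contour(1.0, [(3, 0.2, 0.0)], num_points=360)
```

Varying the frequencies, amplitudes, and phases is what would let such a generator span a wide variety of contours, which the abstract identifies as a critical factor.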
Continuum robots are promising candidates for interactive tasks in various applications due to their unique shape, compliance, and miniaturization capability. Accurate and real-time shape sensing is essential for such tasks yet remains a challenge. Embedded shape sensing has high hardware complexity and cost, while vision-based methods require stereo setup and struggle to achieve real-time performance. This paper proposes the first eye-to-hand monocular approach to continuum robot shape sensing. Utilizing a deep encoder-decoder network, our method, MoSSNet, eliminates the computation cost of stereo matching and reduces requirements on sensing hardware. In particular, MoSSNet comprises an encoder and three parallel decoders to uncover spatial, length, and contour information from a single RGB image, and then obtains the 3D shape through curve fitting. A two-segment tendon-driven continuum robot is used for data collection and testing, demonstrating accurate (mean shape error of 0.91 mm, or 0.36% of robot length) and real-time (70 fps) shape sensing on real-world data. Additionally, the method is optimized end-to-end and does not require fiducial markers, manual segmentation, or camera calibration. Code and datasets will be made available at this https URL.
连续性机器人在各种应用中作为交互任务的理想候选者,因为它们独特的形状、合规性和小型化能力。准确实时的形状感知对于此类任务至关重要,但仍然是一个挑战。嵌入的形状感知具有高硬件复杂性和成本,而视觉方法需要立体设置并努力实现实时性能。本文提出了第一个从眼睛到手部的单向连续性机器人形状感知方法。利用深度编码器和解码网络,我们的方法MoSSNet消除了立体匹配的计算成本,并减少了感知硬件的要求。特别是,MoSSNet由编码器和三个并行解码器组成,从单个RGB图像中揭露空间、长度和轮廓信息,然后通过曲线 fitting获取3D形状。使用两个段的神经驱动连续性机器人用于数据收集和测试,在现实世界数据上展示了准确的(均值形状误差为0.91毫米,或机器人长度的0.36%)实时形状感知(70帧每秒)。此外,该方法实现了端到端优化,不需要标志点、手动分割或相机校准。代码和数据集将在此httpsURL上提供。
https://arxiv.org/abs/2303.00891
Affordances are a fundamental concept in robotics since they relate available actions for an agent depending on its sensory-motor capabilities and the environment. We present a novel Bayesian deep network to detect affordances in images while quantifying the distribution of the aleatoric and epistemic variance at the spatial level. We adapt the Mask-RCNN architecture to learn a probabilistic representation using Monte Carlo dropout. Our results outperform state-of-the-art deterministic networks. We attribute this improvement to a better probabilistic feature space representation in the encoder and to the Bayesian variability induced at mask generation, which adapts better to the object contours. We also introduce the new Probability-based Mask Quality measure, which reveals the semantic and spatial differences in a probabilistic instance segmentation model. We modify the existing Probabilistic Detection Quality metric by comparing the binary masks rather than the predicted bounding boxes, achieving a finer-grained evaluation of the probabilistic segmentation. We find aleatoric variance in the contours of the objects due to camera noise, while epistemic variance appears in visually challenging pixels.
行为是机器人学中一个基本的概念,因为行为取决于机器人感知和运动能力以及环境。我们提出了一种新的贝叶斯深度学习网络,用于在图像中检测行为,同时我们也量化了空间级别的 aleatoric 和 epistemic 差异的分布。我们采用了 Mask-RCNN 架构,使用蒙特卡罗 dropout 来学习一个概率表示。我们的结果显示比确定性网络更好。这得益于编码器中更好的概率特征空间表示,以及在 mask 生成中引入的贝叶斯变化,这些适应对象轮廓。我们还介绍了一种新的概率型 mask 质量度量,用于在概率实例分割模型中揭示语义和空间差异。我们通过比较二进制 mask 而不是预测的边界框,修改了现有的概率检测质量度量,实现了更精细的的概率分割评估。我们发现对象的轮廓中的 aleatoric 差异是由于相机噪声引起的,而Epistemic 差异出现在视觉挑战性的像素中。
https://arxiv.org/abs/2303.00871
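The affordance abstract above separates aleatoric from epistemic variance using Monte Carlo dropout. A minimal per-pixel sketch follows, assuming T stochastic forward passes that each return a foreground probability and a commonly used decomposition for Bernoulli outputs (mean of p(1-p) as aleatoric, variance of p across passes as epistemic). The abstract does not state which decomposition the authors use, so this choice is an assumption.

```python
def decompose_uncertainty(samples):
    """Split predictive variance for one pixel into aleatoric and epistemic parts.

    samples: probabilities from T dropout-enabled forward passes (assumed input).
    Returns (mean probability, aleatoric variance, epistemic variance).
    """
    t = len(samples)
    mean = sum(samples) / t
    # Epistemic: how much the passes disagree with each other.
    epistemic = sum((p - mean) ** 2 for p in samples) / t
    # Aleatoric: average Bernoulli noise within each pass.
    aleatoric = sum(p * (1.0 - p) for p in samples) / t
    return mean, aleatoric, epistemic


mean, aleatoric, epistemic = decompose_uncertainty([0.2, 0.4])
```

Under this decomposition the two parts add up to mean * (1 - mean), the total variance of the averaged Bernoulli prediction, which gives a quick sanity check on an implementation.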
Modern object detectors are vulnerable to adversarial examples, which may bring risks to real-world applications. The sparse attack is an important task which, compared with the popular adversarial perturbation on the whole image, needs to select the potential pixels, a choice that is generally regularized by an $\ell_0$-norm constraint, and simultaneously optimize the corresponding texture. The non-differentiability of the $\ell_0$ norm brings challenges, and many works on attacking object detection have adopted manually-designed patterns to address them, which are meaningless and independent of objects, and therefore lead to relatively poor attack performance. In this paper, we propose Adversarial Semantic Contour (ASC), an MAP estimate of a Bayesian formulation of sparse attack with a deceived prior of object contour. The object contour prior effectively reduces the search space of pixel selection and improves the attack by introducing more semantic bias. Extensive experiments demonstrate that ASC can corrupt the prediction of 9 modern detectors with different architectures (e.g., one-stage, two-stage and Transformer) by modifying fewer than 5\% of the pixels of the object area in COCO in the white-box scenario and around 10\% of those in the black-box scenario. We further extend the attack to datasets for autonomous driving systems to verify the effectiveness. We conclude with cautions about contour being the common weakness of object detectors with various architectures and the care needed in applying them in safety-sensitive scenarios.
现代物体检测器对对抗样本具有脆弱性,这可能会对实际应用程序带来风险。稀疏攻击是一项重要的任务,相比整个图像的对抗扰动,需要选择通常通过 $ell_0$ 正则化约束 Regularized 的潜在像素,同时优化相应的纹理。$ell_0$ 正则化的不连续性带来挑战,许多攻击物体检测的工作采用了手动设计的模式来解决这些问题,这些模式没有意义且与物体独立,因此导致攻击性能相对较差。在本文中,我们提出了对抗语义轮廓(ASC),它是一种 MAP 估计的 Bayesian 框架中的稀疏攻击的贝叶斯估计。物体轮廓先验有效地减少了像素选择搜索空间,并引入了更多的语义偏见,改善了攻击。广泛实验表明, ASC 可以损坏不同架构的现代物体检测器的预测(例如一阶段、二阶段和Transformer)。我们还将攻击扩展到自动驾驶系统的dataset 以验证效果。我们的结论是,轮廓是各种架构物体检测器的常见弱点,在安全性敏感的场景中需要采取谨慎的方法应用它们。
https://arxiv.org/abs/2303.00284
Collective decision-making is an essential capability of large-scale multi-robot systems to establish autonomy on the swarm level. A large portion of literature on collective decision-making in swarm robotics focuses on discrete decisions selecting from a limited number of options. Here we assign a decentralized robot system with the task of exploring an unbounded environment, finding consensus on the mean of a measurable environmental feature, and aggregating at areas where that value is measured (e.g., a contour line). A unique quality of this task is a causal loop between the robots' dynamic network topology and their decision-making. For example, the network's mean node degree influences time to convergence while the currently agreed-on mean value influences the swarm's aggregation location, hence, also the network structure as well as the precision error. We propose a control algorithm and study it in real-world robot swarm experiments in different environments. We show that our approach is effective and achieves higher precision than a control experiment. We anticipate applications, for example, in containing pollution with surface vehicles.
集体决策是大规模多机器人系统建立群体自治的关键能力。在群体机器人学中,大量的文献都集中在从有限选项中选择离散决策的问题。在这里,我们委托一个分散化的机器人系统去探索一个不受限制的环境,找到可测量环境特征的平均数,并在这些值被测量的地方进行聚合(例如,地形线)。这个任务的独特的特征是机器人的动态网络拓扑与其决策之间的因果关系循环。例如,网络的平均节点度数会影响收敛的时间,而当前商定的平均值会影响群体聚合的位置,因此也影响网络结构和精度误差。我们提出了一个控制算法,并在不同的环境中进行了实际机器人群体实验。我们表明,我们的方法和控制实验相比更有效,精度更高。我们预计这种方法可以应用于例如,与陆上车辆排放有关的污染物控制。
https://arxiv.org/abs/2302.13629
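The swarm abstract above has robots reach consensus on the mean of a measured environmental feature. The abstract does not describe its control algorithm, so the sketch below shows only the standard distributed averaging update on a fixed communication graph, as an assumed illustration of mean consensus; the step size, the graph, and the function names are hypothetical, and the dynamic topology discussed in the abstract is not modeled.

```python
def consensus_step(values, neighbors, step=0.25):
    """One synchronous update: each robot moves toward its neighbors' estimates."""
    return [
        v + step * sum(values[j] - v for j in neighbors[i])
        for i, v in enumerate(values)
    ]


def run_consensus(values, neighbors, steps=200, step=0.25):
    """Repeat the averaging update for a fixed number of rounds."""
    for _ in range(steps):
        values = consensus_step(values, neighbors, step)
    return values


# Four robots in a line (assumed topology), each starting from its own measurement.
line_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
estimates = run_consensus([0.0, 4.0, 8.0, 12.0], line_graph)
```

With symmetric links the update preserves the average of the estimates, and a better-connected graph (higher mean node degree) converges in fewer rounds, which mirrors the dependence on network structure noted in the abstract. The step size must stay below the reciprocal of the largest node degree for the iteration to remain stable.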
In this work, we propose an adversarial attack-based data augmentation method to improve the deep-learning-based segmentation algorithm for the delineation of Organs-At-Risk (OAR) in abdominal Computed Tomography (CT) to facilitate radiation therapy. We introduce Adversarial Feature Attack for Medical Image (AFA-MI) augmentation, which forces the segmentation network to learn out-of-distribution statistics and improve generalization and robustness to noises. AFA-MI augmentation consists of three steps: 1) generate adversarial noises by Fast Gradient Sign Method (FGSM) on the intermediate features of the segmentation network's encoder; 2) inject the generated adversarial noises into the network, intentionally compromising performance; 3) optimize the network with both clean and adversarial features. Experiments are conducted segmenting the heart, left and right kidney, liver, left and right lung, spinal cord, and stomach. We first evaluate the AFA-MI augmentation using nnUnet and TT-Vnet on the test data from a public abdominal dataset and an institutional dataset. In addition, we validate how AFA-MI affects the networks' robustness to the noisy data by evaluating the networks with added Gaussian noises of varying magnitudes to the institutional dataset. Network performance is quantitatively evaluated using Dice Similarity Coefficient (DSC) for volume-based accuracy. Also, Hausdorff Distance (HD) is applied for surface-based accuracy. On the public dataset, nnUnet with AFA-MI achieves DSC = 0.85 and HD = 6.16 millimeters (mm); and TT-Vnet achieves DSC = 0.86 and HD = 5.62 mm. AFA-MI augmentation further improves all contour accuracies up to 0.217 DSC score when tested on images with Gaussian noises. AFA-MI augmentation is therefore demonstrated to improve segmentation performance and robustness in CT multi-organ segmentation.
在本文中,我们提出了一种基于对抗攻击的数据增强方法,以改进深度学习为基础的分割算法,用于在腹部CT扫描中绘制危险器官(OAR),以方便放疗。我们介绍了医学图像的对抗特征攻击(AFA-MI)增强技术,该技术迫使分割网络学习分布外的统计信息,并提高对噪声的泛化和鲁棒性。AFA-MI增强技术包括三个步骤:1)使用快速梯度符号方法(FGSM)在分割网络编码器的中级特征上生成对抗噪声;2)将生成的对抗噪声注入网络,故意降低性能;3)优化既有干净特征和对抗特征的网络。实验是将心脏、左右肾脏、肝脏、左右肺部、脊髓和胃部划分为子集进行测试。我们首先使用nnUnet和TT-Vnet从公共腹部数据集和机构数据集中测试数据进行AFA-MI增强的评估。此外,我们验证AFA-MI如何影响机构数据集上的噪声数据的鲁棒性,评估添加不同大小Gaussian噪声的网络。网络性能使用体积精度(DSC)进行定量评估,并使用高斯距离(HD)进行表面精度评估。在公共数据集上,nnUnet和AFA-MI增强的评估结果为DSC=0.85和HD=6.16毫米;TT-Vnet的评估结果为DSC=0.86和HD=5.62毫米。在带有Gaussian噪声的图像上的测试中,AFA-MI增强进一步提高了所有轮廓精度,达到0.217的DSC得分。因此,我们证明了AFA-MI增强技术可以提高CT多器官分割中的分割性能和鲁棒性。
https://arxiv.org/abs/2302.13172
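Step 1 of the AFA-MI procedure in the abstract above generates adversarial noise with the Fast Gradient Sign Method on intermediate encoder features. The core FGSM update is a single signed-gradient step, sketched below on a flat list of feature values; the function names and the toy gradient values are assumptions, and in practice the gradients would come from backpropagating the segmentation loss through the network.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def fgsm_perturb(features, grads, epsilon):
    """FGSM step: move each feature by epsilon in the direction that raises the loss.

    features: feature values (assumed flattened for illustration).
    grads: gradient of the loss with respect to each feature.
    """
    return [f + epsilon * sign(g) for f, g in zip(features, grads)]


# Toy example: positive gradient pushes up, negative pushes down, zero leaves unchanged.
adversarial = fgsm_perturb([1.0, 2.0, 3.0], [0.5, -2.0, 0.0], epsilon=0.5)
```

Training on both the clean features and the perturbed ones, as in step 3 of the abstract, is what exposes the network to out-of-distribution statistics.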
Purpose: Thoracic radiographs are commonly used to evaluate patients with confirmed or suspected thoracic pathology. Proper patient positioning is more challenging in canine and feline radiography than in humans due to less patient cooperation and body shape variation. Improper patient positioning during radiograph acquisition has the potential to lead to a misdiagnosis. Asymmetrical hemithoraces are one of the indications of obliquity for which we propose an automatic classification method. Approach: We propose a hemithoraces segmentation method based on Convolutional Neural Networks (CNNs) and active contours. We utilized the U-Net model to segment the ribs and spine and then utilized active contours to find left and right hemithoraces. We then extracted features from the left and right hemithoraces to train an ensemble classifier which includes Support Vector Machine, Gradient Boosting and Multi-Layer Perceptron. Five-fold cross-validation was used, thorax segmentation was evaluated by Intersection over Union (IoU), and symmetry classification was evaluated using Precision, Recall, Area under Curve and F1 score. Results: Symmetry classification on 900 radiographs yielded an F1 score of 82.8%. To test the robustness of the proposed thorax segmentation method to underexposure and overexposure, we synthetically corrupted properly exposed radiographs and evaluated results using IoU. The results showed that the model's IoU for underexposure and overexposure dropped by 2.1% and 1.2%, respectively. Conclusions: Our results indicate that the proposed thorax segmentation method is robust to poorly exposed radiographs. The proposed thorax segmentation method can be applied to human radiography with minimal changes.
目的: 常用的方法是评估确认或怀疑Thoracic Pathology 的患者。在犬和小狐狸的X射线片中,正确的患者位置比人类更困难,因为患者合作程度不足和身体形状变化。在X射线采集期间不正确的患者位置有可能导致错误的诊断。斜面分割是斜面突出物分类的一个适应症,我们提出了一种自动分类方法来对其进行分类。方法:我们提出了基于卷积神经网络(CNNs)和主动轮廓的斜面分割方法。我们使用U-Net模型分割肋骨和脊柱,然后使用主动轮廓找到左和右斜面。我们然后从左和右斜面提取特征来训练一个集成分类器,其中包括支持向量机、梯度提升和多层感知器。使用五次交叉验证,对 Thoracic Segmentation 进行评估,使用Intersection over Union(IoU) 和对称分类进行评估,使用精度、召回率和曲线下面积和F1 分数进行评估。结果:对900张照片的对称分类报告F1 得分为82.8%。为了测试所提出的 Thoracic Segmentation 方法对不足曝光和过度曝光的鲁棒性,我们合成了正确的曝光照片并使用IoU 进行评估。结果表明,不足曝光和过度曝光模型的IoU分别下降了2.1%和1.2%。结论:我们的结果表明,所提出的 Thoracic Segmentation 方法对较差曝光的X射线片非常鲁棒。所提出的 Thoracic Segmentation 方法可以应用于人类X射线片,几乎没有变化。
https://arxiv.org/abs/2302.12923
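The evaluation metrics named in the abstract above (IoU for segmentation; Precision, Recall, and F1 for symmetry classification) can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names `iou` and `precision_recall_f1`, the nested-list mask format, and the 0/1 label encoding are assumptions made for illustration.

```python
def iou(pred, target):
    """Intersection over Union of two binary masks (nested lists of 0/1).

    Assumption: both masks have the same shape. Two empty masks are
    treated as a perfect match (IoU = 1.0).
    """
    inter = 0
    union = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter / union if union else 1.0


def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

In a five-fold setup, one would typically compute these per fold and average them; the abstract does not say how the folds were aggregated.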
Nuclei classification provides valuable information for histopathology image analysis. However, the large variations in the appearance of different nuclei types make nuclei difficult to identify. Most neural-network-based methods are limited by the local receptive field of convolutions and pay less attention to the spatial distribution of nuclei or the irregular contour shape of a nucleus. In this paper, we first propose a novel polygon-structure feature learning mechanism that transforms a nucleus contour into a sequence of points sampled in order, and employ a recurrent neural network that aggregates the sequential change in distance between key points to obtain learnable shape features. Next, we convert a histopathology image into a graph structure with nuclei as nodes, and build a graph neural network to embed the spatial distribution of nuclei into their representations. To capture the correlations between the categories of nuclei and their surrounding tissue patterns, we further introduce edge features that are defined as the background textures between adjacent nuclei. Lastly, we integrate both polygon and graph structure learning mechanisms into a single framework that can extract intra- and inter-nucleus structural characteristics for nuclei classification. Experimental results show that the proposed framework achieves significant improvements over state-of-the-art methods.
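Nuclei classification provides valuable information for the analysis of histopathology images. However, large differences in the appearance of different nucleus types make nuclei hard to identify. Most neural-network-based methods are constrained by the local receptive field of convolutions and pay little attention to the spatial distribution of nuclei or to their irregular contour shapes. In this paper, we first propose a novel polygon-structure feature learning mechanism that converts a nucleus contour into a sequence of points sampled in order, and use a recurrent neural network to aggregate the sequential change in distance between key points into learnable shape features. Next, we convert a histopathology image into a graph structure with nuclei as nodes, and build a graph neural network that embeds the spatial distribution of nuclei into their representations. To capture the correlation between nucleus categories and the surrounding tissue patterns, we further introduce edge features, defined as the background textures between adjacent nuclei. Finally, we integrate the polygon and graph structure learning mechanisms into one framework that extracts intra- and inter-nucleus structural features for nuclei classification. Experimental results show that the proposed framework achieves significant improvements over the current best methods.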
https://arxiv.org/abs/2302.11416
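The polygon-structure idea in the abstract above (a contour turned into an ordered point sequence, then a sequence of distance changes fed to a recurrent network) can be sketched as follows. This is only a rough illustration under stated assumptions: the abstract does not say how points are sampled or which distance is used, so equal arc-length sampling and distance to the centroid are choices made here, and the function names are hypothetical. The recurrent network itself is omitted.

```python
import math


def resample_closed_contour(points, k):
    """Sample k points at equal arc-length steps along a closed polygon, in order.

    Assumption: `points` lists the polygon vertices in traversal order.
    """
    n = len(points)
    # Edge lengths, including the closing edge from the last vertex to the first.
    seg = [math.dist(points[i], points[(i + 1) % n]) for i in range(n)]
    step = sum(seg) / k
    samples = []
    edge, start = 0, 0.0  # current edge index and the arc length at its start
    for j in range(k):
        target = j * step
        # Advance to the edge that contains the target arc length.
        while edge < n - 1 and start + seg[edge] <= target:
            start += seg[edge]
            edge += 1
        t = (target - start) / seg[edge] if seg[edge] > 0 else 0.0
        (x0, y0), (x1, y1) = points[edge], points[(edge + 1) % n]
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples


def distance_change_sequence(samples):
    """Change in centroid distance between consecutive samples (cyclic).

    Assumption: "distance" means distance to the centroid of the samples.
    """
    cx = sum(x for x, _ in samples) / len(samples)
    cy = sum(y for _, y in samples) / len(samples)
    d = [math.dist(p, (cx, cy)) for p in samples]
    return [d[(i + 1) % len(d)] - d[i] for i in range(len(d))]
```

The resulting list has one value per sampled point and could serve as the input sequence to a recurrent model; a rounder nucleus yields values near zero, while an irregular one yields larger swings.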
Deep learning models benefit from training with a large dataset (labeled or unlabeled). Following this motivation, we present an approach to learn a deep learning model for the automatic segmentation of Organs at Risk (OARs) in cervical cancer radiation treatment from a large clinically available dataset of Computed Tomography (CT) scans containing data inhomogeneity, label noise, and missing annotations. We employ simple heuristics for automatic data cleaning to minimize data inhomogeneity and label noise. Further, we develop a semi-supervised learning approach utilizing a teacher-student setup, annotation imputation, and uncertainty-guided training to learn in the presence of missing annotations. Our experimental results show that learning from a large dataset with our approach yields a significant improvement in test performance despite missing annotations in the data. Further, the contours generated from the segmentation masks predicted by our model are found to be as clinically acceptable as manually generated contours.
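Deep learning models benefit from training on large datasets (labeled or unlabeled). Motivated by this, we propose a method for learning a deep learning model that automatically segments Organs at Risk (OARs) in cervical cancer radiotherapy. The model is learned from a large, clinically available dataset of CT scans that contains data inhomogeneity, label noise, and missing annotations. We apply simple heuristic data cleaning to minimize data inhomogeneity and label noise. In addition, we develop a semi-supervised learning method that uses a teacher-student architecture, annotation imputation, and uncertainty-guided training to learn when annotations are missing. Our experimental results show that learning from a large dataset with our method significantly improves test performance despite the missing annotations. Furthermore, the contours generated from the segmentation masks predicted by our model are as clinically acceptable as manually generated contours.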
https://arxiv.org/abs/2302.10661
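The combination of annotation imputation and uncertainty-guided training mentioned in the abstract above might look roughly like the sketch below. It is an assumption-laden illustration, not the authors' method: the abstract gives no formula, so the 0.5 threshold, the entropy-based weight, the flat per-voxel lists, and the organ names in the usage example are all hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli(p) prediction; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def impute_annotations(manual, teacher_probs, threshold=0.5):
    """Fill missing organ annotations with teacher pseudo-labels.

    manual:        dict organ -> list of 0/1 labels, or None if not annotated.
    teacher_probs: dict organ -> list of teacher foreground probabilities.
    Returns dict organ -> (labels, weights). Manual labels get weight 1.0;
    pseudo-labels get weight 1 - entropy, so uncertain voxels count less
    in the student's loss (an assumed weighting scheme).
    """
    result = {}
    for organ, probs in teacher_probs.items():
        labels = manual.get(organ)
        if labels is not None:
            result[organ] = (list(labels), [1.0] * len(labels))
        else:
            pseudo = [1 if p >= threshold else 0 for p in probs]
            weights = [1.0 - binary_entropy(p) for p in probs]
            result[organ] = (pseudo, weights)
    return result
```

For example, with `manual = {"bladder": [1, 0], "rectum": None}` the bladder labels pass through unchanged, while the rectum labels are taken from the teacher and down-weighted wherever its probability is close to 0.5.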
Spacecraft pose estimation plays a vital role in many on-orbit space missions, such as rendezvous and docking, debris removal, and on-orbit maintenance. Because space images exhibit widely varying lighting conditions, high contrast, and low resolution, pose estimation of space objects is more challenging than that of objects on Earth. In this paper, we analyze the radar image characteristics of on-orbit spacecraft and then propose a new deep neural network structure named Dense Residual U-shaped Network (DR-U-Net) to extract image features. We further introduce a novel neural network based on DR-U-Net, namely Spacecraft U-shaped Network (SU-Net), to achieve end-to-end pose estimation for non-cooperative spacecraft. Specifically, SU-Net first preprocesses the images of the non-cooperative spacecraft, and transfer learning is then used for pre-training. Subsequently, to address radar image blur and poor recognition of spacecraft contours, we add residual and dense connections to the U-Net backbone and name the result DR-U-Net. In this way, feature loss and model complexity are reduced, and the degradation of the deep neural network during training is avoided. Finally, a feedforward neural network layer is used for pose estimation of on-orbit non-cooperative spacecraft. Experiments show that the proposed method does not rely on hand-crafted object-specific features, that the model is robust, and that its accuracy outperforms state-of-the-art pose estimation methods. The absolute error ranges from 0.1557 to 0.4491, the mean error is about 0.302, and the standard deviation is about 0.065.
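Spacecraft pose estimation plays a key role in many on-orbit space missions, such as rendezvous and docking, debris removal, and on-orbit maintenance. Space images have widely varying lighting conditions, high contrast, and low resolution, so estimating the pose of space objects is harder than for objects on Earth. In this paper, we analyze the radar image characteristics of on-orbit spacecraft and then propose a new deep learning neural network structure named Dense Residual U-shaped Network (DR-U-Net) to extract image features. We also introduce a new neural network based on DR-U-Net, named Spacecraft U-shaped Network (SU-Net), to perform pose estimation for non-cooperative spacecraft. Specifically, SU-Net first preprocesses images of the non-cooperative spacecraft and then uses transfer learning for pre-training. Next, to address radar image blur and weak recognition of spacecraft contours, we add residual connections and dense connections to the U-Net backbone and call the result DR-U-Net. In this way, feature loss and model complexity are reduced, and degradation of the deep network during training is avoided. Finally, a feedforward neural network layer is used for pose estimation of on-orbit non-cooperative spacecraft. Experiments show that the method does not depend on hand-crafted object-specific features, that the model is robust, and that its accuracy exceeds that of state-of-the-art pose estimation methods. The absolute error ranges from 0.1557 to 0.4491, the mean error is about 0.302, and the standard deviation is about 0.065.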
https://arxiv.org/abs/2302.10602
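The two connection types that the abstract above adds to the U-Net backbone can be shown in a toy sketch on plain feature lists. It only illustrates the general ideas (a residual connection adds a block's input to its output; a dense connection feeds each layer the concatenation of all earlier features). The actual DR-U-Net layout is not described in the abstract, and the function names and list-based "layers" here are assumptions.

```python
def residual_block(x, f):
    """Residual connection: output = x + f(x), element-wise.

    Assumption: f returns a list with the same length as x.
    """
    return [xi + fi for xi, fi in zip(x, f(x))]


def dense_block(x, layers):
    """Dense connection: each layer sees all previous features, concatenated.

    Every layer maps the running feature list to a list of new features,
    which are appended so that later layers can reuse them.
    """
    features = list(x)
    for layer in layers:
        features = features + layer(features)
    return features
```

With `f = lambda v: [2 * a for a in v]`, `residual_block([1, 2], f)` keeps the input signal and adds the transformed one. In a real network the lists would be tensors and the callables convolutional layers.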