Generative large language models have shown impressive in-context learning abilities, performing well across a wide range of tasks from a single prompt. Previous melody-to-lyric research has been limited by scarce high-quality aligned data and unclear standards for creativity. Most efforts have focused on general themes or emotions, which are of limited value given current language model capabilities. In tonal languages like Mandarin, pitch contours are influenced by both melody and tone, leading to variations in lyric-melody fit. Our study, validated on the Mpop600 dataset, confirms that lyricists and melody writers take this fit into account during composition. In this research, we developed a multi-agent system that decomposes the melody-to-lyric task into sub-tasks, with individual agents controlling rhyme, syllable count, lyric-melody alignment, and consistency. Listening tests were conducted via a diffusion-based singing voice synthesizer to evaluate the quality of lyrics generated by different agent groups.
生成式大型语言模型展现出令人印象深刻的上下文学习能力,仅凭一个提示就能在各种任务中表现出色。以往的旋律到歌词研究受限于高质量对齐数据的稀缺以及创造性标准的模糊。大多数工作集中在一般主题或情感上,而鉴于当前语言模型的能力,这类工作的价值有限。在普通话这类声调语言中,音高轮廓同时受旋律和声调的影响,导致词曲匹配度的差异。我们基于 Mpop600 数据集验证的研究证实,词作者和曲作者在创作过程中会考虑这种匹配。在本研究中,我们开发了一个多智能体系统,将旋律到歌词任务分解为若干子任务,由各个智能体分别控制押韵、音节数、词曲对齐和一致性。我们通过基于扩散模型的歌声合成器进行听测,以评估不同智能体组合生成的歌词质量。
https://arxiv.org/abs/2410.01450
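As a minimal illustration of the kind of hard constraint such an agent could enforce, the sketch below checks syllable counts under the assumption that one Chinese character corresponds to one sung syllable; the function names are hypothetical and not taken from the paper.

```python
def syllable_count_ok(lyric_line: str, n_melody_notes: int) -> bool:
    # In Mandarin, each character carries exactly one sung syllable,
    # so a line fits the melody only if its length equals the note count.
    return len(lyric_line) == n_melody_notes

def filter_candidates(candidates: list[str], n_melody_notes: int) -> list[str]:
    # A syllable-count agent could discard candidate lines that cannot be sung.
    return [c for c in candidates if syllable_count_ok(c, n_melody_notes)]
```

Rhyme and lyric-melody alignment checks would filter the surviving candidates in the same pipeline fashion.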
In this paper we propose a word-wise intonation model for the Russian language and show how it can be generalized to other languages. The proposed model is suitable for automatic data markup and can be extended to text-to-speech systems. It can also be used for intonation contour modeling, either with rule-based algorithms or by predicting contours with language models. The key idea is a partial elimination of the variability arising from different placements of the stressed syllable within a word. This is achieved by simultaneously applying pitch simplification and dynamic time warping clustering. The proposed model can serve as a tool for intonation research or as a backbone for prosody description in text-to-speech systems. As advantages of the model, we show its relations to existing intonation systems as well as the possibility of using language models for prosody prediction. Finally, we demonstrate practical evidence of the system's robustness to parameter variations.
在本文中,我们提出了一个面向俄语的逐词语调模型,并展示了如何将其推广到其他语言。所提出的模型适用于自动数据标注,并可进一步应用于文本转语音系统。它还可以通过基于规则的算法或用语言模型预测轮廓来实现语调轮廓建模。其核心思想是部分消除因重读音节在词中位置不同而产生的变异性,这通过同时应用音高简化与动态时间规整(DTW)聚类来实现。该模型既可以作为语调研究的工具,也可以作为文本转语音系统中韵律描述的基础。作为模型的优势,我们展示了它与现有语调体系的联系,以及使用语言模型进行韵律预测的可能性。最后,我们给出了一些实践证据,表明该系统对参数变化具有鲁棒性。
https://arxiv.org/abs/2409.20374
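The clustering step relies on a pairwise DTW distance between pitch contours; a textbook dynamic-programming version, plus a crude level-quantization as a stand-in for the paper's pitch simplification (both are illustrative assumptions, not the authors' code):

```python
import numpy as np

def dtw_distance(a, b):
    # Classic dynamic-programming DTW cost between two 1-D pitch contours;
    # this is the pairwise distance a DTW-based clustering would use.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def simplify_pitch(contour, n_levels=3):
    # Crude pitch simplification: quantize the contour to a few discrete
    # levels, reducing variability before clustering (assumed preprocessing).
    c = np.asarray(contour, float)
    lo, hi = c.min(), c.max()
    if hi == lo:
        return np.zeros_like(c, dtype=int)
    return np.round((c - lo) / (hi - lo) * (n_levels - 1)).astype(int)
```

Simplifying before DTW is what lets contours of words with differently placed stressed syllables fall into the same cluster.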
The precision of contouring target structures and organs-at-risk (OAR) in radiotherapy planning is crucial for ensuring treatment efficacy and patient safety. Recent advancements in deep learning (DL) have significantly improved OAR contouring performance, yet the reliability of these models, especially in the presence of out-of-distribution (OOD) scenarios, remains a concern in clinical settings. This application study explores the integration of epistemic uncertainty estimation within the OAR contouring workflow to enable OOD detection in clinically relevant scenarios, using specifically compiled data. Furthermore, we introduce an advanced statistical method for OOD detection to enhance the methodological framework of uncertainty estimation. Our empirical evaluation demonstrates that epistemic uncertainty estimation is effective in identifying instances where model predictions are unreliable and may require an expert review. Notably, our approach achieves an AUC-ROC of 0.95 for OOD detection, with a specificity of 0.95 and a sensitivity of 0.92 for implant cases, underscoring its efficacy. This study addresses significant gaps in the current research landscape, such as the lack of ground truth for uncertainty estimation and limited empirical evaluations. Additionally, it provides a clinically relevant application of epistemic uncertainty estimation in an FDA-approved and widely used clinical solution for OAR segmentation from Varian, a Siemens Healthineers company, highlighting its practical benefits.
在放射治疗计划中,精确勾画靶区结构和危及器官(OAR)对确保治疗效果和患者安全至关重要。近年来,深度学习(DL)的进展显著提升了 OAR 勾画性能,然而这些模型的可靠性,尤其是在分布外(OOD)场景下,在临床环境中仍然令人担忧。本应用研究探讨了将认知不确定性(epistemic uncertainty)估计整合到 OAR 勾画工作流程中,利用专门整理的数据,在临床相关场景中实现 OOD 检测。此外,我们引入了一种先进的 OOD 检测统计方法,以增强不确定性估计的方法论框架。我们的实证评估表明,认知不确定性估计能够有效识别模型预测不可靠、可能需要专家审查的实例。值得注意的是,我们的方法在 OOD 检测上取得了 0.95 的 AUC-ROC,在植入物病例上特异性为 0.95、灵敏度为 0.92,凸显了其有效性。本研究弥补了当前研究中的重要空白,例如不确定性估计缺乏真值以及实证评估有限。此外,它在 Varian(西门子医疗旗下公司)一款经 FDA 批准且广泛使用的 OAR 分割临床解决方案中提供了认知不确定性估计的临床相关应用,突显了其实际价值。
https://arxiv.org/abs/2409.18628
While neural networks have made significant strides in many AI tasks, they remain vulnerable to a range of noise types, including natural corruptions, adversarial noise, and low-resolution artifacts. Many existing approaches focus on enhancing robustness against specific noise types, limiting their adaptability to others. Previous studies have addressed general robustness by adopting a spectral perspective, which tends to blur crucial features like texture and object contours. Our proposed solution, however, introduces an inverse scale variational sparsification framework within a time-continuous inverse scale space formulation. This framework progressively learns finer-scale features by discerning variational differences between pixels, ultimately preserving only large-scale features in the smoothed image. Unlike frequency-based methods, our approach not only removes noise by smoothing small-scale features where corruptions often occur but also retains high-contrast details such as textures and object contours. Moreover, our framework offers simplicity and efficiency in implementation. By integrating this algorithm into neural network training, we guide the model to prioritize learning large-scale features. We show the efficacy of our approach through enhanced robustness against various noise types.
虽然神经网络已经在许多 AI 任务中取得了显著进展,但它们仍然容易受到各种噪声的影响,包括自然损坏、对抗噪声和低分辨率伪影。许多现有方法侧重于增强对特定噪声类型的鲁棒性,限制了其对其他噪声的适应性。以前的研究通过采用频谱视角来解决通用鲁棒性问题,但这种视角往往会模糊纹理和物体轮廓等关键特征。而我们提出的解决方案在时间连续的逆尺度空间形式下引入了逆尺度变分稀疏化框架。该框架通过辨别像素之间的变分差异来逐步学习更细尺度的特征,最终在平滑图像中只保留大尺度特征。与基于频率的方法不同,我们的方法不仅通过平滑损坏常常出现的小尺度特征来去除噪声,还保留了纹理和物体轮廓等高对比度细节。此外,我们的框架实现简单高效。通过将该算法集成到神经网络训练中,我们引导模型优先学习大尺度特征。我们通过对各种噪声类型增强的鲁棒性展示了方法的有效性。
https://arxiv.org/abs/2409.18419
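The idea of smoothing away small-scale variations while keeping large-scale structure can be sketched with a plain total-variation (TV) smoothing step; this crude gradient-descent stand-in is an assumption for illustration only and is not the paper's inverse scale space flow.

```python
import numpy as np

def tv_smooth(img, lam=0.2, step=0.2, n_iter=50):
    # Crude gradient descent on 0.5*||u - img||^2 + lam * TV(u):
    # small-scale (noisy) variations are smoothed away first, while
    # large-scale structure persists in the smoothed image u.
    u = img.astype(float).copy()
    for _ in range(n_iter):
        gx = np.diff(u, axis=1, append=u[:, -1:])          # forward x-gradient
        gy = np.diff(u, axis=0, append=u[-1:, :])          # forward y-gradient
        mag = np.sqrt(gx ** 2 + gy ** 2) + 1e-8
        px, py = gx / mag, gy / mag                        # normalized gradient field
        div = (px - np.concatenate([px[:, :1], px[:, :-1]], axis=1)
               + py - np.concatenate([py[:1, :], py[:-1, :]], axis=0))
        u -= step * ((u - img) - lam * div)                # fidelity + TV descent
    return u
```

Unlike a low-pass filter, the TV term penalizes total variation rather than frequency content, so sharp large-scale edges survive the smoothing.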
The irregular contour representation is one of the tough challenges in scene text detection. Although segmentation-based methods have achieved significant progress with the help of flexible pixel prediction, the overlap of spatially close texts hinders detecting them separately. To alleviate this problem, some shrink-based methods predict text kernels and expand them to reconstruct texts. However, the text kernel is an artificial object with incomplete semantic features that is prone to incorrect or missed detection. In addition, unlike general objects, the geometric features (aspect ratio, scale, and shape) of scene texts vary significantly, which makes it difficult to detect them accurately. To address the above problems, we propose an effective spotlight text detector (STD), which consists of a spotlight calibration module (SCM) and a multivariate information extraction module (MIEM). The former concentrates efforts on the candidate kernel, like a camera focusing on its target. It obtains candidate features through a mapping filter and calibrates them precisely to eliminate some false positive samples. The latter designs different shape schemes to explore multiple geometric features of scene texts. It helps extract various spatial relationships to improve the model's ability to recognize kernel regions. Ablation studies prove the effectiveness of the designed SCM and MIEM. Extensive experiments verify that our STD is superior to existing state-of-the-art methods on various datasets, including ICDAR2015, CTW1500, MSRA-TD500, and Total-Text.
不规则轮廓表示是场景文本检测中的一个难题。尽管基于分割的方法借助灵活的像素预测取得了显著进展,但空间上相邻文本的重叠妨碍了将它们分别检测出来。为了缓解这一问题,一些基于收缩的方法先预测文本核(text kernel),再将其扩展以重构文本。然而,文本核是一种语义特征不完整的人工对象,容易出现误检或漏检。此外,与一般物体不同,场景文本的几何特征(长宽比、尺度和形状)差异很大,难以准确检测。针对上述问题,我们提出了一种有效的聚光灯文本检测器(STD),由聚光灯校准模块(SCM)和多元信息提取模块(MIEM)组成。前者像相机对焦目标一样,将注意力集中在候选核上:它通过映射滤波器获得候选特征,并对其进行精确校准,以消除部分假阳性样本。后者设计了不同的形状方案,以挖掘场景文本的多种几何特征,帮助提取各种空间关系,提升模型识别核区域的能力。消融实验证明了所设计的 SCM 和 MIEM 的有效性。大量实验验证了我们的 STD 在 ICDAR2015、CTW1500、MSRA-TD500 和 Total-Text 等多个数据集上优于现有最先进方法。
https://arxiv.org/abs/2409.16820
This article presents a Visual Servoing Nonlinear Model Predictive Control (NMPC) scheme for autonomously tracking a moving target using multirotor Unmanned Aerial Vehicles (UAVs). The scheme is developed for surveillance and tracking of contour-based areas with evolving features. NMPC is used to manage input and state constraints, while additional barrier functions are incorporated in order to ensure system safety and optimal performance. The proposed control scheme is designed based on the extraction and implementation of the full dynamic model of the features describing the target and the state variables. Real-time simulations and experiments using a quadrotor UAV equipped with a camera demonstrate the effectiveness of the proposed strategy.
本文提出了一种视觉伺服非线性模型预测控制(NMPC)方案,用于多旋翼无人机(UAV)自主跟踪运动目标。该方案面向具有演化特征的基于轮廓区域的监视与跟踪。NMPC 用于处理输入和状态约束,同时引入额外的障碍函数以确保系统安全和最优性能。所提出的控制方案基于对描述目标的特征及状态变量的完整动态模型的提取与实现而设计。使用配备摄像头的四旋翼无人机进行的实时仿真和实验证明了所提策略的有效性。
https://arxiv.org/abs/2409.16665
This paper presents a novel technique for camera calibration using a single view that incorporates a spherical mirror. Leveraging the distinct characteristics of the sphere's contour visible in the image and its reflections, we showcase the effectiveness of our method in achieving precise calibration. Furthermore, the reflection from the mirrored surface provides additional information about the surrounding scene beyond the image frame. Our method paves the way for the development of simple catadioptric stereo systems. We explore the challenges and opportunities associated with employing a single mirrored sphere, highlighting the potential applications of this setup in practical scenarios. The paper delves into the intricacies of the geometry and calibration procedures involved in catadioptric stereo utilizing a spherical mirror. Experimental results, encompassing both synthetic and real-world data, are presented to illustrate the feasibility and accuracy of our approach.
本文提出了一种利用包含球面镜的单幅视图进行相机标定的新技术。通过利用图像中可见的球体轮廓及其反射的独特特性,我们展示了该方法在实现精确标定方面的有效性。此外,镜面反射还提供了图像画幅之外周围场景的额外信息。我们的方法为简单折反射(catadioptric)立体系统的发展铺平了道路。我们探讨了使用单个镜面球所带来的挑战与机遇,并强调了这种装置在实际场景中的潜在应用。本文深入剖析了利用球面镜实现折反射立体所涉及的几何与标定过程的细节。文中给出了涵盖合成数据和真实数据的实验结果,以说明我们方法的可行性和准确性。
https://arxiv.org/abs/2409.16386
Radiotherapy requires precise segmentation of organs at risk (OARs) and of the Clinical Target Volume (CTV) to maximize treatment efficacy and minimize toxicity. While deep learning (DL) has significantly advanced automatic contouring, complex targets like CTVs remain challenging. This study explores the use of simpler, well-segmented structures (e.g., OARs) as Anatomical Prior (AP) information to improve CTV segmentation. We investigate gender bias in segmentation models and the mitigation effect of the prior information. Findings indicate that incorporating prior knowledge with the discussed strategies enhances segmentation quality in female patients and reduces gender bias, particularly in the abdomen region. This research provides a comparative analysis of new encoding strategies and highlights the potential of using AP to achieve fairer segmentation outcomes.
放射治疗需要对危及器官(OAR)和临床靶体积(CTV)进行精确分割,以最大化治疗效果并最小化毒性。尽管深度学习(DL)已显著推进了自动勾画,但 CTV 等复杂靶区仍然具有挑战性。本研究探讨了使用更简单、分割良好的结构(如 OAR)作为解剖先验(AP)信息来改进 CTV 分割。我们考察了分割模型中的性别偏差以及先验信息的缓解作用。研究结果表明,结合所讨论的策略引入先验知识可以提高女性患者的分割质量并减少性别偏差,尤其是在腹部区域。这项研究对新的编码策略进行了比较分析,并强调了利用 AP 实现更公平分割结果的潜力。
https://arxiv.org/abs/2409.15888
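One simple way to feed an anatomical prior to a segmentation network is channel concatenation; this is only one of several encoding strategies such a study might compare, and the sketch below is an illustrative assumption, not the paper's best-performing variant.

```python
import numpy as np

def with_anatomical_prior(image: np.ndarray, oar_mask: np.ndarray) -> np.ndarray:
    # Encode the anatomical prior by stacking the (well-segmented) OAR mask
    # as an extra input channel, so a CTV network can condition on it.
    assert image.shape == oar_mask.shape, "image and mask must be aligned"
    return np.stack([image, oar_mask.astype(image.dtype)], axis=0)
```

The network's first convolution then simply accepts two input channels instead of one; no other architectural change is required.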
We introduce a dual contouring method that provides state-of-the-art performance for occupancy functions while achieving computation times of a few seconds. Our method is learning-free and carefully designed to maximize the use of GPU parallelization. The recent surge of implicit neural representations has led to significant attention to occupancy fields, resulting in a wide range of 3D reconstruction and generation methods based on them. However, the outputs of such methods have been underestimated due to the bottleneck in converting the resulting occupancy function to a mesh. Marching Cubes tends to produce staircase-like artifacts, and most subsequent works focusing on exploiting signed distance functions as input also yield suboptimal results for occupancy functions. Based on Manifold Dual Contouring (MDC), we propose Occupancy-Based Dual Contouring (ODC), which mainly modifies the computation of grid edge points (1D points) and grid cell points (3D points) to not use any distance information. We introduce auxiliary 2D points that are used to compute local surface normals along with the 1D points, helping identify 3D points via the quadric error function. To search the 1D, 2D, and 3D points, we develop fast algorithms that are parallelizable across all grid edges, faces, and cells. Our experiments with several 3D neural generative models and a 3D mesh dataset demonstrate that our method achieves the best fidelity compared to prior works.
我们提出了一种双重轮廓(dual contouring)方法,在占据函数上达到最先进的性能,同时计算时间仅为几秒。该方法无需学习,并经过精心设计以最大化利用 GPU 并行化。近来隐式神经表示的兴起使占据场受到广泛关注,催生了大量基于占据场的 3D 重建与生成方法。然而,由于将所得占据函数转换为网格这一瓶颈,这些方法的输出一直被低估。Marching Cubes 容易产生阶梯状伪影,而大多数后续以有符号距离函数为输入的工作在占据函数上也效果欠佳。基于流形双重轮廓(Manifold Dual Contouring, MDC),我们提出了基于占据的双重轮廓(ODC),其主要改动是在计算网格边点(1D 点)和网格单元点(3D 点)时不使用任何距离信息。我们引入辅助 2D 点,与 1D 点一起用于计算局部表面法线,从而通过二次误差函数确定 3D 点。为了搜索 1D、2D 和 3D 点,我们开发了可在所有网格边、面和单元上并行的快速算法。在多个 3D 神经生成模型和一个 3D 网格数据集上的实验表明,与已有工作相比,我们的方法取得了最佳的保真度。
https://arxiv.org/abs/2409.13418
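The core difficulty the abstract mentions is that an occupancy function gives only inside/outside, with no signed distance to interpolate along a grid edge. A standard workaround is bisection between an inside and an outside sample; the sketch below is an assumed, simplified version of such a 1D-point search, not the paper's actual implementation.

```python
import numpy as np

def edge_crossing(occ, p_inside, p_outside, n_steps=30):
    # Locate the surface crossing on a grid edge from occupancy alone:
    # repeatedly halve the interval between an inside and an outside sample.
    # occ(p) -> bool returns True when point p is inside the shape.
    a = np.asarray(p_inside, float)
    b = np.asarray(p_outside, float)
    for _ in range(n_steps):
        m = 0.5 * (a + b)
        if occ(m):
            a = m          # midpoint inside: crossing lies in [m, b]
        else:
            b = m          # midpoint outside: crossing lies in [a, m]
    return 0.5 * (a + b)
```

Because each grid edge is independent, this search is trivially parallelizable across all edges, matching the GPU-friendly design the abstract describes.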
Modeling the natural contour of fundamental frequency (F0) plays a critical role in music audio synthesis. However, transcribing and managing multiple F0 contours in polyphonic music is challenging, and explicit F0 contour modeling has not yet been explored for polyphonic instrumental synthesis. In this paper, we present ViolinDiff, a two-stage diffusion-based synthesis framework. For a given violin MIDI file, the first stage estimates the F0 contour as pitch bend information, and the second stage generates mel spectrogram incorporating these expressive details. The quantitative metrics and listening test results show that the proposed model generates more realistic violin sounds than the model without explicit pitch bend modeling. Audio samples are available online: this http URL.
建模基频(F0)的自然轮廓在音乐音频合成中起着关键作用。然而,在复调音乐中转录和管理多条 F0 轮廓颇具挑战,显式 F0 轮廓建模也尚未在复调乐器合成中得到探索。在本文中,我们提出了 ViolinDiff,一个基于扩散的两阶段合成框架。对于给定的小提琴 MIDI 文件,第一阶段将 F0 轮廓估计为弯音(pitch bend)信息,第二阶段生成融入这些表现细节的梅尔频谱图。定量指标和听测结果表明,与不进行显式弯音建模的模型相比,所提模型生成的小提琴声音更加逼真。音频样例可在线获取:this http URL。
https://arxiv.org/abs/2409.12477
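Representing an F0 contour as pitch bend means encoding, per frame, the deviation of F0 from the notated MIDI pitch. A minimal sketch of that encoding, assuming the standard 14-bit MIDI pitch-bend message and a ±2-semitone bend range (these conventions are assumptions; the paper's exact parameterization may differ):

```python
import math

def f0_to_pitch_bend(f0_hz: float, base_midi_note: int,
                     bend_range_semitones: float = 2.0) -> int:
    # Deviation of the F0 contour from the notated pitch, expressed as a
    # 14-bit MIDI pitch-bend value (8192 = no bend, 0..16383 overall).
    midi = 69 + 12 * math.log2(f0_hz / 440.0)     # Hz -> fractional MIDI pitch
    dev = midi - base_midi_note                   # deviation in semitones
    dev = max(-bend_range_semitones, min(bend_range_semitones, dev))
    return int(round(8192 + dev / bend_range_semitones * 8191))
```

Applied frame by frame, this turns a continuous F0 track into the bend sequence the first stage of such a pipeline would predict.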
Camouflaged object detection (COD) aims to segment camouflaged objects which exhibit very similar patterns to the surrounding environment. Recent research has shown that enhancing the feature representation via frequency information can greatly alleviate the ambiguity problem between foreground objects and the background. With the emergence of vision foundation models such as InternImage and the Segment Anything Model, adapting a pretrained model to COD tasks with a lightweight adapter module is a novel and promising research direction. Existing adapter modules mainly address feature adaptation in the spatial domain. In this paper, we propose a novel frequency-guided spatial adaptation method for the COD task. Specifically, we transform the input features of the adapter into the frequency domain. By grouping and interacting with frequency components located within non-overlapping circles in the spectrogram, different frequency components are dynamically enhanced or weakened, so that the intensity of image details and contour features is adaptively adjusted. At the same time, features that are conducive to distinguishing object from background are highlighted, indirectly implying the position and shape of the camouflaged object. We conduct extensive experiments on four widely adopted benchmark datasets, and the proposed method outperforms 26 state-of-the-art methods by large margins. Code will be released.
伪装物体检测(COD)旨在分割与周围环境模式极为相似的伪装物体。最近的研究表明,利用频率信息增强特征表示可以极大缓解前景物体与背景之间的模糊问题。随着 InternImage、Segment Anything Model 等视觉基础模型的出现,通过轻量级适配器模块将预训练模型适配到 COD 任务展现出新颖且有前景的研究方向。现有的适配器模块主要关注空间域内的特征适配。在本文中,我们为 COD 任务提出了一种新的频率引导空间适配方法。具体来说,我们将适配器的输入特征变换到频域,通过对位于频谱图中互不重叠圆环内的频率分量进行分组和交互,动态增强或削弱不同频率分量,使图像细节和轮廓特征的强度得到自适应调整;同时突出有助于区分物体与背景的特征,间接暗示伪装物体的位置和形状。我们在四个广泛采用的基准数据集上进行了大量实验,所提方法以较大优势超过了 26 种最先进方法。代码将发布。
https://arxiv.org/abs/2409.12421
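Grouping frequency components by non-overlapping radial regions of the spectrum and re-weighting each group can be sketched in a few lines; the ring partition and per-ring gains below are a minimal stand-in for the paper's learned grouping-and-interaction mechanism, not its actual module.

```python
import numpy as np

def band_scale(feat: np.ndarray, gains: list[float]) -> np.ndarray:
    # Split the 2-D spectrum of a feature map into len(gains) non-overlapping
    # radial rings and scale each ring, dynamically enhancing or weakening
    # different frequency components.
    F = np.fft.fftshift(np.fft.fft2(feat))
    h, w = feat.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - h / 2, xx - w / 2)                  # radius from DC
    edges = np.linspace(0, r.max() + 1e-6, len(gains) + 1)
    for g, lo, hi in zip(gains, edges[:-1], edges[1:]):
        F[(r >= lo) & (r < hi)] *= g                      # re-weight one ring
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))
```

Gains of all ones recover the input exactly; suppressing the outer rings softens fine detail while keeping large-scale contours, which is the adjustment the abstract describes.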
Cervical cancer remains the fourth most common malignancy amongst women worldwide [1]. Concurrent chemoradiotherapy (CRT) serves as the mainstay definitive treatment regimen for locally advanced cervical cancers and includes external beam radiation followed by brachytherapy [2]. Integral to radiotherapy treatment planning is the routine contouring of the target tumor at the level of the cervix, the associated gynecologic anatomy, and the adjacent organs at risk (OARs). However, manual contouring of these structures is both time- and labor-intensive and associated with known interobserver variability that can impact treatment outcomes. While multiple tools have been developed to automatically segment OARs and the high-risk clinical target volume (HR-CTV) using computed tomography (CT) images [3,4,5,6], the development of deep learning-based tumor segmentation tools using routine T2-weighted (T2w) magnetic resonance imaging (MRI) addresses an unmet clinical need to improve the routine contouring of both anatomical structures and cervical cancers, thereby increasing the quality and consistency of radiotherapy planning. This work applied a novel deep-learning model (PocketNet) to segment the cervix, vagina, uterus, and tumor(s) on T2w MRI. The performance of the PocketNet architecture was evaluated via 5-fold cross-validation. PocketNet achieved a mean Dice-Sorensen similarity coefficient (DSC) exceeding 70% for tumor segmentation and 80% for organ segmentation. These results suggest that PocketNet is robust to variations in contrast protocols, providing reliable segmentation of the ROIs.
宫颈癌仍然是全球女性第四常见的恶性肿瘤 [1]。同步放化疗(CRT)是局部晚期宫颈癌的主要根治性治疗方案,包括外照射放疗及随后的近距离放疗 [2]。放疗计划的一个重要环节是常规勾画宫颈水平的靶肿瘤、相关妇科解剖结构以及邻近的危及器官(OAR)。然而,手动勾画这些结构既费时又费力,并且存在已知的观察者间变异性,可能影响治疗结果。虽然已有多种工具可基于计算机断层扫描(CT)图像自动分割 OAR 和高危临床靶体积(HR-CTV)[3,4,5,6],但基于常规 T2 加权(T2w)磁共振成像(MRI)开发深度学习肿瘤分割工具,满足了改进解剖结构与宫颈肿瘤常规勾画这一尚未满足的临床需求,从而提升放疗计划的质量和一致性。这项工作将一种新型深度学习模型(PocketNet)应用于在 T2w MRI 上分割宫颈、阴道、子宫和肿瘤。通过 5 折交叉验证评估了 PocketNet 架构的性能:其肿瘤分割的平均 Dice-Sorensen 相似系数(DSC)超过 70%,器官分割超过 80%。这些结果表明 PocketNet 对对比度协议的变化具有鲁棒性,能够为感兴趣区域(ROI)提供可靠的分割。
https://arxiv.org/abs/2409.11456
While success in many robotics tasks can be determined by only observing the final state and how it differs from the initial state - e.g., if an apple is picked up - many tasks require observing the full motion of the robot to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs are trained only on single frames, and cannot capture changes over a full trajectory. Second, even if we provide state-of-the-art VLMs with an aggregated input of multiple frames, they still fail to detect success due to a lack of robot data. Our key idea is to fine-tune VLMs using abstract representations that are able to capture trajectory-level information, such as the path the robot takes, by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs using the aforementioned abstract representations to semantically ground the robot's behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset containing 653 human and 369 robot demonstrations across 13 task categories. MotIF assesses the success of robot motion given the image observation of the trajectory, task instruction, and motion description. Our model significantly outperforms state-of-the-art VLMs, achieving at least twice the precision and 56.1% higher recall, and generalizing across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in refining and terminating robot planning, and ranking trajectories on how they align with task and motion descriptions. Project page: this https URL
虽然许多机器人任务的成功与否仅通过观察最终状态及其与初始状态的差异即可判断(例如苹果是否被拿起),但许多任务需要观察机器人的完整运动才能正确判断成功。例如,梳头需要与发型轮廓和发质相应的反复梳理动作。以往的工作通常使用现成的视觉语言模型(VLM)作为成功检测器;然而,当成功取决于完整轨迹时,VLM 因两个原因难以做出正确判断。首先,现代 VLM 仅在单帧上训练,无法捕捉整条轨迹上的变化。其次,即使向最先进的 VLM 提供多帧聚合输入,由于缺乏机器人数据,它们仍然无法检测成功。我们的核心思路是使用能够捕捉轨迹级信息的抽象表示来微调 VLM,例如将关键点轨迹叠加在最终图像上以呈现机器人走过的路径。我们提出了运动指令微调(MotIF),一种利用上述抽象表示微调 VLM、使机器人行为在环境中获得语义落地的方法。为了对 VLM 的机器人运动理解能力进行基准测试和微调,我们引入了 MotIF-1K 数据集,包含 13 个任务类别下的 653 个人类演示和 369 个机器人演示。MotIF 根据轨迹的图像观察、任务指令和运动描述来评估机器人运动是否成功。我们的模型显著优于最先进的 VLM:精度至少是其两倍,召回率高出 56.1%,并能泛化到未见过的运动、任务和环境。最后,我们展示了 MotIF 在改进和终止机器人规划,以及按与任务和运动描述的契合度对轨迹进行排序方面的实际应用。项目页面:this https URL
https://arxiv.org/abs/2409.10683
Robot interaction control is often limited to low dynamics or low flexibility, depending on whether an active or passive approach is chosen. In this work, we introduce a hybrid control scheme that combines the advantages of active and passive interaction control. To accomplish this, we propose the design of a novel Active Remote Center of Compliance (ARCC), which is based on a passive and an active element and can be used to directly control the interaction forces. We introduce surrogate models for a dynamic comparison against purely robot-based interaction schemes. In a comparative validation, ARCC drastically improves the interaction dynamics, leading to an increase in the motion bandwidth of up to 31 times. We further describe our control approach as well as its integration into the robot controller. Finally, we analyze ARCC on different industrial benchmarks, such as peg-in-hole, top-hat rail assembly, and contour following problems, and compare it against the state of the art to highlight its dynamics and flexibility. The proposed system is especially suited if the application requires a low cycle time combined with sensitive manipulation.
机器人交互控制通常局限于低动态或低灵活性,取决于选择主动还是被动方法。在这项工作中,我们提出了一种结合主动与被动交互控制优点的混合控制方案。为此,我们设计了一种新型主动远程柔顺中心(ARCC),它基于一个被动元件和一个主动元件,可用于直接控制交互力。我们引入代理模型,以便与纯机器人式交互方案进行动态比较。在对比验证中,ARCC 大幅改善了交互动态,使运动带宽提升至最高 31 倍。我们进一步介绍了我们的控制方法及其在机器人控制器中的集成。最后,我们在轴孔装配、导轨装配和轮廓跟随等不同工业基准上分析了 ARCC,并与最先进技术进行比较,以突出其动态性和灵活性。当应用需要短节拍时间与灵敏操作相结合时,所提出的系统尤为适用。
https://arxiv.org/abs/2409.10024
In Mandarin, the tonal contours of monosyllabic words produced in isolation or in careful speech are characterized by four lexical tones: a high-level tone (T1), a rising tone (T2), a dipping tone (T3) and a falling tone (T4). However, in spontaneous speech, the actual tonal realization of monosyllabic words can deviate significantly from these canonical tones due to intra-syllabic co-articulation and inter-syllabic co-articulation with adjacent tones. In addition, Chuang et al. (2024) recently reported that the tonal contours of disyllabic Mandarin words with the T2-T4 tone pattern are co-determined by their meanings. Following up on their research, we present a corpus-based investigation of how the pitch contours of monosyllabic words are realized in spontaneous conversational Mandarin, focusing on the effects of contextual predictors on the one hand, and the way in which words' meanings co-determine pitch contours on the other. We analyze the F0 contours of 3824 tokens of 63 different word types in a spontaneous Taiwan Mandarin corpus, using the generalized additive (mixed) model to decompose a given observed pitch contour into a set of component pitch contours. We show that the tonal context substantially modifies a word's canonical tone. Once the effect of tonal context is controlled for, T2 and T3 emerge as low flat tones, contrasting with T1 as a high tone and with T4 as a high-to-mid falling tone. The neutral tone (T0), which in standard descriptions is realized based on the preceding tone, emerges as a low tone in its own right, modified by the other predictors in the same way as the standard tones T1, T2, T3, and T4. We also show that word, and even more so word sense, co-determine words' F0 contours. Analyses of variable importance using random forests further supported the substantial effect of tonal context and an effect of word sense.
在普通话中,孤立发音或仔细发音时单音节词的声调轮廓由四个词汇声调刻画:高平调(T1)、升调(T2)、曲折调(T3)和降调(T4)。然而,在自发言语中,由于音节内协同发音以及与相邻声调的音节间协同发音,单音节词的实际声调实现可能显著偏离这些标准声调。此外,Chuang 等人(2024)最近报告,具有 T2-T4 声调模式的双音节普通话词的声调轮廓由其词义共同决定。延续他们的研究,我们开展了一项基于语料库的调查,研究自发会话普通话中单音节词音高轮廓的实现方式,一方面关注语境预测变量的影响,另一方面关注词义如何共同决定音高轮廓。我们利用广义可加(混合)模型,将观察到的音高轮廓分解为一组成分音高轮廓,分析了一个自发台湾普通话语料库中 63 个词型、共 3824 个词例的 F0 轮廓。我们发现,声调语境会显著改变一个词的标准声调。在控制声调语境的影响之后,T2 和 T3 表现为低平调,与作为高调的 T1 和作为高降至中的降调 T4 形成对比。轻声(T0)在标准描述中依前一声调而实现,但在这里表现为一个独立的低调,并以与标准声调 T1、T2、T3、T4 相同的方式受到其他预测变量的调节。我们还表明,词、尤其是词义,共同决定词的 F0 轮廓。使用随机森林进行的变量重要性分析进一步支持了声调语境的显著影响以及词义的影响。
https://arxiv.org/abs/2409.07891
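Before any model-based decomposition, pitch tokens of different durations must be brought onto a common time axis. The sketch below time-normalizes each token and averages within tone category; it is a crude, model-free stand-in for the GAM decomposition used in the paper, intended only to show the data preparation.

```python
import numpy as np

def mean_tone_contours(tokens, n_points=10):
    # tokens: iterable of (tone_label, f0_samples) pairs with varying lengths.
    # Resample every token's F0 track to n_points on a [0, 1] time axis,
    # then average within each tone category.
    sums, counts = {}, {}
    for tone, f0 in tokens:
        t = np.linspace(0, 1, len(f0))
        grid = np.linspace(0, 1, n_points)
        resampled = np.interp(grid, t, f0)     # time-normalized contour
        sums[tone] = sums.get(tone, 0) + resampled
        counts[tone] = counts.get(tone, 0) + 1
    return {tone: sums[tone] / counts[tone] for tone in sums}
```

A GAM goes further by additionally fitting smooth component contours for tonal context, word, and word sense, rather than a single per-tone mean.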
Visual servoing for the development of autonomous robotic systems capable of administering UltraSound (US) guided regional anesthesia requires real-time segmentation of nerves, needle tip localization, and needle trajectory extrapolation. First, we recruited 227 patients to build a large dataset of 41,000 anesthesiologist-annotated images from US videos of brachial plexus nerves and developed models to localize nerves in the US images. Generalizability of the best-suited model was tested on datasets constructed from separate US scanners. Using these nerve segmentation predictions, we define automated anesthesia needle targets by fitting an ellipse to the nerve contours. Next, we developed an image analysis tool to guide the needle toward its target. For the segmentation of the needle, a neural network pre-trained on natural RGB images was first fine-tuned on a large US dataset for domain transfer and then adapted to the needle using a small dataset. The segmented needle's trajectory angle is calculated using the Radon transform, and the trajectory is extrapolated from the needle tip. The intersection of the extrapolated trajectory with the needle target guides the needle navigation for drug delivery. The average needle trajectory error was within the 5 mm range deemed acceptable by experienced anesthesiologists. The entire dataset has been released publicly for further study by the research community at this https URL
开发能够实施超声(US)引导区域麻醉的自主机器人系统,其视觉伺服需要对神经进行实时分割、对针尖进行定位并外推针的轨迹。首先,我们招募了 227 名患者,从臂丛神经超声视频中构建了一个包含 41,000 张由麻醉医师标注图像的大型数据集,并开发了在超声图像中定位神经的模型。最合适模型的泛化能力在由不同超声扫描仪构建的数据集上进行了测试。利用这些神经分割预测,我们通过对神经轮廓拟合椭圆来定义自动麻醉进针靶点。接下来,我们开发了一个图像分析工具来引导针朝向靶点。在针的分割方面,我们先将一个在自然 RGB 图像上预训练的神经网络在大型超声数据集上微调以实现领域迁移,再用小数据集使其适配于针。分割出的针轨迹角度通过 Radon 变换计算,并从针尖外推轨迹。外推轨迹与进针靶点的交点引导进针导航以实施给药。针轨迹的平均误差在经验丰富的麻醉医师认可的 5 毫米可接受范围之内。整个数据集已公开发布,供研究社区进一步研究:this https URL
https://arxiv.org/abs/2308.03717
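The trajectory-angle step can be illustrated without a full Radon transform: for a thin, roughly linear needle mask, PCA of the segmented pixel coordinates yields the same dominant direction. The sketch below uses this PCA substitute (an assumption for brevity, not the paper's Radon-based method).

```python
import numpy as np

def needle_direction(mask: np.ndarray) -> float:
    # Dominant direction (degrees in [0, 180)) of the segmented needle
    # pixels via PCA, a lightweight substitute for a Radon-transform
    # angle estimate on a thin linear structure.
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)                      # center the point cloud
    _, _, vt = np.linalg.svd(pts, full_matrices=False)
    d = vt[0]                                    # first principal axis
    return np.degrees(np.arctan2(d[1], d[0])) % 180.0
```

Extrapolation then amounts to extending a line from the detected tip along this angle until it intersects the ellipse-fitted target.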
The availability of high-quality datasets plays a crucial role in advancing research and development, especially for safety-critical and autonomous systems. In this paper, we present AssistTaxi, a comprehensive novel dataset which is a collection of images for runway and taxiway analysis. The dataset comprises more than 300,000 frames of diverse and carefully collected data, gathered from Melbourne (MLB) and Grant-Valkaria (X59) general aviation airports. The importance of AssistTaxi lies in its potential to advance autonomous operations, enabling researchers and developers to train and evaluate algorithms for efficient and safe taxiing. Researchers can utilize AssistTaxi to benchmark their algorithms, assess performance, and explore novel approaches for runway and taxiway analysis. Additionally, the dataset serves as a valuable resource for validating and enhancing existing algorithms, facilitating innovation in autonomous operations for aviation. We also propose an initial approach to label the dataset using a contour-based detection and line extraction technique.
高质量数据集的可用性对推动研究与开发起着关键作用,尤其是对安全关键系统和自主系统而言。在本文中,我们提出了 AssistTaxi,一个用于跑道和滑行道分析的全面的新型图像数据集。该数据集包含从墨尔本(MLB)和 Grant-Valkaria(X59)通用航空机场采集的超过 300,000 帧多样且精心收集的数据。AssistTaxi 的重要性在于其推动自主运行的潜力,使研究人员和开发者能够训练和评估实现高效、安全滑行的算法。研究人员可以利用 AssistTaxi 对算法进行基准测试、评估性能,并探索跑道与滑行道分析的新方法。此外,该数据集还是验证和增强现有算法的宝贵资源,促进航空自主运行领域的创新。我们还提出了一种基于轮廓检测和直线提取技术的初步数据标注方法。
https://arxiv.org/abs/2409.06856
Recycled and recirculated books, such as ancient texts and reused textbooks, hold significant value in the second-hand goods market, with their worth largely dependent on surface preservation. However, accurately assessing surface defects is challenging due to wide variations in defect shape and size and the often imprecise nature of defect detection. To address these issues, we propose DDNet, an innovative detection model designed to enhance defect localization and classification. DDNet introduces a surface defect feature extraction module based on a deformable convolution operator (DC) and a densely connected FPN module (DFPN). The DC module dynamically adjusts the convolution grid to better align with object contours, capturing subtle shape variations and improving boundary delineation and prediction accuracy. Meanwhile, DFPN leverages dense skip connections to enhance feature fusion, constructing a hierarchical structure that generates multi-resolution, high-fidelity feature maps, thus effectively detecting defects of various sizes. In addition to the model, we present a comprehensive dataset specifically curated for surface defect detection in recycled and recirculated books. This dataset encompasses a diverse range of defect types, shapes, and sizes, making it ideal for evaluating the robustness and effectiveness of defect detection models. Through extensive evaluations, DDNet achieves precise localization and classification of surface defects, recording a mAP value of 46.7% on our proprietary dataset - an improvement of 14.2% over the baseline model - demonstrating its superior detection capabilities.
回收和再流通的书籍,如古籍和重复使用的教科书,在二手商品市场上具有重要价值,其价值在很大程度上取决于表面保存状况。然而,由于缺陷的形状和大小差异巨大,且缺陷检测往往不够精确,准确评估表面缺陷颇具挑战。为此,我们提出了 DDNet,一种旨在增强缺陷定位与分类的创新检测模型。DDNet 引入了基于可变形卷积算子(DC)的表面缺陷特征提取模块和密集连接的 FPN 模块(DFPN)。DC 模块动态调整卷积网格以更好地贴合物体轮廓,捕捉细微的形状变化,提高边界描绘和预测精度。同时,DFPN 利用密集跳跃连接增强特征融合,构建出能生成多分辨率、高保真特征图的层次结构,从而有效检测各种尺寸的缺陷。除了模型之外,我们还提供了一个专门为回收与再流通书籍表面缺陷检测整理的全面数据集。该数据集涵盖多种缺陷类型、形状和大小,是评估缺陷检测模型鲁棒性和有效性的理想数据。通过广泛评估,DDNet 实现了表面缺陷的精确定位与分类,在我们的自建数据集上取得 46.7% 的 mAP,比基线模型提高 14.2%,展示了其卓越的检测能力。
https://arxiv.org/abs/2409.04958
Edge detection, as a fundamental task in computer vision, has garnered increasing attention. The advent of deep learning has significantly advanced this field. However, recent deep learning-based methods which rely on large-scale pre-trained weights cannot be trained from scratch, with very limited research addressing this issue. This paper proposes a novel cycle pixel difference convolution (CPDC), which effectively integrates image gradient information with modern convolution operations. Based on the CPDC, we develop a U-shape encoder-decoder model named CPD-Net, which is a purely end-to-end network. Additionally, to address the issue of edge thickness produced by most existing methods, we construct a multi-scale information enhancement module (MSEM) to enhance the discriminative ability of the model, thereby generating crisp and clean contour maps. Comprehensive experiments conducted on three standard benchmarks demonstrate that our method achieves competitive performance on the BSDS500 dataset (ODS=0.813), NYUD-V2 (ODS=0.760), and BIPED dataset (ODS=0.898). Our approach provides a novel perspective for addressing these challenges in edge detection.
边缘检测作为计算机视觉中的一项基础任务,受到越来越多的关注。深度学习的出现显著推动了该领域的发展。然而,最近依赖大规模预训练权重的深度学习方法无法从零开始训练,而针对这一问题的研究非常有限。本文提出了一种新颖的循环像素差分卷积(CPDC),它有效地将图像梯度信息与现代卷积操作相结合。基于 CPDC,我们开发了一个名为 CPD-Net 的 U 形编码器-解码器模型,它是一个纯端到端网络。此外,为了解决大多数现有方法产生的边缘过粗问题,我们构建了多尺度信息增强模块(MSEM)来提升模型的判别能力,从而生成清晰干净的轮廓图。在三个标准基准上的全面实验表明,我们的方法在 BSDS500 数据集(ODS=0.813)、NYUD-V2(ODS=0.760)和 BIPED 数据集(ODS=0.898)上取得了具有竞争力的性能。我们的方法为应对边缘检测中的这些挑战提供了一个新颖的视角。
https://arxiv.org/abs/2409.04272
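The family of pixel-difference convolutions integrates gradient information by replacing each neighbor with its difference from the center pixel before the weighted sum. The generic sketch below shows that idea; the paper's "cycle" variant differs in how differences are paired, so this is an illustrative baseline, not CPDC itself.

```python
import numpy as np

def pixel_diff_conv(img: np.ndarray, w: np.ndarray) -> np.ndarray:
    # Generic 3x3 pixel-difference convolution: each neighbor is replaced
    # by (neighbor - center) before the weighted sum, so the response
    # encodes local gradient information rather than raw intensities.
    h, wd = img.shape
    out = np.zeros((h - 2, wd - 2))
    for i in range(1, h - 1):
        for j in range(1, wd - 1):
            patch = img[i - 1:i + 2, j - 1:j + 2] - img[i, j]
            out[i - 1, j - 1] = float((patch * w).sum())
    return out
```

By construction the response on constant regions is exactly zero, which is why such operators highlight edges regardless of the learned weights.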
Increased usage of automated tools like deep learning in medical image segmentation has alleviated the bottleneck of manual contouring. This has shifted manual labour to quality assessment (QA) of automated contours, which involves detecting errors and correcting them. A potential solution to semi-automated QA is to use deep Bayesian uncertainty to recommend potentially erroneous regions, thus reducing time spent on error detection. Previous work has investigated the correspondence between uncertainty and error; however, no work has been done on improving the "utility" of Bayesian uncertainty maps such that uncertainty is present only in inaccurate regions and not in accurate ones. Our work trains the FlipOut model with the Accuracy-vs-Uncertainty (AvU) loss, which promotes uncertainty to be present only in inaccurate regions. We apply this method on datasets of two radiotherapy body sites, i.e., head-and-neck CT and prostate MR scans. Uncertainty heatmaps (i.e. predictive entropy) are evaluated against voxel inaccuracies using Receiver Operating Characteristic (ROC) and Precision-Recall (PR) curves. Numerical results show that when compared to the Bayesian baseline the proposed method successfully suppresses uncertainty for accurate voxels, with similar presence of uncertainty for inaccurate voxels. Code to reproduce experiments is available at this https URL
在医学图像分割中,深度学习等自动化工具的日益普及缓解了手动勾画的瓶颈,但这也将人工劳动转移到了对自动勾画结果的质量评估(QA)上,即检测并纠正错误。半自动化 QA 的一个潜在方案是利用深度贝叶斯不确定性来推荐可能出错的区域,从而减少花在错误检测上的时间。以往工作研究了不确定性与错误之间的对应关系,但尚无工作致力于提升贝叶斯不确定性图的"实用性",使其只出现在不准确的区域而不出现在准确的区域。我们的工作使用准确性-不确定性(AvU)损失训练 FlipOut 模型,促使不确定性只出现在不准确的区域。我们在两个放疗部位的数据集上应用了该方法,即头颈部 CT 和前列腺 MR 扫描。我们使用受试者工作特征(ROC)曲线和精确率-召回率(PR)曲线,将不确定性热图(即预测熵)与体素级错误进行对比评估。数值结果表明,与贝叶斯基线相比,所提方法成功抑制了准确体素上的不确定性,同时在不准确体素上保持了相近的不确定性水平。复现实验的代码见 this https URL
https://arxiv.org/abs/2409.03470
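The quantity behind the AvU loss is easy to state: the fraction of voxels that are either accurate-and-certain or inaccurate-and-uncertain. The sketch below computes this (non-differentiable) metric from binary masks; the loss used for training optimizes a soft, differentiable version of it.

```python
import numpy as np

def avu(accurate, uncertain) -> float:
    # Accuracy-vs-Uncertainty: fraction of voxels that fall in the two
    # "desirable" cells of the 2x2 accuracy/uncertainty table:
    # accurate & certain (n_AC) or inaccurate & uncertain (n_IU).
    accurate = np.asarray(accurate, bool)
    uncertain = np.asarray(uncertain, bool)
    n_ac = np.sum(accurate & ~uncertain)
    n_iu = np.sum(~accurate & uncertain)
    return float(n_ac + n_iu) / accurate.size
```

An ideal uncertainty map scores 1.0; a map that is uncertain exactly where the prediction is accurate scores 0.0, which is the failure mode the loss penalizes.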