Recently, there has been high demand for accelerating and improving automatic cadastral mapping. Because this problem is at an early stage, many computer vision and deep learning methods have not yet been considered for it. In this paper, we focus on deep learning and provide three geometric post-processing methods that improve the quality of the results. Our framework consists of two parts, each comprising a few phases, and our solution is based on instance segmentation. In the first part, we use Mask R-CNN with a ResNet-50 backbone pre-trained on the ImageNet dataset. In the second part, we apply three geometric post-processing methods to the output of the first part to obtain better overall output. Here, we also use computational geometry to introduce a new line-simplification method, which we call the pocket-based simplification algorithm. To evaluate the quality of our solution, we use the standard metrics in this field: recall, precision, and F-score. The highest recall we achieve is 95 percent, while maintaining a high precision of 72 percent; this yields an F-score of 82 percent. Implementing instance segmentation with Mask R-CNN and applying geometric post-processing to its output gives promising results for this field. The results also show that the pocket-based simplification algorithm simplifies lines better than the Douglas-Peucker algorithm.
https://arxiv.org/abs/2309.16708
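The Douglas-Peucker algorithm that the abstract uses as its line-simplification baseline is well known; here is a minimal sketch (the function name and tolerance parameter are illustrative, not taken from the paper):

```python
import numpy as np

def douglas_peucker(points, epsilon):
    """Recursively simplify a polyline: an interior point survives only if it
    deviates from the chord between the endpoints by more than epsilon."""
    points = np.asarray(points, dtype=float)
    if len(points) < 3:
        return points.tolist()
    start, end = points[0], points[-1]
    chord = end - start
    norm = np.linalg.norm(chord)
    if norm == 0.0:
        dists = np.linalg.norm(points - start, axis=1)
    else:
        # Perpendicular distance of each point to the start-end chord.
        dists = np.abs(chord[0] * (points[:, 1] - start[1])
                       - chord[1] * (points[:, 0] - start[0])) / norm
    idx = int(np.argmax(dists))
    if dists[idx] > epsilon:
        # Keep the farthest point and simplify both halves around it.
        left = douglas_peucker(points[:idx + 1], epsilon)
        right = douglas_peucker(points[idx:], epsilon)
        return left[:-1] + right
    return [start.tolist(), end.tolist()]
```

As a sanity check on the reported metrics, F = 2PR/(P + R) = 2 x 0.72 x 0.95 / (0.72 + 0.95) ≈ 0.82, consistent with the stated 82 percent.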
Due to the high similarity between camouflaged instances and the background, the recently proposed camouflaged instance segmentation (CIS) faces challenges in accurate localization and instance segmentation. To this end, inspired by query-based transformers, we propose a unified query-based multi-task learning framework for camouflaged instance segmentation, termed UQFormer, which builds a set of mask queries and a set of boundary queries to learn a shared composed query representation and efficiently integrates global camouflaged object region and boundary cues, for simultaneous instance segmentation and instance boundary detection in camouflaged scenarios. Specifically, we design a composed query learning paradigm that learns a shared representation to capture object region and boundary features by the cross-attention interaction of mask queries and boundary queries in the designed multi-scale unified learning transformer decoder. Then, we present a transformer-based multi-task learning framework for simultaneous camouflaged instance segmentation and camouflaged instance boundary detection based on the learned composed query representation, which also forces the model to learn a strong instance-level query representation. Notably, our model views the instance segmentation as a query-based direct set prediction problem, without other post-processing such as non-maximal suppression. Compared with 14 state-of-the-art approaches, our UQFormer significantly improves the performance of camouflaged instance segmentation. Our code will be available at this https URL.
https://arxiv.org/abs/2308.07392
Human-AI collaborative writing has been greatly facilitated by modern large language models (LLMs), e.g., ChatGPT. While acknowledging the convenience brought by this technological advancement, educators are also concerned that students might leverage LLMs to partially complete their writing assignments and pass off the human-AI hybrid text as their original work. Driven by such concerns, in this study we investigated the automatic detection of human-AI hybrid text in education, where we formalized hybrid text detection as a boundary detection problem, i.e., identifying the transition points between human-written content and AI-generated content. We constructed a hybrid essay dataset by partially removing sentences from original student-written essays and then instructing ChatGPT to fill in the incomplete essays. We then proposed a two-step detection approach in which we (1) separated AI-generated content from human-written content during the embedding learning process; and (2) calculated the distances between every two adjacent prototypes (a prototype is the mean of a set of consecutive sentences from the hybrid text in the embedding space) and assumed that a boundary exists between the two prototypes that are farthest from each other. Through extensive experiments, we summarized the following main findings: (1) the proposed approach consistently outperformed the baseline methods across different experiment settings; (2) the embedding learning process (i.e., step 1) can significantly boost the performance of the proposed approach; (3) when detecting boundaries in single-boundary hybrid essays, the performance of the proposed approach could be enhanced by adopting a relatively large prototype size, leading to a 22% improvement (over the second-best baseline method) in the in-domain setting and an 18% improvement in the out-of-domain setting.
https://arxiv.org/abs/2307.12267
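Step (2) of the approach above, averaging consecutive sentence embeddings into prototypes and placing the boundary at the largest gap between adjacent prototypes, can be sketched as follows. The non-overlapping windowing and the function name are illustrative assumptions, not the authors' code:

```python
import numpy as np

def detect_boundary(sentence_embeddings, prototype_size):
    """Average consecutive sentence embeddings into prototypes, then place the
    boundary between the two adjacent prototypes that are farthest apart.
    Returns the index of the first sentence after the assumed boundary."""
    E = np.asarray(sentence_embeddings, dtype=float)
    # Non-overlapping windows of `prototype_size` sentences, one prototype each.
    starts = range(0, len(E), prototype_size)
    prototypes = np.stack([E[s:s + prototype_size].mean(axis=0) for s in starts])
    # Distance between each pair of adjacent prototypes; the largest gap is
    # assumed to contain the human/AI transition point.
    gaps = np.linalg.norm(np.diff(prototypes, axis=0), axis=1)
    k = int(np.argmax(gaps))
    return (k + 1) * prototype_size
```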
Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video. A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)", which results in worse performance with lower SNR. Prior works have addressed this challenge using sophisticated techniques. In this paper, we propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection. The multi-scale neighboring attention restricts each video token to only aggregate visual contexts from its neighbor, enabling the extraction of the most distinguishing information with multi-scale feature hierarchies from high-ratio noises. The zoom-in boundary detection then focuses on local-wise discrimination of the selected top candidates for fine-grained grounding adjustment. With an end-to-end training strategy, our model achieves competitive performance on different TVG benchmarks, while also having the advantage of faster inference speed and lighter model parameters, thanks to its lightweight architecture.
https://arxiv.org/abs/2307.10567
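The multi-scale neighboring attention described above restricts each video token to its local neighborhood; a single-scale, single-head sketch of such windowed attention follows (names and masking details are assumptions, not the paper's implementation):

```python
import numpy as np

def neighboring_attention(q, k, v, window):
    """Single-head attention where each token may attend only to tokens
    within `window` positions of itself (one scale of the multi-scale idea)."""
    q, k, v = (np.asarray(x, dtype=float) for x in (q, k, v))
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Band mask: True inside the local neighborhood, False elsewhere.
    idx = np.arange(n)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)
    # Softmax over the allowed neighbors only.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, mask
```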
Biomedical named entity recognition (BNER) serves as the foundation for numerous biomedical text mining tasks. Unlike general NER, BNER requires a comprehensive grasp of the domain, and incorporating external knowledge beyond the training data poses a significant challenge. In this study, we propose a novel BNER framework called DMNER. By leveraging the existing entity representation model SAPBERT, we tackle BNER as a two-step process: entity boundary detection and biomedical entity matching. DMNER is applicable across multiple NER scenarios: 1) In supervised NER, we observe that DMNER effectively rectifies the output of baseline NER models, further enhancing performance. 2) In distantly supervised NER, combining MRC and AutoNER as span boundary detectors enables DMNER to achieve satisfactory results. 3) For training NER by merging multiple datasets, we adopt a framework similar to DS-NER but additionally leverage ChatGPT to obtain high-quality phrases during training. Through extensive experiments conducted on 10 benchmark datasets, we demonstrate the versatility and effectiveness of DMNER.
https://arxiv.org/abs/2306.15736
The Generic Event Boundary Detection (GEBD) task aims to build a model for segmenting videos into segments by detecting general event boundaries applicable to various classes. In this paper, building on last year's MAE-GEBD method, we have improved our model's performance on the GEBD task by adjusting the data processing strategy and the loss function. Based on last year's approach, we extended the application of pseudo-labels to a larger dataset and made many experimental attempts. In addition, we applied focal loss to concentrate more on difficult samples, further improving model performance. Finally, we improved the segmentation alignment strategy used last year, dynamically adjusting the segmentation alignment method according to the boundary density and duration of the video, so that our model can be more flexible and fully applicable in different situations. With our method, we achieve an F1 score of 86.03% on the Kinetics-GEBD test set, a 0.09% improvement over our 2022 Kinetics-GEBD method.
https://arxiv.org/abs/2306.15704
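The focal loss the authors adopt for hard samples is the standard formulation of Lin et al., FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). A single-prediction sketch (the alpha/gamma defaults below are the conventional ones, not necessarily this paper's settings):

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one predicted probability p of the positive
    class and true label y: easy, well-classified samples are down-weighted
    by (1 - p_t)**gamma so training concentrates on hard ones."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 0 this reduces to alpha-weighted cross-entropy; increasing gamma shrinks the loss of confident predictions much faster than that of hard ones.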
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, given the complex and varied sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
https://arxiv.org/abs/2305.01211
Medical image segmentation is a fundamental task in the medical image analysis community. In this paper, a novel network architecture, referred to as Convolution, Transformer, and Operator (CTO), is proposed. CTO employs a combination of Convolutional Neural Networks (CNNs), a Vision Transformer (ViT), and an explicit boundary detection operator to achieve high recognition accuracy while maintaining an optimal balance between accuracy and efficiency. The proposed CTO follows the standard encoder-decoder segmentation paradigm, where the encoder network incorporates a popular CNN backbone for capturing local semantic information and a lightweight ViT assistant for integrating long-range dependencies. To enhance the learning capacity on boundaries, a boundary-guided decoder network is proposed that uses a boundary mask, obtained from a dedicated boundary detection operator, as explicit supervision to guide the decoding learning process. The performance of the proposed method is evaluated on six challenging medical image segmentation datasets, demonstrating that CTO achieves state-of-the-art accuracy with competitive model complexity.
https://arxiv.org/abs/2305.00678
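The abstract does not specify the boundary detection operator, but boundary supervision of this kind is typically derived from a segmentation mask. One hedged sketch, marking a pixel as boundary if any 4-neighbor differs (a morphological-gradient-style test; the function name is hypothetical):

```python
import numpy as np

def boundary_map(mask):
    """Mark a pixel as boundary if any of its 4-neighbors differs from it
    in the binary segmentation mask (edge-padded at the image border)."""
    m = np.asarray(mask, dtype=int)
    padded = np.pad(m, 1, mode="edge")
    center = padded[1:-1, 1:-1]
    neighbors = [padded[:-2, 1:-1], padded[2:, 1:-1],
                 padded[1:-1, :-2], padded[1:-1, 2:]]
    boundary = np.zeros_like(center, dtype=bool)
    for nb in neighbors:
        boundary |= nb != center
    return boundary.astype(int)
```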
In this paper, we present the Semantic Boundary Conditioned Backbone (SBCB) framework, a simple yet effective model-agnostic training framework that boosts segmentation performance, especially around boundaries. Motivated by recent developments in improving semantic segmentation by incorporating boundaries as auxiliary tasks, we propose a multi-task framework that uses semantic boundary detection (SBD) as an auxiliary task. The SBCB framework exploits the nature of the SBD task, which is complementary to semantic segmentation, to improve the backbone of the segmentation head. We apply an SBD head that exploits the multi-scale features from the backbone, where the model learns low-level features in the earlier stages and high-level semantic understanding in the later stages. This head perfectly complements common semantic segmentation architectures, where the features from the later stages are used for classification. By only conditioning the backbone, we can improve semantic segmentation models without additional parameters during inference. Through extensive evaluations, we show the effectiveness of the SBCB framework by improving various popular segmentation heads and backbones by 0.5%-3.0% IoU on the Cityscapes dataset, with gains of 1.6%-4.1% in boundary F-scores. We also apply this framework to customized backbones and emerging vision transformer models, demonstrating its effectiveness there as well.
https://arxiv.org/abs/2304.09427
Short-form videos have exploded in popularity and dominate the new social media trends. Prevailing short-video platforms, e.g., Kuaishou (Kwai), TikTok, Instagram Reels, and YouTube Shorts, have changed the way we consume and create content. For video content creation and understanding, shot boundary detection (SBD) is one of the most essential components in various scenarios. In this work, we release a new public Short video sHot bOundary deTection dataset, named SHOT, consisting of 853 complete short videos and 11,606 shot annotations, with 2,716 high-quality shot boundary annotations in 200 test videos. Leveraging this new data wealth, we propose to optimize the model design for video SBD by conducting neural architecture search in a search space encapsulating various advanced 3D ConvNets and Transformers. Our proposed approach, named AutoShot, achieves higher F1 scores than previous state-of-the-art approaches, e.g., outperforming TransNetV2 by 4.2%, when derived and evaluated on our newly constructed SHOT dataset. Moreover, to validate the generalizability of the AutoShot architecture, we directly evaluate it on three other public datasets: ClipShots, BBC, and RAI, where the F1 scores of AutoShot outperform previous state-of-the-art approaches by 1.1%, 0.9%, and 1.2%, respectively. The SHOT dataset and code can be found at this https URL.
https://arxiv.org/abs/2304.06116
Named entity recognition (NER) is an important research problem in natural language processing. There are three types of NER tasks: flat, nested, and discontinuous entity recognition. Most previous sequential labeling models are task-specific, while recent years have witnessed the rise of generative models due to the advantage of unifying all NER tasks into the seq2seq model framework. Although generative models achieve promising performance, our pilot studies demonstrate that they are ineffective at detecting entity boundaries and estimating entity types. This paper proposes a multi-task Transformer that incorporates an entity boundary detection task into the named entity recognition task. More concretely, we achieve entity boundary detection by classifying the relations between tokens within the sentence. To improve the accuracy of entity-type mapping during decoding, we adopt an external knowledge base to calculate prior entity-type distributions and then incorporate this information into the model via self- and cross-attention mechanisms. We perform experiments on an extensive set of NER benchmarks, including two flat, three nested, and three discontinuous NER datasets. Experimental results show that our approach considerably improves the generative NER model's performance.
https://arxiv.org/abs/2303.10870
Generic event boundary detection (GEBD) aims to split video into chunks at a broad and diverse set of actions as humans naturally perceive event boundaries. In this study, we present an approach that considers the correlation between neighbor frames with pyramid feature maps in both spatial and temporal dimensions to construct a framework for localizing generic events in video. The features at multiple spatial dimensions of a pre-trained ResNet-50 are exploited with different views in the temporal dimension to form a temporal pyramid feature map. Based on that, the similarity between neighbor frames is calculated and projected to build a temporal pyramid similarity feature vector. A decoder with 1D convolution operations is used to decode these similarities to a new representation that incorporates their temporal relationship for later boundary score estimation. Extensive experiments conducted on the GEBD benchmark dataset show the effectiveness of our system and its variations, in which we outperformed the state-of-the-art approaches. Additional experiments on TAPOS dataset, which contains long-form videos with Olympic sport actions, demonstrated the effectiveness of our study compared to others.
https://arxiv.org/abs/2301.04288
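The core idea above, comparing neighbor frames to score candidate event boundaries, can be illustrated in miniature: score each position as one minus the cosine similarity of adjacent frame features. This is a deliberate simplification of the paper's pyramid construction, and the names are illustrative:

```python
import numpy as np

def boundary_scores(frame_features):
    """Score the gap between each pair of adjacent frames as one minus their
    cosine similarity: dissimilar neighbors suggest an event boundary."""
    F = np.asarray(frame_features, dtype=float)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)  # unit-normalize rows
    cos = np.sum(F[:-1] * F[1:], axis=1)
    return 1.0 - cos
```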
This paper presents the first attempt to learn semantic boundary detection using image-level class labels as supervision. Our method starts by estimating coarse areas of object classes through attentions drawn by an image classification network. Since boundaries will locate somewhere between such areas of different classes, our task is formulated as a multiple instance learning (MIL) problem, where pixels on a line segment connecting areas of two different classes are regarded as a bag of boundary candidates. Moreover, we design a new neural network architecture that can learn to estimate semantic boundaries reliably even with uncertain supervision given by the MIL strategy. Our network is used to generate pseudo semantic boundary labels of training images, which are in turn used to train fully supervised models. The final model trained with our pseudo labels achieves an outstanding performance on the SBD dataset, where it is as competitive as some of previous arts trained with stronger supervision.
https://arxiv.org/abs/2212.07579
Phoneme boundary detection has been studied due to its central role in various speech applications. In this work, we point out that this task needs to be addressed not only algorithmically but also at the level of the evaluation metric. To this end, we first propose a state-of-the-art phoneme boundary detector that operates in an autoregressive manner, dubbed SuperSeg. Experiments on the TIMIT and Buckeye corpora demonstrate that SuperSeg outperforms existing models by a significant margin in identifying phoneme boundaries. Furthermore, we note a limitation of the popular evaluation metric, the R-value, and propose new evaluation metrics that prevent each boundary from contributing to the evaluation multiple times. The proposed metrics reveal the weaknesses of non-autoregressive baselines and establish a reliable criterion suited to evaluating phoneme boundary detection.
https://arxiv.org/abs/2212.06387
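The R-value whose limitation the abstract discusses combines the boundary hit rate (HR) and an over-segmentation measure (OS) into a single score (Räsänen et al., 2009). A sketch of the standard formula, with both inputs given as fractions:

```python
import math

def r_value(hit_rate, over_segmentation):
    """R-value: combines the boundary hit rate (HR) and the over-segmentation
    measure (OS), both given as fractions, into a single score with 1.0 being
    a perfect segmentation."""
    hr, os_ = hit_rate, over_segmentation
    r1 = math.sqrt((1.0 - hr) ** 2 + os_ ** 2)
    r2 = (-os_ + hr - 1.0) / math.sqrt(2.0)
    return 1.0 - (abs(r1) + abs(r2)) / 2.0
```

Unlike a plain F-score, the R-value penalizes over-segmentation explicitly, which is part of why it became the popular choice the abstract critiques.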
We analyze the problem of detecting tree rings in microscopy images of shrub cross sections. This can be regarded as a special case of the instance segmentation task with several particularities, such as the concentric circular ring shape of the objects and high precision requirements, due to which existing methods do not perform sufficiently well. We propose a new iterative method, which we term Iterative Next Boundary Detection (INBD). It intuitively models the natural growth direction, starting from the center of the shrub cross section and detecting the next ring boundary in each iteration step. In our experiments, INBD shows superior performance to generic instance segmentation methods and is the only one with a built-in notion of chronological order. Our dataset and source code are available at this http URL.
https://arxiv.org/abs/2212.03022
Music structure analysis (MSA) systems aim to segment a song recording into non-overlapping sections with useful labels. Previous MSA systems typically predict abstract labels in a post-processing step and require the full context of the song. By contrast, we recently proposed a supervised framework, called "Music Structural Function Analysis" (MuSFA), that models and predicts meaningful labels like 'verse' and 'chorus' directly from audio, without requiring the full context of a song. However, the performance of this system depends on the amount and quality of training data. In this paper, we propose to repurpose a public dataset, HookTheory Lead Sheet Dataset (HLSD), to improve the performance. HLSD contains over 18K excerpts of music sections originally collected for studying automatic melody harmonization. We treat each excerpt as a partially labeled song and provide a label mapping, so that HLSD can be used together with other public datasets, such as SALAMI, RWC, and Isophonics. In cross-dataset evaluations, we find that including HLSD in training can improve state-of-the-art boundary detection and section labeling scores by ~3% and ~1% respectively.
https://arxiv.org/abs/2211.15787
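Boundary detection scores in MSA are conventionally computed by matching estimated boundaries to reference boundaries within a tolerance window (commonly 0.5 s). A minimal sketch of such matching, as a greedy simplification rather than the paper's actual evaluation code:

```python
def boundary_hits(reference, estimated, window=0.5):
    """Count estimated boundaries that fall within `window` seconds of a
    still-unmatched reference boundary (greedy one-to-one matching)."""
    refs = sorted(reference)
    matched = set()
    hits = 0
    for est in sorted(estimated):
        for i, ref in enumerate(refs):
            if i not in matched and abs(est - ref) <= window:
                matched.add(i)
                hits += 1
                break
    return hits
```

Precision and recall then follow as hits divided by the number of estimated and reference boundaries, respectively.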
Zero-shot relation triplet extraction (ZeroRTE) aims to extract relation triplets from unstructured texts, where the relation sets at the training and testing stages are disjoint. The previous state-of-the-art method handles this challenging task by leveraging pretrained language models to generate data as additional training samples, which increases the training cost and severely constrains the model performance. We tackle this task from a new perspective and propose a novel method named PCRED for ZeroRTE, with Potential Candidate Relation selection and Entity boundary Detection. The model adopts a relation-first paradigm, which first recognizes unseen relations through candidate relation selection. By this approach, the semantics of relations are naturally infused in the context, and entities are subsequently extracted based on the context and the semantics of relations. We evaluate our model on two ZeroRTE datasets. The experimental results show that our method consistently outperforms previous works. Moreover, our model does not rely on any additional data, which gives it the advantages of simplicity and effectiveness. Our code is available at https://anonymous.4open.science/r/PCRED.
https://arxiv.org/abs/2211.14477
Hard exudates (HE) are the most specific biomarker for retinal edema. Precise HE segmentation is vital for disease diagnosis and treatment, but automatic segmentation is challenged by the large variation in HE characteristics, including size, shape, and position, which makes it difficult to detect tiny lesions and lesion boundaries. Considering the complementary features between segmentation and super-resolution tasks, this paper proposes a novel hard exudate segmentation method named SS-MAF with an auxiliary super-resolution task, which brings in helpful detailed features for tiny lesion and boundary detection. Specifically, we propose a fusion module named Multi-scale Attention Fusion (MAF) for our dual-stream framework to effectively integrate the features of the two tasks. MAF first adopts a split spatial convolutional (SSC) layer for multi-scale feature extraction and then utilizes an attention mechanism to fuse the features of the two tasks. Considering pixel dependency, we introduce a region mutual information (RMI) loss to optimize the MAF module for tiny lesion and boundary detection. We evaluate our method on two public lesion datasets, IDRiD and E-Ophtha. Our method shows competitive performance with low-resolution inputs, both quantitatively and qualitatively. On the E-Ophtha dataset, the method achieves at least 3% higher dice and recall compared with state-of-the-art methods.
https://arxiv.org/abs/2211.09404
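The dice and recall scores the abstract reports are standard overlap metrics for binary masks; a minimal sketch of how they are computed:

```python
import numpy as np

def dice_and_recall(pred, target):
    """Dice = 2|P∩T| / (|P| + |T|) and recall = |P∩T| / |T| for binary masks."""
    p = np.asarray(pred).astype(bool)
    t = np.asarray(target).astype(bool)
    inter = np.logical_and(p, t).sum()
    return 2.0 * inter / (p.sum() + t.sum()), inter / t.sum()
```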
Deep learning methods have contributed substantially to the rapid advancement of medical image segmentation, the quality of which relies on the suitable design of loss functions. Popular loss functions, including the cross-entropy and dice losses, often fall short of boundary detection, thereby limiting high-resolution downstream applications such as automated diagnoses and procedures. We developed a novel loss function that is tailored to reflect the boundary information to enhance the boundary detection. As the contrast between segmentation and background regions along the classification boundary naturally induces heterogeneity over the pixels, we propose the piece-wise two-sample t-test augmented (PTA) loss that is infused with the statistical test for such heterogeneity. We demonstrate the improved boundary detection power of the PTA loss compared to benchmark losses without a t-test component.
https://arxiv.org/abs/2211.02419
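The statistical ingredient of the PTA loss is the two-sample t-test for heterogeneity across the classification boundary. A sketch of a Welch-style t-statistic on two pixel samples (the function name is illustrative, and the paper's piece-wise construction is not reproduced here):

```python
import math

def welch_t(sample_a, sample_b):
    """Two-sample (Welch) t-statistic: a large |t| between the pixel values
    on the two sides of a predicted boundary indicates genuine heterogeneity,
    i.e. the boundary likely separates different regions."""
    na, nb = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / na
    mean_b = sum(sample_b) / nb
    # Unbiased sample variances on each side of the boundary.
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (nb - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / na + var_b / nb)
```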
The present paper proposes a waveform boundary detection system for audio spoofing attacks containing partially manipulated segments. Partially spoofed/fake audio, where part of the utterance is replaced, either with synthetic or natural audio clips, has recently been reported as one scenario of audio deepfakes. As deepfakes can be a threat to social security, the detection of such spoofing audio is essential. Accordingly, we propose to address the problem with a deep learning-based frame-level detection system that can detect partially spoofed audio and locate the manipulated pieces. Our proposed method is trained and evaluated on data provided by the ADD2022 Challenge. We evaluate our detection model concerning various acoustic features and network configurations. As a result, our detection system achieves an equal error rate (EER) of 6.58% on the ADD2022 challenge test set, which is the best performance in partially spoofed audio detection systems that can locate manipulated clips.
https://arxiv.org/abs/2211.00226
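The equal error rate (EER) reported above is the operating point where the false-acceptance and false-rejection rates coincide. A minimal threshold-sweep sketch (the names and the discrete sweep are assumptions, not the challenge's official scoring code):

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Sweep a decision threshold over the scores and return the point where
    the false-acceptance rate (spoofs accepted) and false-rejection rate
    (genuine rejected) are closest; labels: 1 = genuine, 0 = spoof."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best = (1.0, 0.0)  # (FAR, FRR) at an accept-everything threshold
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)
        frr = np.mean(scores[labels == 1] < t)
        if abs(far - frr) < abs(best[0] - best[1]):
            best = (far, frr)
    return (best[0] + best[1]) / 2.0
```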