Recent span-based joint extraction models have demonstrated significant advantages in both entity recognition and relation extraction. These models treat text spans as candidate entities and span pairs as candidate relation tuples, achieving state-of-the-art results on datasets such as ADE. However, these models encounter a large number of non-entity spans or irrelevant span pairs during these tasks, which substantially impairs performance. To address this issue, this paper introduces a span-based multitask entity-relation joint extraction model. The approach employs multitask learning to alleviate the impact of negative samples on the entity and relation classifiers. Additionally, we leverage the Intersection over Union (IoU) concept to introduce positional information into the entity classifier, achieving span boundary detection. Furthermore, by incorporating the entity logits predicted by the entity classifier into the embedded representation of entity pairs, the semantic input to the relation classifier is enriched. Experimental results demonstrate that our proposed model can effectively mitigate the adverse effects of excessive negative samples on model performance. The model achieves F1 scores of 73.61\%, 53.72\%, and 83.72\% on three widely used public datasets, namely CoNLL04, SciERC, and ADE, respectively.
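Span-level IoU, used here to inject positional information into the entity classifier, is the 1D analogue of box IoU. A minimal sketch over (start, end) token spans with inclusive indices (the paper's exact formulation may differ; the function name is hypothetical):

```python
def span_iou(a, b):
    """IoU of two token spans given as (start, end) inclusive index pairs."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

# A near-miss with one shifted boundary still scores high,
# while a disjoint negative span scores zero.
print(span_iou((2, 5), (3, 5)))  # 0.75
print(span_iou((2, 5), (8, 9)))  # 0.0
```

A classifier can use such scores as soft boundary-quality targets instead of a hard entity/non-entity split.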
https://arxiv.org/abs/2309.09713
Accurate polyp delineation in colonoscopy is crucial for assisting diagnosis, guiding interventions, and treatment. However, current deep-learning approaches fall short due to integrity deficiency, which often manifests as missing lesion parts. This paper introduces the integrity concept in polyp segmentation at both macro and micro levels, aiming to alleviate integrity deficiency. Specifically, the model should distinguish entire polyps at the macro level and identify all components within polyps at the micro level. Our Integrity Capturing Polyp Segmentation (IC-PolypSeg) network utilizes lightweight backbones and three key components to improve integrity: 1) a pixel-wise feature redistribution (PFR) module captures global spatial correlations across channels in the final semantic-rich encoder features; 2) a cross-stage pixel-wise feature redistribution (CPFR) module dynamically fuses high-level semantics and low-level spatial features to capture contextual information; 3) a coarse-to-fine calibration module combines the PFR and CPFR modules to achieve precise boundary detection. Extensive experiments on 5 public datasets demonstrate that the proposed IC-PolypSeg outperforms 8 state-of-the-art methods, achieving higher precision at significantly lower computational cost. IC-PolypSeg-EF0 employs 300 times fewer parameters than PraNet while achieving a real-time processing speed of 235 FPS. Importantly, IC-PolypSeg reduces the false-negative ratio on five datasets, meeting clinical requirements.
https://arxiv.org/abs/2309.08234
Temporal action segmentation is typically achieved by discovering the dramatic variances in global visual descriptors. In this paper, we explore the merits of local features by proposing the unsupervised framework of Object-centric Temporal Action Segmentation (OTAS). Broadly speaking, OTAS consists of self-supervised global and local feature extraction modules as well as a boundary selection module that fuses the features and detects salient boundaries for action segmentation. As a second contribution, we discuss the pros and cons of existing frame-level and boundary-level evaluation metrics. Through extensive experiments, we find OTAS is superior to the previous state-of-the-art method by $41\%$ on average in terms of our recommended F1 score. Surprisingly, OTAS even outperforms the ground-truth human annotations in the user study. Moreover, OTAS is efficient enough to allow real-time inference.
https://arxiv.org/abs/2309.06276
Music Structure Analysis (MSA) is the task of identifying the musical segments that compose a music track and possibly labeling them based on their similarity. In this paper, we propose a supervised approach to music boundary detection in which we simultaneously learn features and convolution kernels. To this end, we jointly optimize two losses: (1) a loss based on the Self-Similarity Matrix (SSM) obtained with the learned features, denoted SSM-loss, and (2) a loss based on the novelty score obtained by applying the learned kernels to the estimated SSM, denoted novelty-loss. We also demonstrate that relative feature learning, through self-attention, is beneficial for the task of MSA. Finally, we compare the performance of our approach to previously proposed approaches on the standard RWC-Pop dataset and various subsets of SALAMI.
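The SSM-plus-novelty pipeline both losses build on follows the classic checkerboard-kernel novelty computation. A pure-Python sketch with a fixed checkerboard kernel standing in for the learned kernels (feature vectors and kernel size are toy assumptions):

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ssm(features):
    """Self-similarity matrix of a list of per-frame feature vectors."""
    n = len(features)
    return [[cosine(features[i], features[j]) for j in range(n)] for i in range(n)]

def novelty(S, half=2):
    """Slide a (2*half)x(2*half) checkerboard kernel along the SSM diagonal:
    same-side quadrants count positively, cross quadrants negatively."""
    n = len(S)
    scores = [0.0] * n
    for t in range(half, n - half):
        s = 0.0
        for i in range(-half, half):
            for j in range(-half, half):
                sign = 1.0 if (i < 0) == (j < 0) else -1.0
                s += sign * S[t + i][t + j]
        scores[t] = s
    return scores

# Two homogeneous segments: the novelty curve peaks at the transition.
feats = [[1.0, 0.0]] * 6 + [[0.0, 1.0]] * 6
scores = novelty(ssm(feats))
peak = max(range(len(scores)), key=lambda t: scores[t])
print(peak)  # 6
```

The paper's contribution is learning both the features behind `ssm` and the kernel inside `novelty` jointly, rather than fixing them as above.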
https://arxiv.org/abs/2309.02243
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.
https://arxiv.org/abs/2308.12635
We propose Boundary-RL, a novel weakly supervised segmentation method that utilises only patch-level labels for training. We envision segmentation as a boundary detection problem, rather than a pixel-level classification as in previous works. This outlook on segmentation may allow for boundary delineation under challenging scenarios, such as where noise artefacts are present within the region-of-interest (ROI) boundaries and traditional pixel-level classification-based weakly supervised methods may not be able to effectively segment the ROI. Of particular interest, ultrasound images, where intensity values represent acoustic impedance differences between boundaries, may also benefit from the boundary delineation approach. Our method uses reinforcement learning to train a controller function to localise boundaries of ROIs using a reward derived from a pre-trained boundary-presence classifier. The classifier indicates when an object boundary is encountered within a patch, as the controller modifies the patch location in a sequential Markov decision process. The classifier itself is trained using only binary patch-level labels of object presence, which are the only labels used during training of the entire boundary delineation framework, and serves as a weak signal to inform the boundary delineation. The use of a controller function ensures that a sliding window over the entire image is not necessary. It also prevents possible false-positive or -negative cases by minimising the number of patches passed to the boundary-presence classifier. We evaluate our proposed approach on the clinically relevant task of prostate gland segmentation on trans-rectal ultrasound images. We show improved performance compared to other tested weakly supervised methods that use the same labels, e.g., multiple instance learning.
https://arxiv.org/abs/2308.11376
This paper introduces our system designed for Track 2, which focuses on locating manipulated regions, in the second Audio Deepfake Detection Challenge (ADD 2023). Our approach involves the utilization of multiple detection systems to identify splicing regions and determine their authenticity. Specifically, we train and integrate two frame-level systems: one for boundary detection and the other for deepfake detection. Additionally, we employ a third VAE model trained exclusively on genuine data to determine the authenticity of a given audio clip. Through the fusion of these three systems, our top-performing solution for the ADD challenge achieves an impressive 82.23% sentence accuracy and an F1 score of 60.66%. This results in a final ADD score of 0.6713, securing the first rank in Track 2 of ADD 2023.
https://arxiv.org/abs/2308.10281
Due to the high similarity between camouflaged instances and the background, the recently proposed camouflaged instance segmentation (CIS) faces challenges in accurate localization and instance segmentation. To this end, inspired by query-based transformers, we propose a unified query-based multi-task learning framework for camouflaged instance segmentation, termed UQFormer, which builds a set of mask queries and a set of boundary queries to learn a shared composed query representation and efficiently integrates global camouflaged object region and boundary cues, for simultaneous instance segmentation and instance boundary detection in camouflaged scenarios. Specifically, we design a composed query learning paradigm that learns a shared representation to capture object region and boundary features by the cross-attention interaction of mask queries and boundary queries in the designed multi-scale unified learning transformer decoder. Then, we present a transformer-based multi-task learning framework for simultaneous camouflaged instance segmentation and camouflaged instance boundary detection based on the learned composed query representation, which also forces the model to learn a strong instance-level query representation. Notably, our model views the instance segmentation as a query-based direct set prediction problem, without other post-processing such as non-maximal suppression. Compared with 14 state-of-the-art approaches, our UQFormer significantly improves the performance of camouflaged instance segmentation. Our code will be available at this https URL.
https://arxiv.org/abs/2308.07392
Human-AI collaborative writing has been greatly facilitated with the help of modern large language models (LLM), e.g., ChatGPT. While admitting the convenience brought by technology advancement, educators also have concerns that students might leverage LLM to partially complete their writing assignment and pass off the human-AI hybrid text as their original work. Driven by such concerns, in this study, we investigated the automatic detection of Human-AI hybrid text in education, where we formalized the hybrid text detection as a boundary detection problem, i.e., identifying the transition points between human-written content and AI-generated content. We constructed a hybrid essay dataset by partially removing sentences from the original student-written essays and then instructing ChatGPT to fill in for the incomplete essays. Then we proposed a two-step detection approach where we (1) Separated AI-generated content from human-written content during the embedding learning process; and (2) Calculated the distances between every two adjacent prototypes (a prototype is the mean of a set of consecutive sentences from the hybrid text in the embedding space) and assumed that the boundaries exist between the two prototypes that have the furthest distance from each other. Through extensive experiments, we summarized the following main findings: (1) The proposed approach consistently outperformed the baseline methods across different experiment settings; (2) The embedding learning process (i.e., step 1) can significantly boost the performance of the proposed approach; (3) When detecting boundaries for single-boundary hybrid essays, the performance of the proposed approach could be enhanced by adopting a relatively large prototype size, leading to a $22$\% improvement (against the second-best baseline method) in the in-domain setting and an $18$\% improvement in the out-of-domain setting.
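Step (2) of the approach, placing the boundary at the largest gap between adjacent prototypes, can be sketched as follows (toy 2-D vectors stand in for the learned sentence embeddings; names and the prototype size are hypothetical):

```python
import math

def prototypes(embs, size):
    """Mean-pool consecutive windows of sentence embeddings into prototypes."""
    protos = []
    for i in range(0, len(embs) - size + 1, size):
        window = embs[i:i + size]
        protos.append([sum(col) / size for col in zip(*window)])
    return protos

def boundary_between(embs, size=2):
    """Index (in prototype units) of the largest adjacent-prototype distance."""
    protos = prototypes(embs, size)
    dists = [math.dist(a, b) for a, b in zip(protos, protos[1:])]
    return max(range(len(dists)), key=lambda i: dists[i])

# Sentences 0-3 cluster near (0, 0) (say, human-written), 4-7 near (5, 5)
# (say, AI-generated): the largest gap falls between prototypes 1 and 2.
embs = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [0.1, 0.1],
        [5.0, 5.1], [5.1, 5.0], [5.0, 5.0], [4.9, 5.1]]
print(boundary_between(embs))  # 1
```

Finding (3) in the abstract corresponds to choosing a larger `size` for single-boundary essays, which smooths out within-author variation before the gap is measured.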
https://arxiv.org/abs/2307.12267
Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video. A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)", which results in worse performance with lower SNR. Prior works have addressed this challenge using sophisticated techniques. In this paper, we propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection. The multi-scale neighboring attention restricts each video token to only aggregate visual contexts from its neighbor, enabling the extraction of the most distinguishing information with multi-scale feature hierarchies from high-ratio noises. The zoom-in boundary detection then focuses on local-wise discrimination of the selected top candidates for fine-grained grounding adjustment. With an end-to-end training strategy, our model achieves competitive performance on different TVG benchmarks, while also having the advantage of faster inference speed and lighter model parameters, thanks to its lightweight architecture.
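The neighborhood restriction at the heart of multi-scale neighboring attention amounts to a banded attention mask per scale; a sketch with hypothetical window sizes:

```python
def neighbor_mask(n, window):
    """Boolean attention mask: token i may attend to j iff |i - j| <= window."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

# Stacking masks with growing windows yields a multi-scale feature hierarchy:
# small windows pick up sharp local changes, large ones broader context.
masks = {w: neighbor_mask(6, w) for w in (1, 2, 4)}
print(sum(masks[1][0]))  # token 0 with window 1 attends to 2 tokens
```

Applying such a mask inside standard attention (setting disallowed logits to negative infinity before the softmax) is what keeps each video token from aggregating high-ratio noise from distant frames.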
https://arxiv.org/abs/2307.10567
Biomedical named entity recognition (BNER) serves as the foundation for numerous biomedical text mining tasks. Unlike general NER, BNER requires a comprehensive grasp of the domain, and incorporating external knowledge beyond the training data poses a significant challenge. In this study, we propose a novel BNER framework called DMNER. Leveraging the existing entity representation model SAPBERT, we tackle BNER as a two-step process: entity boundary detection and biomedical entity matching. DMNER is applicable across multiple NER scenarios: 1) In supervised NER, we observe that DMNER effectively rectifies the output of baseline NER models, further enhancing performance. 2) In distantly supervised NER, combining MRC and AutoNER as span boundary detectors enables DMNER to achieve satisfactory results. 3) For training NER by merging multiple datasets, we adopt a framework similar to DS-NER but additionally leverage ChatGPT to obtain high-quality phrases during training. Through extensive experiments conducted on 10 benchmark datasets, we demonstrate the versatility and effectiveness of DMNER.
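The second step, biomedical entity matching, is essentially nearest-neighbor search of candidate span embeddings against dictionary entity embeddings such as those SAPBERT provides. A toy sketch with made-up vectors and a hypothetical rejection threshold:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def match_entity(span_vec, dictionary, threshold=0.8):
    """Return the dictionary entity most similar to the candidate span,
    or None if nothing clears the threshold (span judged a non-entity)."""
    name, best = max(dictionary.items(), key=lambda kv: cosine(span_vec, kv[1]))
    return name if cosine(span_vec, best) >= threshold else None

# Hypothetical 3-d embeddings; real SAPBERT vectors are 768-dimensional.
dictionary = {"aspirin": [0.9, 0.1, 0.0], "headache": [0.0, 0.2, 0.9]}
print(match_entity([0.8, 0.2, 0.1], dictionary))  # aspirin
print(match_entity([0.3, 0.3, 0.3], dictionary))  # None
```

The threshold is what lets matching double as a filter: spans proposed by the boundary detector that resemble no dictionary entity are discarded.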
https://arxiv.org/abs/2306.15736
The Generic Event Boundary Detection (GEBD) task aims to build a model for segmenting videos into segments by detecting general event boundaries applicable to various classes. In this paper, based on last year's MAE-GEBD method, we have improved our model performance on the GEBD task by adjusting the data processing strategy and loss function. Based on last year's approach, we extended the application of pseudo-label to a larger dataset and made many experimental attempts. In addition, we applied focal loss to concentrate more on difficult samples and improved our model performance. Finally, we improved the segmentation alignment strategy used last year, and dynamically adjusted the segmentation alignment method according to the boundary density and duration of the video, so that our model can be more flexible and fully applicable in different situations. With our method, we achieve an F1 score of 86.03% on the Kinetics-GEBD test set, which is a 0.09% improvement in the F1 score compared to our 2022 Kinetics-GEBD method.
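The focal loss applied here to concentrate on difficult samples is the standard formulation of Lin et al.; a minimal binary sketch with an assumed focusing parameter gamma = 2:

```python
import math

def focal_loss(p, y, gamma=2.0):
    """Binary focal loss: down-weights well-classified samples.
    p: predicted boundary probability, y: 1 for a boundary frame, else 0."""
    pt = p if y == 1 else 1.0 - p
    return -((1.0 - pt) ** gamma) * math.log(pt)

# An easy boundary (p = 0.9) contributes far less than a hard one (p = 0.3),
# so gradient updates concentrate on the ambiguous frames.
print(focal_loss(0.9, 1))  # ~0.001
print(focal_loss(0.3, 1))  # ~0.59
```

With gamma = 0 this reduces to plain cross-entropy; larger gamma suppresses easy samples more aggressively.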
https://arxiv.org/abs/2306.15704
Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily influencing the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, considering the complex and different sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130'000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
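A naive punctuation-based splitter (a hypothetical baseline, not one of the paper's models) illustrates why legal text is hard for SBD: citation abbreviations such as "v." trigger false boundaries.

```python
import re

def naive_split(text):
    """Split on sentence-final punctuation followed by whitespace and an
    uppercase letter: a common heuristic that legal prose defeats."""
    return re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)

text = "Plaintiff relied on Smith v. Jones. The court disagreed."
print(len(naive_split(text)))  # 3 pieces instead of 2: "v." is a false boundary
```

Sequence models such as the paper's CRF and transformer pipelines avoid this by conditioning on surrounding context rather than on punctuation alone.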
https://arxiv.org/abs/2305.01211
Medical image segmentation is a fundamental task in the community of medical image analysis. In this paper, a novel network architecture, referred to as Convolution, Transformer, and Operator (CTO), is proposed. CTO employs a combination of Convolutional Neural Networks (CNNs), Vision Transformer (ViT), and an explicit boundary detection operator to achieve high recognition accuracy while maintaining an optimal balance between accuracy and efficiency. The proposed CTO follows the standard encoder-decoder segmentation paradigm, where the encoder network incorporates a popular CNN backbone for capturing local semantic information, and a lightweight ViT assistant for integrating long-range dependencies. To enhance the learning capacity on boundary, a boundary-guided decoder network is proposed that uses a boundary mask obtained from a dedicated boundary detection operator as explicit supervision to guide the decoding learning process. The performance of the proposed method is evaluated on six challenging medical image segmentation datasets, demonstrating that CTO achieves state-of-the-art accuracy with a competitive model complexity.
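The explicit boundary supervision can be derived from the segmentation ground truth itself: a pixel is a boundary pixel if any 4-neighbor carries a different label. This is a simplified stand-in for the paper's dedicated boundary detection operator, not its actual implementation:

```python
def boundary_mask(label):
    """Mark pixels whose 4-neighborhood contains a different label."""
    h, w = len(label), len(label[0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and label[ny][nx] != label[y][x]:
                    mask[y][x] = 1
                    break
    return mask

seg = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 0, 0]]
for row in boundary_mask(seg):
    print(row)
```

The resulting mask supervises the boundary-guided decoder alongside the usual region-level segmentation loss.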
https://arxiv.org/abs/2305.00678
In this paper, we present the Semantic Boundary Conditioned Backbone (SBCB) framework, a simple yet effective training framework that is model-agnostic and boosts segmentation performance, especially around boundaries. Motivated by recent work on improving semantic segmentation by incorporating boundaries as auxiliary tasks, we propose a multi-task framework that uses semantic boundary detection (SBD) as an auxiliary task. The SBCB framework exploits the nature of the SBD task, which is complementary to semantic segmentation, to improve the backbone of the segmentation head. We apply an SBD head that exploits the multi-scale features from the backbone, where the model learns low-level features in the earlier stages and high-level semantic understanding in the later stages. This head perfectly complements common semantic segmentation architectures, where the features from the later stages are used for classification. By conditioning only the backbone, we can improve semantic segmentation models without additional parameters during inference. Through extensive evaluations, we show the effectiveness of the SBCB framework, improving various popular segmentation heads and backbones by 0.5% to 3.0% IoU on the Cityscapes dataset and by 1.6% to 4.1% in boundary F-scores. We also apply the framework to customized backbones and emerging vision transformer models, demonstrating its effectiveness there as well.
https://arxiv.org/abs/2304.09427
Short-form videos have exploded in popularity and dominate new social media trends. Prevailing short-video platforms, e.g., Kuaishou (Kwai), TikTok, Instagram Reels, and YouTube Shorts, have changed the way we consume and create content. For video content creation and understanding, shot boundary detection (SBD) is one of the most essential components in various scenarios. In this work, we release a new public Short video sHot bOundary deTection dataset, named SHOT, consisting of 853 complete short videos and 11,606 shot annotations, with 2,716 high-quality shot boundary annotations in 200 test videos. Leveraging this new data wealth, we propose to optimize the model design for video SBD by conducting neural architecture search in a search space encapsulating various advanced 3D ConvNets and Transformers. Our proposed approach, named AutoShot, achieves higher F1 scores than previous state-of-the-art approaches, e.g., outperforming TransNetV2 by 4.2% when derived and evaluated on our newly constructed SHOT dataset. Moreover, to validate the generalizability of the AutoShot architecture, we directly evaluate it on three other public datasets: ClipShots, BBC, and RAI, where the F1 scores of AutoShot outperform previous state-of-the-art approaches by 1.1%, 0.9%, and 1.2%, respectively. The SHOT dataset and code can be found at this https URL.
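For contrast with learned SBD models such as AutoShot, the classic heuristic baseline thresholds the color-histogram difference between adjacent frames. A toy grayscale sketch (bin count and threshold are assumptions, and frames are flat pixel lists):

```python
def histogram(frame, bins=4, levels=256):
    """Coarse intensity histogram of a flat list of pixel values."""
    h = [0] * bins
    for px in frame:
        h[px * bins // levels] += 1
    return h

def shot_boundaries(frames, threshold=0.5):
    """Indices t where the normalized L1 histogram distance between
    frames t-1 and t exceeds the threshold."""
    cuts = []
    for t in range(1, len(frames)):
        a, b = histogram(frames[t - 1]), histogram(frames[t])
        dist = sum(abs(x - y) for x, y in zip(a, b)) / (2 * len(frames[t]))
        if dist > threshold:
            cuts.append(t)
    return cuts

dark = [10] * 16     # one "shot": dark frames
bright = [200] * 16  # a second shot: bright frames
print(shot_boundaries([dark, dark, dark, bright, bright]))  # [3]
```

Such heuristics handle hard cuts but miss gradual transitions, which is part of what motivates the learned 3D ConvNet and Transformer search space above.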
https://arxiv.org/abs/2304.06116
Named entity recognition (NER) is an important research problem in natural language processing. There are three types of NER tasks, including flat, nested and discontinuous entity recognition. Most previous sequential labeling models are task-specific, while recent years have witnessed the rising of generative models due to the advantage of unifying all NER tasks into the seq2seq model framework. Although achieving promising performance, our pilot studies demonstrate that existing generative models are ineffective at detecting entity boundaries and estimating entity types. This paper proposes a multi-task Transformer, which incorporates an entity boundary detection task into the named entity recognition task. More concretely, we achieve entity boundary detection by classifying the relations between tokens within the sentence. To improve the accuracy of entity-type mapping during decoding, we adopt an external knowledge base to calculate the prior entity-type distributions and then incorporate the information into the model via the self and cross-attention mechanisms. We perform experiments on an extensive set of NER benchmarks, including two flat, three nested, and three discontinuous NER datasets. Experimental results show that our approach considerably improves the generative NER model's performance.
https://arxiv.org/abs/2303.10870
Generic event boundary detection (GEBD) aims to split video into chunks at a broad and diverse set of actions as humans naturally perceive event boundaries. In this study, we present an approach that considers the correlation between neighbor frames with pyramid feature maps in both spatial and temporal dimensions to construct a framework for localizing generic events in video. The features at multiple spatial dimensions of a pre-trained ResNet-50 are exploited with different views in the temporal dimension to form a temporal pyramid feature map. Based on that, the similarity between neighbor frames is calculated and projected to build a temporal pyramid similarity feature vector. A decoder with 1D convolution operations is used to decode these similarities to a new representation that incorporates their temporal relationship for later boundary score estimation. Extensive experiments conducted on the GEBD benchmark dataset show the effectiveness of our system and its variations, in which we outperformed the state-of-the-art approaches. Additional experiments on TAPOS dataset, which contains long-form videos with Olympic sport actions, demonstrated the effectiveness of our study compared to others.
https://arxiv.org/abs/2301.04288
This paper presents the first attempt to learn semantic boundary detection using image-level class labels as supervision. Our method starts by estimating coarse areas of object classes through attentions drawn by an image classification network. Since boundaries will locate somewhere between such areas of different classes, our task is formulated as a multiple instance learning (MIL) problem, where pixels on a line segment connecting areas of two different classes are regarded as a bag of boundary candidates. Moreover, we design a new neural network architecture that can learn to estimate semantic boundaries reliably even with uncertain supervision given by the MIL strategy. Our network is used to generate pseudo semantic boundary labels of training images, which are in turn used to train fully supervised models. The final model trained with our pseudo labels achieves an outstanding performance on the SBD dataset, where it is as competitive as some of previous arts trained with stronger supervision.
https://arxiv.org/abs/2212.07579
Phoneme boundary detection has been studied due to its central role in various speech applications. In this work, we point out that this task needs to be addressed not only algorithmically but also at the level of the evaluation metric. To this end, we first propose a state-of-the-art phoneme boundary detector that operates in an autoregressive manner, dubbed SuperSeg. Experiments on the TIMIT and Buckeye corpora demonstrate that SuperSeg outperforms existing models at identifying phoneme boundaries by a significant margin. Furthermore, we note a limitation of the popular evaluation metric, R-value, and propose new evaluation metrics that prevent each boundary from contributing to the evaluation multiple times. The proposed metrics reveal the weaknesses of non-autoregressive baselines and establish a reliable criterion suited to evaluating phoneme boundary detection.
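The double-counting problem the paper identifies disappears once boundary matching is made one-to-one. A hedged sketch of tolerance-window precision/recall with greedy one-to-one matching (an illustration of the principle, not the paper's exact proposed metrics):

```python
def boundary_f1(ref, hyp, tol=0.02):
    """Precision/recall/F1 where each reference boundary matches at most once."""
    used = set()
    hits = 0
    for h in hyp:
        # nearest still-unmatched reference boundary within the tolerance window
        cands = [r for r in ref if r not in used and abs(r - h) <= tol]
        if cands:
            used.add(min(cands, key=lambda r: abs(r - h)))
            hits += 1
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Three predictions crowd one true boundary (in seconds): only one is a hit,
# whereas a metric without one-to-one matching would credit all three.
p, r, f1 = boundary_f1(ref=[1.00, 2.00], hyp=[0.99, 1.00, 1.01])
print(round(p, 2), round(r, 2))  # 0.33 0.5
```

Without the `used` set, over-segmenting models are rewarded for scattering predictions around each true boundary, which is exactly the failure mode of R-value the abstract describes.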
https://arxiv.org/abs/2212.06387