Sign Language Recognition (SLR) has garnered significant attention from researchers in recent years, particularly the intricate domain of Continuous Sign Language Recognition (CSLR), which presents heightened complexity compared to Isolated Sign Language Recognition (ISLR). One of the prominent challenges in CSLR is accurately detecting the boundaries of isolated signs within a continuous video stream. Additionally, the reliance on handcrafted features in existing models poses a challenge to achieving optimal accuracy. To surmount these challenges, we propose a novel approach utilizing a Transformer-based model. Unlike traditional models, our approach focuses on enhancing accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. The training process uses isolated sign videos, where hand keypoint features extracted from the input video are enriched by the Transformer model. Subsequently, these enriched features are forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos. Our model is evaluated on two distinct datasets, each including continuous signs and their corresponding isolated signs, and demonstrates promising results.
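The boundary-detection step can be pictured with a small sketch. This is not the paper's implementation; the per-window predictions, labels, and the `min_run` smoothing below are invented for illustration of the merge-and-filter idea:

```python
# Illustrative sketch (not the paper's exact method): classify each window of a
# continuous video with a trained isolated-sign model, then merge runs of
# identical predictions into sign segments, discarding very short runs as noise.

def segment_signs(window_predictions, min_run=3):
    """Merge per-window class predictions into (label, start, end) segments,
    dropping runs shorter than min_run."""
    segments = []
    start = 0
    for i in range(1, len(window_predictions) + 1):
        if i == len(window_predictions) or window_predictions[i] != window_predictions[start]:
            if i - start >= min_run:
                segments.append((window_predictions[start], start, i))
            start = i
    return segments

preds = ["hello"] * 5 + ["thanks"] * 4 + ["bye"] * 2 + ["home"] * 6
print(segment_signs(preds))  # the short "bye" run is dropped as noise
```
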
https://arxiv.org/abs/2402.14720
This article investigates the possibility of using the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that, as a measure of uncertainty, the entropy should increase in the proximity of a transition between two segments that are well modelled (known) by the recognition network. The advantage of this measure is its simplicity, as the posterior probabilities of each class are readily available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to neural-network-based procedures. The different methods are compared with respect to their precision, measured as the ratio between the number C of predicted boundaries that fall within 10 or 20 msec of a reference boundary and the total number of predicted boundaries, and their recall, measured as the ratio between C and the total number of reference boundaries.
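The quantities involved are simple to compute. A minimal sketch, with synthetic posteriors and an invented threshold, of entropy-based boundary prediction together with the precision and recall measures defined above:

```python
import math

def entropy(posteriors):
    """Shannon entropy of a class posterior distribution (in nats)."""
    return -sum(p * math.log(p) for p in posteriors if p > 0)

def predict_boundaries(frames, threshold=0.5):
    """Flag a boundary at every local maximum of the per-frame entropy
    that exceeds the threshold (threshold value is an assumption here)."""
    h = [entropy(f) for f in frames]
    return [t for t in range(1, len(h) - 1)
            if h[t] > threshold and h[t] >= h[t - 1] and h[t] >= h[t + 1]]

def precision_recall(predicted, reference, tol=2):
    """Precision = C / #predicted, recall = C / #reference, where C counts
    predicted boundaries within `tol` frames of some reference boundary."""
    c = sum(1 for p in predicted if any(abs(p - r) <= tol for r in reference))
    return c / len(predicted), c / len(reference)

# Confident posteriors inside segments, an uncertain one at the transition.
frames = [[0.9, 0.05, 0.05]] * 4 + [[0.4, 0.4, 0.2]] + [[0.05, 0.9, 0.05]] * 4
pred = predict_boundaries(frames)
print(pred, precision_recall(pred, reference=[4]))
```
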
https://arxiv.org/abs/2401.05717
Recent advancements in Large Language Models (LLMs) have led to high-quality Machine-Generated Text (MGT), giving rise to countless new use cases and applications. However, easy access to LLMs is posing new challenges due to misuse. To address malicious usage, researchers have released datasets to effectively train models on MGT-related tasks. Similar strategies are used to compile these datasets, but no tool currently unifies them. In this scenario, we introduce TextMachina, a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, or boundary detection. It provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets, such as LLM integrations, prompt templating, and bias mitigation. The quality of the datasets generated by TextMachina has been assessed in previous works, including shared tasks where more than one hundred teams trained robust MGT detectors.
https://arxiv.org/abs/2401.03946
In this work, we investigate the use of curiosity with replay buffers to improve offline multi-task continual reinforcement learning when tasks, which are defined by the non-stationarity in the environment, are unlabeled and not evenly exposed to the learner over time. In particular, we investigate the use of curiosity both as a tool for task-boundary detection and as a priority metric for retaining old transition tuples, which we use to propose two different buffers. Firstly, we propose a Hybrid Reservoir Buffer with Task Separation (HRBTS), where curiosity is used to detect task boundaries that are not known due to the task-agnostic nature of the problem. Secondly, by using curiosity as a priority metric for retaining old transition tuples, a Hybrid Curious Buffer (HCB) is proposed. We ultimately show that these buffers, in conjunction with regular reinforcement learning algorithms, can be used to alleviate the catastrophic forgetting issue suffered by state-of-the-art replay buffers when the agent's exposure to tasks is uneven over time. We evaluate catastrophic forgetting and the efficiency of our proposed buffers against the latest works, such as the Hybrid Reservoir Buffer (HRB) and the Multi-Time Scale Replay Buffer (MTR), in three different continual reinforcement learning settings. Experiments were performed on classical control tasks and the Metaworld environment, and show that our proposed replay buffers display better immunity to catastrophic forgetting compared to existing works in most of the settings.
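As a rough illustration of curiosity-prioritised retention (not the paper's HCB), a toy buffer that, once full, evicts its least-curious transition in favour of a more curious incoming one might look like:

```python
import random

class CuriousBuffer:
    """Toy sketch of a curiosity-prioritised replay buffer: when full, the
    stored transition with the lowest curiosity score is replaced if the
    incoming transition is more curious. Purely illustrative."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # list of (curiosity, transition) pairs

    def add(self, curiosity, transition):
        if len(self.items) < self.capacity:
            self.items.append((curiosity, transition))
            return
        lowest = min(range(len(self.items)), key=lambda i: self.items[i][0])
        if curiosity > self.items[lowest][0]:
            self.items[lowest] = (curiosity, transition)

    def sample(self, k):
        """Uniform sampling from the retained transitions."""
        return random.sample(self.items, k)

buf = CuriousBuffer(capacity=3)
for c, t in [(0.1, "a"), (0.9, "b"), (0.2, "c"), (0.8, "d"), (0.05, "e")]:
    buf.add(c, t)
print(sorted(t for _, t in buf.items))  # low-curiosity transitions evicted
```
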
https://arxiv.org/abs/2312.03177
Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotation, explicit-supervision methods, i.e., those generating pseudo-temporal boundaries for training, have achieved great success. However, the data augmentations in these methods may disrupt critical temporal information, yielding poor pseudo boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose EtC (Expand then Clarify), which first uses the additional information to expand the initial incomplete pseudo boundaries and subsequently refines these expanded ones into precise boundaries. Motivated by video continuity, i.e., visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within the initial pseudo boundaries, yielding more comprehensive descriptions for the expanded boundaries. To further reduce the noise of the expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective, learning to balance the incomplete yet clean (initial) and comprehensive yet noisy (expanded) boundaries and obtain more precise ones. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.
https://arxiv.org/abs/2312.02483
Precise and rapid delineation of sharp boundaries and robust semantics is essential for numerous downstream robotic tasks, such as robot grasping and manipulation, real-time semantic mapping, and online sensor calibration performed on edge computing units. Although boundary detection and semantic segmentation are complementary tasks, most studies focus on lightweight models for semantic segmentation but overlook the critical role of boundary detection. In this work, we introduce Mobile-Seed, a lightweight, dual-task framework tailored for simultaneous semantic segmentation and boundary detection. Our framework features a two-stream encoder, an active fusion decoder (AFD) and a dual-task regularization approach. The encoder is divided into two pathways: one captures category-aware semantic information, while the other discerns boundaries from multi-scale features. The AFD module dynamically adapts the fusion of semantic and boundary information by learning channel-wise relationships, allowing for precise weight assignment of each channel. Furthermore, we introduce a regularization loss to mitigate the conflicts in dual-task learning and deep diversity supervision. Compared to existing methods, the proposed Mobile-Seed offers a lightweight framework to simultaneously improve semantic segmentation performance and accurately locate object boundaries. Experiments on the Cityscapes dataset have shown that Mobile-Seed achieves notable improvement over the state-of-the-art (SOTA) baseline by 2.2 percentage points (pp) in mIoU and 4.2 pp in mF-score, while maintaining an online inference speed of 23.9 frames-per-second (FPS) with 1024x2048 resolution input on an RTX 2080 Ti GPU. Additional experiments on CamVid and PASCAL Context datasets confirm our method's generalizability. Code and additional results are publicly available at \url{this https URL}.
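The channel-wise fusion idea can be pictured in a few lines. This is purely illustrative, not the AFD module itself: the pooling, the scalar gate form, and the stand-in parameters below are all assumptions:

```python
import math

def channel_gated_fuse(semantic, boundary, gate_w, gate_b):
    """Per channel, pool the two feature streams, pass the pooled value
    through a learned scalar gate, and form a convex combination of the
    semantic and boundary features. gate_w / gate_b stand in for learned
    parameters; this is a toy stand-in for channel-wise fusion."""
    fused = []
    for c, (s_map, b_map) in enumerate(zip(semantic, boundary)):
        pooled = sum(s + b for s, b in zip(s_map, b_map)) / len(s_map)
        g = 1.0 / (1.0 + math.exp(-(gate_w[c] * pooled + gate_b[c])))  # sigmoid gate
        fused.append([g * s + (1.0 - g) * b for s, b in zip(s_map, b_map)])
    return fused

semantic = [[0.2, 0.8, 0.5], [0.9, 0.1, 0.4]]   # 2 channels, 3 "pixels" each
boundary = [[0.7, 0.3, 0.6], [0.2, 0.8, 0.5]]
out = channel_gated_fuse(semantic, boundary, gate_w=[1.0, -1.0], gate_b=[0.0, 0.5])
print(out)
```

Because each output is a convex combination, every fused value stays between the corresponding semantic and boundary values; the gate decides which stream dominates per channel.
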
https://arxiv.org/abs/2311.12651
Due to the rapid development of text generation models, people increasingly often encounter texts that start out written by a human but then continue as the machine-generated output of a large language model. Detecting the boundary between the human-written and machine-generated parts of such texts is a very challenging problem that has received little attention in the literature. In this work, we consider and compare a number of different approaches to this artificial text boundary detection problem, comparing several predictors built on features of different natures. We show that supervised fine-tuning of the RoBERTa model works well for this task in general but fails to generalize in important cross-domain and cross-generator settings, demonstrating a tendency to overfit to spurious properties of the data. We then propose novel approaches based on features extracted from a frozen language model's embeddings that are able to outperform both the human accuracy level and previously considered baselines on the Real or Fake Text benchmark. Moreover, we adapt perplexity-based approaches to the boundary detection task and analyze their behaviour. Finally, we analyze the robustness of all proposed classifiers in cross-domain and cross-model settings, discovering important properties of the data that can negatively influence the performance of artificial text boundary detection algorithms.
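A perplexity-based detector of this general kind can be sketched simply. The scoring model, window size, and data below are invented for illustration; the only grounded idea is that machine-generated continuations tend to be more predictable (lower per-token negative log-likelihood) under a scoring language model:

```python
def boundary_by_perplexity(token_nll, window=3):
    """Guess the human/machine boundary as the position where the mean
    per-token negative log-likelihood drops the most between the window
    before and the window after. token_nll would come from any language
    model; here it is synthetic."""
    best_t, best_drop = None, float("-inf")
    for t in range(window, len(token_nll) - window + 1):
        before = sum(token_nll[t - window:t]) / window
        after = sum(token_nll[t:t + window]) / window
        drop = before - after
        if drop > best_drop:
            best_t, best_drop = t, drop
    return best_t

# Synthetic NLLs: a "surprising" human prefix, a "predictable" machine suffix.
nll = [3.1, 2.8, 3.4, 3.0, 1.1, 0.9, 1.2, 1.0]
print(boundary_by_perplexity(nll))
```
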
https://arxiv.org/abs/2311.08349
Holistic scene understanding includes semantic segmentation, surface normal estimation, object boundary detection, depth estimation, etc. The key aspect of this problem is to learn representations effectively, as each subtask builds upon attributes that are not only correlated but also distinct. Inspired by visual-prompt tuning, we propose a Task-Specific Prompts Transformer, dubbed TSP-Transformer, for holistic scene understanding. It features a vanilla transformer in the early stage and a task-specific prompts transformer encoder in the later stage, where task-specific prompts are augmented. In this way, the transformer layer learns generic information from the shared parts while being endowed with task-specific capacity. First, the task-specific prompts effectively serve as induced priors for each task. Moreover, the task-specific prompts can be seen as switches that favor task-specific representation learning for different tasks. Extensive experiments on NYUD-v2 and PASCAL-Context show that our method achieves state-of-the-art performance, validating its effectiveness for holistic scene understanding. Our code is available at this https URL.
https://arxiv.org/abs/2311.03427
One of the primary obstacles in the advancement of Natural Language Processing (NLP) technologies for low-resource languages is the lack of annotated datasets for training and testing machine learning models. In this paper, we present Antarlekhaka, a tool for manual annotation of a comprehensive set of tasks relevant to NLP. The tool is Unicode-compatible, language-agnostic, Web-deployable and supports distributed annotation by multiple simultaneous annotators. The system sports user-friendly interfaces for 8 categories of annotation tasks, which in turn enable the annotation of a considerably larger set of NLP tasks. The task categories include two linguistic tasks not handled by any other tool, namely, sentence boundary detection and deciding canonical word order, which are important tasks for text in the form of poetry. We propose the idea of sequential annotation based on small text units, where an annotator performs several tasks related to a single text unit before proceeding to the next unit. The research applications of the proposed mode of multi-task annotation are also discussed. Antarlekhaka outperforms other annotation tools in objective evaluation. It has also been used for two real-life annotation tasks in two different languages, namely, Sanskrit and Bengali. The tool is available at this https URL.
https://arxiv.org/abs/2310.07826
Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks. Existing methods typically require video frames to be decoded before being fed into the network, which introduces significant spatio-temporal redundancy and demands considerable computational power and storage space. To remedy these issues, we propose a novel compressed video representation learning method for event boundary detection that is fully end-to-end, leveraging rich information in the compressed domain, i.e., RGB, motion vectors, residuals, and the internal group of pictures (GOP) structure, without fully decoding the video. Specifically, we use lightweight ConvNets to extract features of the P-frames in the GOPs, and a spatial-channel attention module (SCAM) is designed to refine the feature representations of the P-frames based on the compressed information with bidirectional information flow. To learn a suitable representation for boundary detection, we construct a local frames bag for each candidate frame and use a long short-term memory (LSTM) module to capture temporal relationships. We then compute frame differences with group similarities in the temporal domain. This module is only applied within a local window, which is critical for event boundary detection. Finally, a simple classifier is used to determine the event boundaries of video sequences based on the learned feature representation. To remedy the ambiguities of annotations and speed up the training process, we use a Gaussian kernel to preprocess the ground-truth event boundaries. Extensive experiments conducted on the Kinetics-GEBD and TAPOS datasets demonstrate that the proposed method achieves considerable improvements over the previous end-to-end approach while running at the same speed. The code is available at this https URL.
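The Gaussian preprocessing of ground-truth boundaries can be sketched as follows; the kernel width is an assumed hyperparameter, and the paper's exact formulation may differ:

```python
import math

def soft_boundary_targets(num_frames, boundaries, sigma=1.0):
    """Turn hard 0/1 boundary annotations into soft targets by placing a
    Gaussian bump at each annotated boundary, tolerating annotation
    ambiguity of a frame or two around each boundary."""
    targets = [0.0] * num_frames
    for b in boundaries:
        for t in range(num_frames):
            targets[t] = max(targets[t],
                             math.exp(-((t - b) ** 2) / (2 * sigma ** 2)))
    return targets

soft = soft_boundary_targets(num_frames=9, boundaries=[4], sigma=1.0)
print([round(v, 3) for v in soft])  # symmetric bump peaking at frame 4
```
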
https://arxiv.org/abs/2309.15431
Recent span-based joint extraction models have demonstrated significant advantages in both entity recognition and relation extraction. These models treat text spans as candidate entities, and span pairs as candidate relation tuples, achieving state-of-the-art results on datasets like ADE. However, these models encounter a significant number of non-entity spans or irrelevant span pairs during these tasks, which substantially impairs model performance. To address this issue, this paper introduces a span-based multitask entity-relation joint extraction model. This approach employs multitask learning to alleviate the impact of negative samples on the entity and relation classifiers. Additionally, we leverage the Intersection over Union (IoU) concept to introduce positional information into the entity classifier, enabling span boundary detection. Furthermore, by incorporating the entity logits predicted by the entity classifier into the embedded representation of entity pairs, the semantic input for the relation classifier is enriched. Experimental results demonstrate that our proposed model can effectively mitigate the adverse effects of excessive negative samples on model performance. Furthermore, the model achieved commendable F1 scores of 73.61\%, 53.72\%, and 83.72\% on three widely employed public datasets, namely CoNLL04, SciERC, and ADE, respectively.
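The IoU concept carries over to one-dimensional token spans directly. A minimal sketch (not the paper's exact formulation) for half-open `(start, end)` spans:

```python
def span_iou(a, b):
    """IoU of two half-open token spans (start, end); 0.0 when disjoint."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0

print(span_iou((0, 4), (2, 6)))  # overlap of 2 tokens over a union of 6 -> 1/3
```
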
https://arxiv.org/abs/2309.09713
Accurate polyp delineation in colonoscopy is crucial for assisting in diagnosis, guiding interventions, and treatments. However, current deep-learning approaches fall short due to integrity deficiency, which often manifests as missing lesion parts. This paper introduces the integrity concept in polyp segmentation at both macro and micro levels, aiming to alleviate integrity deficiency. Specifically, the model should distinguish entire polyps at the macro level and identify all components within polyps at the micro level. Our Integrity Capturing Polyp Segmentation (IC-PolypSeg) network utilizes lightweight backbones and three key components for improving integrity: 1) a pixel-wise feature redistribution (PFR) module captures global spatial correlations across channels in the final semantic-rich encoder features; 2) a cross-stage pixel-wise feature redistribution (CPFR) module dynamically fuses high-level semantics and low-level spatial features to capture contextual information; and 3) a coarse-to-fine calibration module combines the PFR and CPFR modules to achieve precise boundary detection. Extensive experiments on 5 public datasets demonstrate that the proposed IC-PolypSeg outperforms 8 state-of-the-art methods in terms of precision while significantly improving computational efficiency. IC-PolypSeg-EF0 employs 300 times fewer parameters than PraNet while achieving a real-time processing speed of 235 FPS. Importantly, IC-PolypSeg reduces the false negative ratio on the five datasets, meeting clinical requirements.
https://arxiv.org/abs/2309.08234
Temporal action segmentation is typically achieved by discovering the dramatic variations in global visual descriptors. In this paper, we explore the merits of local features by proposing the unsupervised framework of Object-centric Temporal Action Segmentation (OTAS). Broadly speaking, OTAS consists of self-supervised global and local feature extraction modules as well as a boundary selection module that fuses the features and detects salient boundaries for action segmentation. As a second contribution, we discuss the pros and cons of existing frame-level and boundary-level evaluation metrics. Through extensive experiments, we find OTAS superior to the previous state-of-the-art method by $41\%$ on average in terms of our recommended F1 score. Surprisingly, OTAS even outperforms the ground-truth human annotations in a user study. Moreover, OTAS is efficient enough to allow real-time inference.
https://arxiv.org/abs/2309.06276
Music Structure Analysis (MSA) is the task of identifying the musical segments that compose a music track and, possibly, labeling them based on their similarity. In this paper we propose a supervised approach to the task of music boundary detection, in which we simultaneously learn features and convolution kernels. For this, we jointly optimize (i) a loss based on the Self-Similarity Matrix (SSM) obtained with the learned features, denoted the SSM-loss, and (ii) a loss based on the novelty score obtained by applying the learned kernels to the estimated SSM, denoted the novelty-loss. We also demonstrate that relative feature learning, through self-attention, is beneficial for the task of MSA. Finally, we compare the performance of our approach to previously proposed approaches on the standard RWC-Pop dataset and various subsets of SALAMI.
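The novelty score in such pipelines follows the classic SSM-plus-checkerboard-kernel construction (Foote-style). A minimal sketch with hand-crafted features and a fixed sign kernel, rather than the learned features and kernels of the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity of two feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def novelty_curve(features, L=2):
    """Self-similarity matrix + checkerboard kernel: the novelty score peaks
    where the SSM changes from one homogeneous block to another."""
    n = len(features)
    ssm = [[cosine(features[i], features[j]) for j in range(n)] for i in range(n)]
    def sgn(x):
        return (x > 0) - (x < 0)
    novelty = []
    for t in range(L, n - L):
        score = sum(sgn(i) * sgn(j) * ssm[t + i][t + j]
                    for i in range(-L, L + 1) for j in range(-L, L + 1))
        novelty.append((t, score))
    return novelty

feats = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4      # two homogeneous "sections"
curve = novelty_curve(feats)
print(max(curve, key=lambda ts: ts[1]))           # peak lies at the section change
```
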
https://arxiv.org/abs/2309.02243
This paper presents a set of industrial-grade text processing models for Hungarian that achieve near state-of-the-art performance while balancing resource efficiency and accuracy. Models have been implemented in the spaCy framework, extending the HuSpaCy toolkit with several improvements to its architecture. Compared to existing NLP tools for Hungarian, all of our pipelines feature all basic text processing steps including tokenization, sentence-boundary detection, part-of-speech tagging, morphological feature tagging, lemmatization, dependency parsing and named entity recognition with high accuracy and throughput. We thoroughly evaluated the proposed enhancements, compared the pipelines with state-of-the-art tools and demonstrated the competitive performance of the new models in all text preprocessing steps. All experiments are reproducible and the pipelines are freely available under a permissive license.
https://arxiv.org/abs/2308.12635
We propose Boundary-RL, a novel weakly supervised segmentation method that utilises only patch-level labels for training. We view segmentation as a boundary detection problem, rather than as pixel-level classification as in previous works. This outlook on segmentation may allow for boundary delineation under challenging scenarios, such as where noise artefacts are present within the region-of-interest (ROI) boundaries and traditional pixel-level classification-based weakly supervised methods may not be able to effectively segment the ROI. Ultrasound images are of particular interest, as their intensity values represent acoustic impedance differences between boundaries, and may also benefit from the boundary delineation approach. Our method uses reinforcement learning to train a controller function to localise the boundaries of ROIs using a reward derived from a pre-trained boundary-presence classifier. The classifier indicates when an object boundary is encountered within a patch, as the controller modifies the patch location in a sequential Markov decision process. The classifier itself is trained using only binary patch-level labels of object presence, the only labels used during training of the entire boundary delineation framework, and serves as a weak signal to inform the boundary delineation. The use of a controller function ensures that a sliding window over the entire image is not necessary. It also minimises potential false-positive and false-negative cases by limiting the number of patches passed to the boundary-presence classifier. We evaluate our proposed approach on the clinically relevant task of prostate gland segmentation in trans-rectal ultrasound images, and show improved performance compared to other tested weakly supervised methods that use the same labels, e.g., multiple instance learning.
https://arxiv.org/abs/2308.11376
This paper introduces our system designed for Track 2, which focuses on locating manipulated regions, in the second Audio Deepfake Detection Challenge (ADD 2023). Our approach involves the utilization of multiple detection systems to identify splicing regions and determine their authenticity. Specifically, we train and integrate two frame-level systems: one for boundary detection and the other for deepfake detection. Additionally, we employ a third VAE model trained exclusively on genuine data to determine the authenticity of a given audio clip. Through the fusion of these three systems, our top-performing solution for the ADD challenge achieves an impressive 82.23% sentence accuracy and an F1 score of 60.66%. This results in a final ADD score of 0.6713, securing the first rank in Track 2 of ADD 2023.
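A late score-fusion scheme of this general shape can be sketched as follows; the weights, threshold, and per-frame scores below are invented for illustration, not the submission's actual values:

```python
def fuse_scores(boundary, deepfake, vae_authenticity, weights=(0.4, 0.4, 0.2)):
    """Weighted late fusion of three per-frame scores into one fake-region
    score; the VAE outputs authenticity, so it enters inverted. The weights
    are made-up illustration values."""
    w1, w2, w3 = weights
    return [w1 * b + w2 * d + w3 * (1.0 - a)
            for b, d, a in zip(boundary, deepfake, vae_authenticity)]

def fake_region(scores, threshold=0.5):
    """Indices of frames whose fused score exceeds the threshold."""
    return [t for t, s in enumerate(scores) if s > threshold]

fused = fuse_scores([0.1, 0.9, 0.8, 0.1],   # boundary-detector scores
                    [0.2, 0.8, 0.9, 0.1],   # deepfake-detector scores
                    [0.9, 0.1, 0.2, 0.95])  # VAE authenticity scores
print(fake_region(fused))  # middle frames flagged as manipulated
```
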
https://arxiv.org/abs/2308.10281
Recently, there has been a high demand for accelerating and improving automatic cadastral boundary detection. As this problem is still in its early stages, many computer vision and deep learning methods have not yet been considered for it. In this paper, we focus on deep learning and provide three geometric post-processing methods that improve the quality of the results. Our framework consists of two parts, each comprising a few phases, and our solution is based on instance segmentation. In the first part, we use Mask R-CNN with a ResNet-50 backbone pre-trained on the ImageNet dataset. In the second part, we apply three geometric post-processing methods to the output of the first part to obtain a better overall output. Here, we also use computational geometry to introduce a new method for simplifying lines, which we call the pocket-based simplification algorithm. To evaluate the quality of our solution, we use the standard metrics in this field: recall, precision and F-score. The highest recall we achieve is 95 percent, while maintaining a high precision of 72 percent, resulting in an F-score of 82 percent. Instance segmentation with Mask R-CNN, combined with geometric post-processing of its output, yields promising results for this field. The results also show that the pocket-based simplification algorithm simplifies lines better than the Douglas-Peucker algorithm.
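For reference, the Douglas-Peucker baseline mentioned above is easy to state; the pocket-based algorithm itself is the paper's contribution and is not reproduced here:

```python
import math

def perp_dist(p, a, b):
    """Perpendicular distance from point p to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, eps):
    """Classic Douglas-Peucker line simplification: keep the farthest point
    from the chord if it deviates more than eps, and recurse on both halves."""
    if len(points) < 3:
        return list(points)
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            idx, dmax = i, d
    if dmax > eps:
        left = douglas_peucker(points[:idx + 1], eps)
        right = douglas_peucker(points[idx:], eps)
        return left[:-1] + right
    return [points[0], points[-1]]

print(douglas_peucker([(0, 0), (1, 0.1), (2, -0.1), (4, 0)], eps=0.5))  # near-collinear run collapsed
print(douglas_peucker([(0, 0), (2, 2), (4, 0)], eps=1.0))               # significant corner kept
```
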
https://arxiv.org/abs/2309.16708
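The abstract above compares its pocket-based method against the classical Douglas-Peucker baseline. The pocket-based algorithm itself is not specified here, so only the baseline is sketched, as a minimal recursive implementation:

```python
import math

def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        return math.hypot(x - x1, y - y1)
    return abs(dy * x - dx * y + x2 * y1 - y2 * x1) / math.hypot(dx, dy)

def douglas_peucker(points, epsilon):
    """Simplify a polyline: drop points closer than epsilon to the current chord."""
    if len(points) < 3:
        return list(points)
    # Find the point farthest from the chord between the endpoints.
    dmax, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > dmax:
            dmax, index = d, i
    if dmax > epsilon:
        # Keep the farthest point and recurse on both halves.
        left = douglas_peucker(points[: index + 1], epsilon)
        right = douglas_peucker(points[index:], epsilon)
        return left[:-1] + right
    # Every interior point is within tolerance: keep only the endpoints.
    return [points[0], points[-1]]
```

Applied to building footprints produced by instance segmentation, such a simplifier removes the jagged interior vertices of polygon edges while keeping corners whose deviation exceeds `epsilon`.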
Due to the high similarity between camouflaged instances and the background, the recently proposed camouflaged instance segmentation (CIS) faces challenges in accurate localization and instance segmentation. To this end, inspired by query-based transformers, we propose a unified query-based multi-task learning framework for camouflaged instance segmentation, termed UQFormer, which builds a set of mask queries and a set of boundary queries to learn a shared composed query representation and efficiently integrates global camouflaged object region and boundary cues, for simultaneous instance segmentation and instance boundary detection in camouflaged scenarios. Specifically, we design a composed query learning paradigm that learns a shared representation to capture object region and boundary features by the cross-attention interaction of mask queries and boundary queries in the designed multi-scale unified learning transformer decoder. Then, we present a transformer-based multi-task learning framework for simultaneous camouflaged instance segmentation and camouflaged instance boundary detection based on the learned composed query representation, which also forces the model to learn a strong instance-level query representation. Notably, our model views the instance segmentation as a query-based direct set prediction problem, without other post-processing such as non-maximal suppression. Compared with 14 state-of-the-art approaches, our UQFormer significantly improves the performance of camouflaged instance segmentation. Our code will be available at this https URL.
https://arxiv.org/abs/2308.07392
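The cross-attention interaction between mask queries and boundary queries can be sketched in a toy form. This is a single-head, weight-free NumPy illustration under assumed dimensions; UQFormer's actual decoder is a multi-scale, multi-head transformer with learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Scaled dot-product attention of one query set over the other set."""
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d))  # (Nq, Nk)
    return attn @ keys_values                              # (Nq, d)

def composed_queries(mask_q, boundary_q):
    """Fuse mask and boundary query sets into a shared composed representation."""
    mask_ctx = cross_attention(mask_q, boundary_q)      # masks attend to boundaries
    boundary_ctx = cross_attention(boundary_q, mask_q)  # boundaries attend to masks
    # Residual update keeps each query set grounded in its own role.
    return mask_q + mask_ctx, boundary_q + boundary_ctx
```

The point of the sketch is the bidirectional exchange: each mask query absorbs boundary cues and vice versa, so the composed representation carries both region and contour information for the downstream set-prediction heads.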
Human-AI collaborative writing has been greatly facilitated with the help of modern large language models (LLM), e.g., ChatGPT. While admitting the convenience brought by technology advancement, educators also have concerns that students might leverage LLM to partially complete their writing assignment and pass off the human-AI hybrid text as their original work. Driven by such concerns, in this study, we investigated the automatic detection of Human-AI hybrid text in education, where we formalized the hybrid text detection as a boundary detection problem, i.e., identifying the transition points between human-written content and AI-generated content. We constructed a hybrid essay dataset by partially removing sentences from the original student-written essays and then instructing ChatGPT to fill in for the incomplete essays. Then we proposed a two-step detection approach where we (1) Separated AI-generated content from human-written content during the embedding learning process; and (2) Calculated the distances between every two adjacent prototypes (a prototype is the mean of a set of consecutive sentences from the hybrid text in the embedding space) and assumed that the boundaries exist between the two prototypes that have the furthest distance from each other. Through extensive experiments, we summarized the following main findings: (1) The proposed approach consistently outperformed the baseline methods across different experiment settings; (2) The embedding learning process (i.e., step 1) can significantly boost the performance of the proposed approach; (3) When detecting boundaries for single-boundary hybrid essays, the performance of the proposed approach could be enhanced by adopting a relatively large prototype size, leading to a $22$\% improvement (against the second-best baseline method) in the in-domain setting and an $18$\% improvement in the out-of-domain setting.
https://arxiv.org/abs/2307.12267
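Step (2) of the approach above can be sketched directly: mean-pool consecutive sentence embeddings into prototypes, then place the boundary between the two adjacent prototypes that are furthest apart. The embedding-learning step (1) is assumed to have already happened; inputs here are plain vectors, and the function names are illustrative:

```python
import numpy as np

def prototypes(sentence_embs, proto_size):
    """Mean-pool every `proto_size` consecutive sentence embeddings into one prototype."""
    return [np.mean(sentence_embs[i:i + proto_size], axis=0)
            for i in range(0, len(sentence_embs), proto_size)]

def predict_boundary(sentence_embs, proto_size):
    """Predict the sentence index where the human/AI transition falls.

    The boundary is assumed to lie between the pair of adjacent prototypes
    with the largest embedding-space distance.
    """
    protos = prototypes(sentence_embs, proto_size)
    gaps = [np.linalg.norm(protos[i + 1] - protos[i]) for i in range(len(protos) - 1)]
    k = int(np.argmax(gaps))        # widest gap: between prototype k and k+1
    return (k + 1) * proto_size     # boundary expressed as a sentence index
```

This also makes the abstract's third finding concrete: a larger `proto_size` averages over more sentences, smoothing per-sentence noise in single-boundary essays at the cost of coarser boundary resolution.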