Owing to diverse geographical environments, intricate landscapes, and high-density settlements, automatically identifying urban village boundaries from remote sensing images is a highly challenging task. This paper proposes a novel and efficient neural network model called UV-Mamba for accurate boundary detection in high-resolution remote sensing images. UV-Mamba mitigates the memory loss problem that arises in state space models (SSMs) during long-sequence modeling as image size increases, by incorporating deformable convolutions (DCN). Its architecture adopts an encoder-decoder framework, comprising an encoder with four deformable state space augmentation (DSSA) blocks for efficient multi-level semantic extraction and a decoder that integrates the extracted semantic information. We conducted experiments on the Beijing and Xi'an datasets, and the results show that UV-Mamba achieves state-of-the-art performance. Specifically, our model achieves 73.3% and 78.1% IoU on the Beijing and Xi'an datasets, respectively, representing improvements of 1.2% and 3.4% IoU over the previous best model, while also being 6x faster in inference speed and 40x smaller in parameter count. Source code and pre-trained models are available in the supplementary material.
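The abstract describes the deformable-convolution component only at a high level; below is a minimal sketch of how such a block can be wired up with torchvision's DeformConv2d. The block composition, channel sizes, and names are illustrative assumptions, not UV-Mamba's actual DSSA design.

```python
# Minimal sketch: a deformable-convolution block of the kind UV-Mamba uses
# to inject spatially adaptive sampling before state-space sequence modeling.
# Shapes and module composition are illustrative, not the paper's exact design.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # A plain conv predicts per-position sampling offsets (2 per kernel tap).
        self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                kernel_size, padding=pad)
        self.dcn = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dcn(x, self.offset(x))

feat = torch.randn(1, 64, 128, 128)   # (B, C, H, W) feature map
out = DeformableBlock(64)(feat)       # same spatial size, deformed sampling
print(out.shape)                      # torch.Size([1, 64, 128, 128])
```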
https://arxiv.org/abs/2409.03431
This work presents the INBD network proposed by Gillert et al. at CVPR 2023 and studies its application to delineating tree rings in smartphone-captured RGB images of Pinus taeda cross sections (the UruDendro dataset), whose characteristics differ from those of the images used to train the method. The INBD network operates in two stages: first, it segments the background, pith, and ring boundaries. In the second stage, the image is transformed into polar coordinates, and ring boundaries are iteratively segmented from the pith to the bark. Both stages are based on the U-Net architecture. The method achieves an F-Score of 77.5, a mAR of 0.540, and an ARAND of 0.205 on the evaluation set. The code for the experiments is available at this https URL.
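Since the second stage hinges on the polar-coordinate view, here is a small sketch of that transform using OpenCV's warpPolar. The pith coordinates, radius, and output size are placeholder values; the real pipeline derives them from the first-stage segmentation.

```python
# Sketch of the polar unwrapping INBD's second stage relies on: with the pith
# as origin, concentric rings map to roughly vertical columns, so walking
# outward in radius visits ring boundaries in pith-to-bark order.
import cv2
import numpy as np

img = np.zeros((960, 1024, 3), np.uint8)   # placeholder cross-section image
pith = (512.0, 480.0)                      # (x, y) pith center from stage 1
max_radius = 480.0                         # pith-to-bark distance in pixels

polar = cv2.warpPolar(
    img,
    (512, 360),                # 512 radial samples x 360 angular rows
    pith,
    max_radius,
    cv2.WARP_POLAR_LINEAR,
)
# polar[a, r] samples the source at angle a and radius r * max_radius / 512,
# so each column corresponds to one distance from the pith.
```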
https://arxiv.org/abs/2408.14343
Generic event boundary detection (GEBD), inspired by human visual cognitive behaviors of consistently segmenting videos into meaningful temporal chunks, finds utility in various applications such as video editing. In this paper, we demonstrate that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios. We contribute to addressing this challenge by experimentally reexamining the architecture of GEBD models and uncovering several surprising findings. Firstly, we reveal that a concise GEBD baseline model already achieves promising performance without any sophisticated design. Secondly, we find that the widely applied image-domain backbones in GEBD models can contain plenty of architecture redundancy, motivating us to gradually ``modernize'' each component to enhance efficiency. Thirdly, we show that GEBD models using image-domain backbones, which conduct spatiotemporal learning in a spatial-then-temporal greedy manner, can suffer from a distraction issue that may be the culprit behind GEBD inefficiency. Using a video-domain backbone to jointly conduct spatiotemporal modeling is an effective solution to this issue. The outcome of our exploration is a family of GEBD models, named EfficientGEBD, that significantly outperforms the previous SOTA methods with up to a 1.7\% performance gain and a 280\% speedup under the same backbone. Our research prompts the community to design modern GEBD methods with model complexity in mind, particularly in resource-aware applications. The code is available at \url{this https URL}.
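To make the two factorizations concrete, here is a minimal PyTorch sketch contrasting a per-frame 2D backbone followed by a temporal layer against a joint 3D convolution. Layer choices and sizes are illustrative stand-ins, not the paper's actual backbones.

```python
# Sketch contrasting the two spatiotemporal factorizations discussed above:
# a 2D (image) backbone applied frame by frame plus a temporal head, versus
# a 3D (video) backbone that models space and time jointly.
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 64, 64)            # (B, C, T, H, W)

# spatial-then-temporal: per-frame 2D conv, then a conv over time only
spatial = nn.Conv2d(3, 8, 3, padding=1)
per_frame = torch.stack([spatial(clip[:, :, t]) for t in range(16)], dim=2)
temporal = nn.Conv3d(8, 8, (3, 1, 1), padding=(1, 0, 0))(per_frame)

# joint spatiotemporal: a single 3D conv sees space and time together
joint = nn.Conv3d(3, 8, 3, padding=1)(clip)
print(temporal.shape, joint.shape)              # both (1, 8, 16, 64, 64)
```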
https://arxiv.org/abs/2407.12622
Generic event boundary detection (GEBD) aims at pinpointing event boundaries naturally perceived by humans, playing a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, spanning different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect various boundaries by the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more intelligent and reasonable way is to adaptively detect boundaries by considering their special properties. In light of this, we propose a novel dynamic pipeline for generic event boundaries named DyBDet. By introducing a multi-exit network architecture, DyBDet automatically learns the subnet allocation to different video snippets, enabling fine-grained detection for various boundaries. Besides, a multi-order difference detector is also proposed to ensure generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks, leading to obvious improvements in both performance and efficiency compared to the current state-of-the-art.
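The multi-exit idea can be illustrated with a toy module: each snippet passes through a stack of stages and exits at the first stage whose prediction is confident enough. All sizes, thresholds, and module names here are illustrative assumptions, not DyBDet's actual subnet allocation scheme.

```python
# Sketch of a multi-exit detector: easy snippets stop early, hard snippets
# use the full network, giving fine-grained compute allocation per snippet.
import torch
import torch.nn as nn

class MultiExitDetector(nn.Module):
    def __init__(self, dim: int = 256, num_stages: int = 3, tau: float = 0.9):
        super().__init__()
        self.stages = nn.ModuleList(nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
                                    for _ in range(num_stages))
        self.exits = nn.ModuleList(nn.Linear(dim, 2) for _ in range(num_stages))
        self.tau = tau  # confidence threshold controlling early exit

    def forward(self, snippet: torch.Tensor) -> torch.Tensor:
        x = snippet
        for stage, exit_head in zip(self.stages, self.exits):
            x = stage(x)
            probs = exit_head(x).softmax(-1)
            if probs.max() >= self.tau:   # easy snippet: stop computing here
                return probs
        return probs                      # hard snippet: used the full net

det = MultiExitDetector()
p = det(torch.randn(256))                # boundary/non-boundary probabilities
```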
https://arxiv.org/abs/2407.04274
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best method in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at \url{this https URL}.
https://arxiv.org/abs/2407.02228
Moment retrieval aims to locate the most relevant moment in an untrimmed video based on a given natural language query. Existing solutions can be roughly categorized into moment-based and clip-based methods. The former often involves heavy computations, while the latter, due to overlooking coarse-grained information, typically underperforms compared to moment-based models. Hence, this paper proposes a novel 2-Dimensional Pointer-based Machine Reading Comprehension for Moment Retrieval Choice (2DP-2MRC) model to address the issue of imprecise localization in clip-based methods while maintaining lower computational complexity than moment-based methods. Specifically, we introduce an AV-Encoder to capture coarse-grained information at the moment and video levels. Additionally, a 2D pointer encoder module is introduced to further enhance boundary detection for the target moment. Extensive experiments on the HiREST dataset demonstrate that 2DP-2MRC significantly outperforms existing baseline models.
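A toy sketch of the pointer-style span decoding behind this family of models: two heads score each clip as a start or an end, and the best valid (start <= end) pair is selected over a 2D score table. This is a simplified illustration, not the actual 2DP-2MRC module.

```python
# Sketch of pointer-based boundary selection for moment retrieval.
import torch
import torch.nn as nn

T, D = 64, 256                          # clips per video, feature dim
clips = torch.randn(T, D)               # fused query/video clip features
start_head, end_head = nn.Linear(D, 1), nn.Linear(D, 1)

s_logits = start_head(clips).squeeze(-1)         # (T,) start scores
e_logits = end_head(clips).squeeze(-1)           # (T,) end scores
scores = s_logits[:, None] + e_logits[None, :]   # (T, T) joint span scores
mask = torch.tril(torch.ones(T, T, dtype=torch.bool), diagonal=-1)
scores = scores.masked_fill(mask, float("-inf")) # enforce start <= end
best = scores.flatten().argmax()
start, end = divmod(best.item(), T)              # predicted moment boundaries
print(start, end)
```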
https://arxiv.org/abs/2406.06201
Computational morphology handles language processing at the word level. It is one of the foundational tasks in the NLP pipeline for the development of higher-level NLP applications. It mainly deals with the processing of words and word forms. Computational morphology addresses various sub-problems such as morpheme boundary detection, lemmatization, morphological feature tagging, and morphological reinflection. In this paper, we present an exhaustive survey of the methods for developing computational morphology tools. We survey the literature in chronological order, from conventional methods to the recent evolution of deep neural network-based approaches. We also review the existing datasets available for this task across languages. We discuss the effectiveness of neural models compared with traditional models and present some unique challenges associated with building computational morphology tools. We conclude by discussing some recent and open research issues in this field.
https://arxiv.org/abs/2406.05424
A key challenge for the widespread application of learning-based models for robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to speed up deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by exploiting the groundwork paved by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance the performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by employing PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations. The code of our work is publicly available at this http URL.
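The label-efficient recipe can be pictured as a frozen foundation model feeding dense features to two small trainable heads. The sketch below is a hedged illustration under that assumption; the backbone, feature dimension, and head design are stand-ins, not PASTEL's exact components.

```python
# Sketch: only two lightweight heads (semantics, boundaries) are trained on
# top of frozen foundation-model features, keeping annotation needs small.
import torch
import torch.nn as nn

class LightweightHeads(nn.Module):
    def __init__(self, feat_dim: int = 384, num_classes: int = 19):
        super().__init__()
        self.semantic = nn.Conv2d(feat_dim, num_classes, kernel_size=1)
        self.boundary = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        # per-pixel class logits and boundary probabilities
        return self.semantic(feats), self.boundary(feats).sigmoid()

feats = torch.randn(1, 384, 64, 64)       # frozen foundation-model features
sem_logits, boundary_prob = LightweightHeads()(feats)
# A fusion step (normalized cut in PASTEL) would then merge both predictions
# into panoptic maps.
```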
https://arxiv.org/abs/2405.19035
How to extract instance-level masks without instance-level supervision is the main challenge of weakly supervised instance segmentation (WSIS). Popular WSIS methods estimate a displacement field (DF) via learning inter-pixel relations and perform clustering to identify instances. However, the resulting instance centroids are inherently unstable and vary significantly across different clustering algorithms. In this paper, we propose Boundary-Assisted Instance Segmentation (BAISeg), a novel paradigm for WSIS that realizes instance segmentation with pixel-level annotations. BAISeg comprises an instance-aware boundary detection (IABD) branch and a semantic segmentation branch. The IABD branch identifies instances by predicting class-agnostic instance boundaries rather than instance centroids, and is therefore different from previous DF-based approaches. In particular, we propose the Cascade Fusion Module (CFM) and Deep Mutual Attention (DMA) in the IABD branch to obtain rich contextual information and capture instance boundaries with weak responses. During the training phase, we employ Pixel-to-Pixel Contrast to enhance the discriminative capacity of the IABD branch, which further strengthens the continuity and closedness of the instance boundaries. Extensive experiments on PASCAL VOC 2012 and MS COCO demonstrate the effectiveness of our approach, and we achieve considerable performance with only pixel-level annotations. The code will be available at this https URL.
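The core principle — closed class-agnostic boundaries turning a semantic mask into instances — can be shown in a few lines: remove boundary pixels from the foreground and label connected components. This illustrates the principle only, not BAISeg itself.

```python
# Sketch: boundaries split a semantic region into separately labeled instances.
import numpy as np
from scipy import ndimage

semantic = np.zeros((64, 64), bool)
semantic[8:56, 8:56] = True                 # one foreground class region
boundary = np.zeros((64, 64), bool)
boundary[:, 31:33] = True                   # a predicted instance boundary

interiors = semantic & ~boundary            # boundary pixels cut the region
instances, n = ndimage.label(interiors)     # each component = one instance
print(n)                                    # 2 instances
```

This also shows why the continuity and closedness of boundaries matter: a single gap in the boundary would merge the two components back into one instance.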
https://arxiv.org/abs/2406.18558
This paper explores the capabilities of vision language models (VLMs) on spreadsheet comprehension. We propose three self-supervised challenges with corresponding evaluation metrics to comprehensively evaluate VLMs on Optical Character Recognition (OCR), spatial perception, and visual format recognition. Additionally, we utilize the spreadsheet table detection task to assess the overall performance of VLMs by integrating these challenges. To probe VLMs more finely, we propose three spreadsheet-to-image settings: column width adjustment, style change, and address augmentation. We propose variants of prompts to address the above tasks in different settings. Notably, to leverage the strengths of VLMs in understanding text rather than two-dimensional positioning, we propose to decode cell values on the four boundaries of the table in spreadsheet boundary detection. Our findings reveal that VLMs demonstrate promising OCR capabilities but produce unsatisfactory results due to cell omission and misalignment, and they exhibit notably insufficient spatial and format recognition skills. These findings motivate future work to enhance VLMs' spreadsheet data comprehension using our methods to generate extensive spreadsheet-image pairs in various settings.
https://arxiv.org/abs/2405.16234
The Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection shared task in the SemEval-2024 competition aims to tackle the problem of misusing collaborative human-AI writing. Although many detectors of AI-generated content already exist, they are often designed to give a binary answer and thus may not be suitable for the more nuanced problem of finding the boundaries between human-written and machine-generated text, which becomes increasingly relevant as hybrid human-AI writing grows more popular. In this paper, we address the boundary detection problem. In particular, we present a pipeline for augmenting data for supervised fine-tuning of DeBERTaV3. With this pipeline, we achieve a new best MAE score according to the competition leaderboard.
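A minimal sketch of the fine-tuning setup described here, assuming boundary detection is cast as per-token classification (0 = human, 1 = machine) on DeBERTaV3 via Hugging Face transformers; the augmentation pipeline itself is omitted, and the labels below are dummies.

```python
# Sketch: DeBERTaV3 with a token-classification head for boundary detection.
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2)

text = "A human wrote this part. Then the machine continued it."
inputs = tokenizer(text, return_tensors="pt")
labels = inputs.input_ids.new_zeros(inputs.input_ids.shape)  # dummy labels
outputs = model(**inputs, labels=labels)                     # loss + logits
print(outputs.logits.shape)      # (1, seq_len, 2) per-token human/machine
```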
https://arxiv.org/abs/2405.10629
Relational triple extraction is crucial for the automatic construction of knowledge graphs. Existing methods only construct shallow representations at the token or token-pair level. However, previous works ignore the local spatial dependencies of relational triples, resulting in weak entity-pair boundary detection. To tackle this problem, we propose a novel Region-based Table Filling method (RTF). We devise a novel region-based tagging scheme and bi-directional decoding strategy, which regard each relational triple as a region on the relation-specific table and identify triples by determining the two endpoints of each region. We also introduce convolution to construct region-level table representations from a spatial perspective, which makes triples easier to capture. In addition, we share partial tagging scores among different relations to improve the learning efficiency of the relation classifier. Experimental results show that our method achieves state-of-the-art performance with better generalization capability on three variants of two widely used benchmark datasets.
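The region view can be made concrete with a toy decoder: in a relation-specific table indexed by (subject token, object token), a triple occupies a rectangle and can be recovered from its two tagged corner endpoints. The tags, pairing rule, and table below are synthetic illustrations, not RTF's actual tagging scheme.

```python
# Sketch: recover rectangular regions (triples) from two endpoint tags.
import numpy as np

table = np.zeros((8, 8), dtype="U2")     # rows: subject tokens, cols: objects
table[2, 5] = "UL"                        # upper-left endpoint of a region
table[3, 6] = "BR"                        # bottom-right endpoint of the region

uls = list(zip(*np.where(table == "UL")))
brs = list(zip(*np.where(table == "BR")))
for (r1, c1) in uls:
    # pair each upper-left corner with the nearest valid bottom-right corner
    match = min(((r2, c2) for (r2, c2) in brs if r2 >= r1 and c2 >= c1),
                default=None)
    if match:
        r2, c2 = match
        print(f"subject span {r1}-{r2}, object span {c1}-{c2}")
```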
https://arxiv.org/abs/2404.19154
With the increasing prevalence of text generated by large language models (LLMs), there is a growing concern about distinguishing between LLM-generated and human-written texts in order to prevent the misuse of LLMs, such as the dissemination of misleading information and academic dishonesty. Previous research has primarily focused on classifying text as either entirely human-written or LLM-generated, neglecting the detection of mixed texts that contain both types of content. This paper explores LLMs' ability to identify boundaries in human-written and machine-generated mixed texts. We approach this task by transforming it into a token classification problem and regard the label turning point as the boundary. Notably, our ensemble model of LLMs achieved first place in the 'Human-Machine Mixed Text Detection' sub-task of the SemEval'24 Competition Task 8. Additionally, we investigate factors that influence the capability of LLMs in detecting boundaries within mixed texts, including the incorporation of extra layers on top of LLMs, combination of segmentation loss, and the impact of pretraining. Our findings aim to provide valuable insights for future research in this area.
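The boundary rule described above — token classification with the label turning point as the boundary — reduces to a few lines once per-token labels are available:

```python
# Sketch of the turning-point rule: given per-token labels
# (0 = human, 1 = machine), the boundary is the first machine-labeled token.
def find_boundary(token_labels: list[int]) -> int:
    """Return the index of the first machine-labeled token, or -1 if none."""
    for i, label in enumerate(token_labels):
        if label == 1:
            return i
    return -1

print(find_boundary([0, 0, 0, 1, 1, 1]))   # 3: the label turning point
```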
https://arxiv.org/abs/2404.00899
Sign Language Recognition (SLR) has garnered significant attention from researchers in recent years, particularly the intricate domain of Continuous Sign Language Recognition (CSLR), which presents heightened complexity compared to Isolated Sign Language Recognition (ISLR). One of the prominent challenges in CSLR pertains to accurately detecting the boundaries of isolated signs within a continuous video stream. Additionally, the reliance on handcrafted features in existing models poses a challenge to achieving optimal accuracy. To surmount these challenges, we propose a novel approach utilizing a Transformer-based model. Unlike traditional models, our approach focuses on enhancing accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched using the Transformer model. Subsequently, these enriched features are forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos. The evaluation of our model on two distinct datasets, each including both continuous signs and their corresponding isolated signs, demonstrates promising results.
https://arxiv.org/abs/2402.14720
Generic Event Boundary Detection (GEBD) task aims to recognize generic, taxonomy-free boundaries that segment a video into meaningful events. Current methods typically involve a neural model trained on a large volume of data, demanding substantial computational power and storage space. We explore two pivotal questions pertaining to GEBD: Can non-parametric algorithms outperform unsupervised neural methods? Does motion information alone suffice for high performance? This inquiry drives us to algorithmically harness motion cues for identifying generic event boundaries in videos. In this work, we propose FlowGEBD, a non-parametric, unsupervised technique for GEBD. Our approach entails two algorithms utilizing optical flow: (i) Pixel Tracking and (ii) Flow Normalization. By conducting thorough experimentation on the challenging Kinetics-GEBD and TAPOS datasets, our results establish FlowGEBD as the new state-of-the-art (SOTA) among unsupervised methods. FlowGEBD exceeds the neural models on the Kinetics-GEBD dataset by obtaining an F1@0.05 score of 0.713 with an absolute gain of 31.7% compared to the unsupervised baseline and achieves an average F1 score of 0.623 on the TAPOS validation dataset.
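A minimal, hedged sketch of a flow-driven boundary score in the spirit of FlowGEBD: compute dense optical flow between consecutive frames, summarize its magnitude, and flag frames where the motion statistic changes sharply. The paper's actual Pixel Tracking and Flow Normalization algorithms are richer than this; the frames and threshold here are synthetic.

```python
# Sketch: non-parametric boundary candidates from optical-flow statistics.
import cv2
import numpy as np

frames = [np.random.randint(0, 255, (120, 160), np.uint8) for _ in range(30)]

motion = []
for prev, nxt in zip(frames, frames[1:]):
    flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motion.append(np.linalg.norm(flow, axis=-1).mean())  # mean flow magnitude

motion = np.array(motion)
change = np.abs(np.diff(motion))                 # frame-to-frame motion change
boundaries = np.where(change > change.mean() + 2 * change.std())[0] + 1
print(boundaries)                                # candidate boundary frames
```

No training or parameters are involved, which is the point of the non-parametric question the paper raises.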
https://arxiv.org/abs/2404.18935
This article investigates the possibility of using the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy, being a measure of uncertainty, should increase in the proximity of a transition between two segments that are well modelled (known) by the recognition network. The advantage of this measure is its simplicity, as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to neural network-based procedures. The different methods are compared with respect to their precision, measured in terms of the ratio between the number C of predicted boundaries within 10 or 20 msec of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries.
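The core measure is simple enough to state directly: per-frame class entropy of the recogniser's posteriors, which should peak near transitions between well-modelled segments. Below is a small sketch with toy posteriors and the simplest of the decision methods, a fixed threshold (the threshold value is illustrative).

```python
# Sketch: per-frame class entropy H = -sum_k p_k log p_k over posteriors,
# with a simple threshold as the boundary decision method.
import numpy as np

def class_entropy(posteriors: np.ndarray) -> np.ndarray:
    """posteriors: (frames, classes), rows summing to 1."""
    p = np.clip(posteriors, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

# toy posteriors: confident -> uncertain (transition) -> confident
post = np.array([[0.97, 0.02, 0.01],
                 [0.50, 0.45, 0.05],    # near a class transition
                 [0.02, 0.97, 0.01]])
h = class_entropy(post)
boundaries = np.where(h > 0.5)[0]       # entropy peak marks the boundary
print(h.round(3), boundaries)           # [0.154 0.856 0.154] [1]
```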
https://arxiv.org/abs/2401.05717
Recent advancements in Large Language Models (LLMs) have led to high-quality Machine-Generated Text (MGT), giving rise to countless new use cases and applications. However, easy access to LLMs is posing new challenges due to misuse. To address malicious usage, researchers have released datasets to effectively train models on MGT-related tasks. Similar strategies are used to compile these datasets, but no tool currently unifies them. In this scenario, we introduce TextMachina, a modular and extensible Python framework, designed to aid in the creation of high-quality, unbiased datasets to build robust models for MGT-related tasks such as detection, attribution, or boundary detection. It provides a user-friendly pipeline that abstracts away the inherent intricacies of building MGT datasets, such as LLM integrations, prompt templating, and bias mitigation. The quality of the datasets generated by TextMachina has been assessed in previous works, including shared tasks where more than one hundred teams trained robust MGT detectors.
https://arxiv.org/abs/2401.03946
In this work, we investigate the means of using curiosity on replay buffers to improve offline multi-task continual reinforcement learning when tasks, which are defined by the non-stationarity in the environment, are unlabeled and not evenly exposed to the learner over time. In particular, we investigate the use of curiosity both as a tool for task boundary detection and as a priority metric for retaining old transition tuples, which leads us to propose two different buffers. Firstly, we propose a Hybrid Reservoir Buffer with Task Separation (HRBTS), where curiosity is used to detect task boundaries that are not known due to the task-agnostic nature of the problem. Secondly, by using curiosity as a priority metric for retaining old transition tuples, we propose a Hybrid Curious Buffer (HCB). We ultimately show that these buffers, in conjunction with regular reinforcement learning algorithms, can be used to alleviate the catastrophic forgetting issue suffered by state-of-the-art replay buffers when the agent's exposure to tasks is not equal over time. We evaluate catastrophic forgetting and the efficiency of our proposed buffers against the latest works, such as the Hybrid Reservoir Buffer (HRB) and the Multi-Time Scale Replay Buffer (MTR), in three different continual reinforcement learning settings. Experiments were done on classical control tasks and the Metaworld environment. Experiments show that our proposed replay buffers display better immunity to catastrophic forgetting compared to existing works in most of the settings.
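A hedged sketch of curiosity-prioritized retention, the idea behind HCB: when the buffer is full, evict the transition with the lowest curiosity score so surprising experience from rarely seen tasks survives. The curiosity scores here are supplied by the caller; the paper derives them from an intrinsic curiosity signal, and the data structure below is an illustration, not the paper's implementation.

```python
# Sketch: a replay buffer that keeps the most "curious" transitions.
import heapq

class CuriousBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.heap = []                   # min-heap keyed on curiosity score

    def add(self, transition, curiosity: float):
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, (curiosity, id(transition), transition))
        elif curiosity > self.heap[0][0]:  # beats the least-curious item
            heapq.heapreplace(self.heap, (curiosity, id(transition), transition))

buf = CuriousBuffer(capacity=2)
buf.add(("s", "a", "r", "s2"), curiosity=0.1)
buf.add(("s", "a", "r", "s3"), curiosity=0.9)
buf.add(("s", "a", "r", "s4"), curiosity=0.5)  # evicts the 0.1 transition
```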
https://arxiv.org/abs/2312.03177
Early weakly supervised video grounding (WSVG) methods often struggle with incomplete boundary detection due to the absence of temporal boundary annotations. To bridge the gap between video-level and boundary-level annotation, explicit-supervision methods, i.e., those generating pseudo-temporal boundaries for training, have achieved great success. However, data augmentations in these methods might disrupt critical temporal information, yielding poor pseudo boundaries. In this paper, we propose a new perspective that maintains the integrity of the original temporal content while introducing more valuable information for expanding the incomplete boundaries. To this end, we propose EtC (Expand then Clarify), which first uses additional information to expand the initial incomplete pseudo boundaries and subsequently refines them to achieve precise boundaries. Motivated by video continuity, i.e., the visual similarity across adjacent frames, we use powerful multimodal large language models (MLLMs) to annotate each frame within the initial pseudo boundaries, yielding more comprehensive descriptions for the expanded boundaries. To further clarify the noise in the expanded boundaries, we combine mutual learning with a tailored proposal-level contrastive objective, using a learnable approach to balance the incomplete yet clean (initial) boundaries against the comprehensive yet noisy (expanded) ones and obtain more precise boundaries. Experiments demonstrate the superiority of our method on two challenging WSVG datasets.
https://arxiv.org/abs/2312.02483
Precise and rapid delineation of sharp boundaries and robust semantics is essential for numerous downstream robotic tasks, such as robot grasping and manipulation, real-time semantic mapping, and online sensor calibration performed on edge computing units. Although boundary detection and semantic segmentation are complementary tasks, most studies focus on lightweight models for semantic segmentation but overlook the critical role of boundary detection. In this work, we introduce Mobile-Seed, a lightweight, dual-task framework tailored for simultaneous semantic segmentation and boundary detection. Our framework features a two-stream encoder, an active fusion decoder (AFD) and a dual-task regularization approach. The encoder is divided into two pathways: one captures category-aware semantic information, while the other discerns boundaries from multi-scale features. The AFD module dynamically adapts the fusion of semantic and boundary information by learning channel-wise relationships, allowing for precise weight assignment of each channel. Furthermore, we introduce a regularization loss to mitigate the conflicts in dual-task learning and deep diversity supervision. Compared to existing methods, the proposed Mobile-Seed offers a lightweight framework to simultaneously improve semantic segmentation performance and accurately locate object boundaries. Experiments on the Cityscapes dataset have shown that Mobile-Seed achieves notable improvement over the state-of-the-art (SOTA) baseline by 2.2 percentage points (pp) in mIoU and 4.2 pp in mF-score, while maintaining an online inference speed of 23.9 frames-per-second (FPS) with 1024x2048 resolution input on an RTX 2080 Ti GPU. Additional experiments on CamVid and PASCAL Context datasets confirm our method's generalizability. Code and additional results are publicly available at \url{this https URL}.
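The channel-wise fusion idea can be sketched with a squeeze-style gate: learn per-channel weights from the concatenated streams and blend semantic and boundary features accordingly. This is a hedged illustration in the spirit of the AFD module; the gating design and dimensions are assumptions, not Mobile-Seed's exact architecture.

```python
# Sketch: channel-wise active fusion of semantic and boundary feature streams.
import torch
import torch.nn as nn

class ActiveFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),             # squeeze spatial dims
            nn.Conv2d(2 * channels, channels, 1),
            nn.Sigmoid(),                        # per-channel fusion weights
        )

    def forward(self, semantic: torch.Tensor, boundary: torch.Tensor):
        w = self.gate(torch.cat([semantic, boundary], dim=1))
        return w * semantic + (1 - w) * boundary  # precise per-channel blend

sem = torch.randn(1, 128, 64, 64)
bnd = torch.randn(1, 128, 64, 64)
fused = ActiveFusion(128)(sem, bnd)              # (1, 128, 64, 64)
```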
https://arxiv.org/abs/2311.12651