Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
https://arxiv.org/abs/2411.19772
Discontinuous Named Entity Recognition (DNER) presents a challenging problem where entities may be scattered across multiple non-adjacent tokens, making traditional sequence labelling approaches inadequate. Existing methods predominantly rely on custom tagging schemes to handle these discontinuous entities, resulting in models tightly coupled to specific tagging strategies and lacking generalisability across diverse datasets. To address these challenges, we propose TriG-NER, a novel Triplet-Grid Framework that introduces a generalisable approach to learning robust token-level representations for discontinuous entity extraction. Our framework applies triplet loss at the token level, where similarity is defined by word pairs existing within the same entity, effectively pulling together similar and pushing apart dissimilar ones. This approach enhances entity boundary detection and reduces the dependency on specific tagging schemes by focusing on word-pair relationships within a flexible grid structure. We evaluate TriG-NER on three benchmark DNER datasets and demonstrate significant improvements over existing grid-based architectures. These results underscore our framework's effectiveness in capturing complex entity structures and its adaptability to various tagging schemes, setting a new benchmark for discontinuous entity extraction.
https://arxiv.org/abs/2411.01839
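The token-level triplet loss described in the TriG-NER abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Euclidean distance, the margin value, and the exhaustive enumeration of triples are all assumptions made for illustration. Tokens that share an entity act as anchor and positive, and any token outside that entity acts as a negative.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def token_triplet_loss(embeddings, entity_ids, margin=1.0):
    """Average hinge triplet loss over all (anchor, positive, negative) token triples.

    entity_ids[i] is the entity label of token i, or None when the token
    belongs to no entity (hypothetical encoding, assumed for this sketch).
    """
    losses = []
    n = len(embeddings)
    for a in range(n):
        if entity_ids[a] is None:
            continue  # only entity tokens serve as anchors
        for p in range(n):
            if p == a or entity_ids[p] != entity_ids[a]:
                continue  # the positive must share the anchor's entity
            for neg in range(n):
                if entity_ids[neg] == entity_ids[a]:
                    continue  # the negative must lie outside the entity
                d_pos = euclidean(embeddings[a], embeddings[p])
                d_neg = euclidean(embeddings[a], embeddings[neg])
                losses.append(max(0.0, d_pos - d_neg + margin))
    return sum(losses) / len(losses) if losses else 0.0
```

Because the pairing depends only on entity membership, the two tokens of a discontinuous entity can be pulled together even when unrelated tokens sit between them.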
With the increasing use of generative models for text generation and the widespread use of machine-generated text in various domains, distinguishing between human-written and machine-generated text is a significant challenge. While existing models and proprietary systems focus on identifying whether a given text is entirely human-written or entirely machine-generated, only a few systems estimate the likelihood of machine generation at the sentence or paragraph level, and they do so with unreliable accuracy, working well only for a limited set of domains and generators. This paper introduces a few reliable approaches for the novel task of identifying, at the word level, which part of a given text is machine-generated, and compares results across different approaches and methods. We present a comparison with proprietary systems and report the performance of our model on texts from unseen domains and generators. The findings reveal significant improvements in detection accuracy, along with comparisons on other aspects of detection capability. Finally, we discuss potential avenues for improvement and the implications of our work. The proposed model is also well suited for detecting which parts of a text are machine-generated in the outputs of Instruct variants of many LLMs.
https://arxiv.org/abs/2410.16659
Achieving precise medical image segmentation is vital for effective treatment planning and accurate disease diagnosis. Traditional fully-supervised deep learning methods, though highly precise, are heavily reliant on large volumes of labeled data, which are often difficult to obtain due to the expertise required for medical annotations. This has led to the rise of semi-supervised learning approaches that utilize both labeled and unlabeled data to mitigate the label scarcity issue. In this paper, we introduce the Manifold-Aware Local Feature Modeling Network (MANet), which enhances the U-Net architecture by incorporating manifold supervision signals. This approach focuses on improving boundary accuracy, which is crucial for reliable medical diagnosis. To further extend the versatility of our method, we propose two variants: MA-Sobel and MA-Canny. The MA-Sobel variant employs the Sobel operator, which is effective for both 2D and 3D data, while the MA-Canny variant utilizes the Canny operator, specifically designed for 2D images, to refine boundary detection. These variants allow our method to adapt to various medical image modalities and dimensionalities, ensuring broader applicability. Our extensive experiments on datasets such as ACDC, LA, and Pancreas-NIH demonstrate that MANet consistently surpasses state-of-the-art methods in performance metrics like Dice and Jaccard scores. The proposed method also shows improved generalization across various semi-supervised segmentation networks, highlighting its robustness and effectiveness. Visual analysis of segmentation results confirms that MANet offers clearer and more accurate class boundaries, underscoring the value of manifold information in medical image segmentation.
https://arxiv.org/abs/2410.10287
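The MANet abstract above names the Sobel operator as the edge detector behind its MA-Sobel variant. As a rough illustration of the operator itself (not of how MANet turns it into a supervision signal, which the abstract does not specify), here is a pure-Python 2D Sobel gradient magnitude. The function name and the zero-valued border handling are assumptions of this sketch.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Gradient magnitude of a 2D image given as a list of rows.

    Border pixels are left at 0.0 (an assumption; other padding schemes exist).
    """
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out
```

On a binary segmentation mask, the magnitude is nonzero only near class boundaries, which is why an operator like this is a natural source of boundary cues.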
With the development of video understanding, there is a proliferation of tasks for clip-level temporal video analysis, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance in each task, there remains a dearth of a unified framework capable of simultaneously addressing multiple tasks, which is a promising direction for the next generation of AI. To this end, in this paper, we propose a single unified framework, coined as Temporal2Seq, to formulate the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing the datasets from TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of three tasks, demonstrating that Temporal2Seq can produce reasonable results on various tasks and achieve advantages compared with single-task training on this framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, which yields superior performance to the specific model.
https://arxiv.org/abs/2409.18478
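The Temporal2Seq abstract above says that task outputs are formulated as a sequence of discrete tokens, but it does not give the tokenization. The sketch below shows one plausible scheme under loudly stated assumptions: timestamps are quantized into `n_bins` bins, and class labels are offset past the time vocabulary. The function name, the bin count, and the token layout are hypothetical and are not taken from the paper.

```python
def segment_to_tokens(start, end, label_id, duration, n_bins=100):
    """Encode one temporal segment as three discrete tokens.

    Hypothetical layout: [start_bin, end_bin, n_bins + label_id].
    Times are in seconds; duration is the video length in seconds.
    """
    def quantize(t):
        # Map a timestamp to a bin index, clamping the final instant.
        return min(n_bins - 1, int(t * n_bins / duration))

    return [quantize(start), quantize(end), n_bins + label_id]
```

With a shared vocabulary like this, detection, segmentation, and boundary outputs can in principle all be emitted by one sequence decoder, which is the kind of unification the abstract describes.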
We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster.
https://arxiv.org/abs/2409.14486
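The boundary-prediction step in the speech segmentation abstract above (dissimilarity between adjacent self-supervised features) can be sketched in a few lines. This is a minimal illustration, not the authors' code: cosine distance, the threshold value, and the simple local-maximum rule are assumptions, and the subsequent clustering step is omitted.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def predict_boundaries(frames, threshold=0.5):
    """Return frame indices where a new segment is predicted to start.

    A boundary is placed before frame i+1 when the distance between
    frames i and i+1 is a local peak that reaches the threshold.
    """
    dists = [cosine_distance(frames[i], frames[i + 1])
             for i in range(len(frames) - 1)]
    boundaries = []
    for i, value in enumerate(dists):
        left = dists[i - 1] if i > 0 else float("-inf")
        right = dists[i + 1] if i + 1 < len(dists) else float("-inf")
        if value >= threshold and value > left and value >= right:
            boundaries.append(i + 1)
    return boundaries
```

No scoring model or dynamic programming is involved, which is consistent with the speed advantage the abstract reports.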
Owing to diverse geographical environments, intricate landscapes, and high-density settlements, the automatic identification of urban village boundaries using remote sensing images is a highly challenging task. This paper proposes a novel and efficient neural network model called UV-Mamba for accurate boundary detection in high-resolution remote sensing images. UV-Mamba mitigates the memory loss problem in long sequence modeling, which arises in state space models (SSMs) as image size increases, by incorporating deformable convolutions (DCN). Its architecture follows an encoder-decoder framework: an encoder with four deformable state space augmentation (DSSA) blocks performs efficient multi-level semantic extraction, and a decoder integrates the extracted semantic information. We conducted experiments on the Beijing and Xi'an datasets, and the results show that UV-Mamba achieves state-of-the-art performance. Specifically, our model achieves 73.3% and 78.1% IoU on the Beijing and Xi'an datasets, respectively, representing improvements of 1.2% and 3.4% IoU over the previous best model, while also being 6x faster in inference speed and 40x smaller in parameter count. Source code and pre-trained models are available in the supplementary material.
https://arxiv.org/abs/2409.03431
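The UV-Mamba abstract above reports results as IoU (intersection over union). For readers unfamiliar with the metric, a minimal sketch for binary masks follows. The function name and the convention of returning 1.0 when both masks are empty are assumptions of this example, not details from the paper.

```python
def iou(pred, gold):
    """Intersection over union of two binary masks given as lists of rows."""
    inter = union = 0
    for pred_row, gold_row in zip(pred, gold):
        for p, g in zip(pred_row, gold_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0
```

A reported 73.3% IoU means that the overlap between predicted and reference regions covers 73.3% of their combined area.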
This work presents the INBD network proposed by Gillert et al. in CVPR-2023 and studies its application for delineating tree rings in RGB images of Pinus taeda cross sections captured by a smartphone (UruDendro dataset), which are images with different characteristics from the ones used to train the method. The INBD network operates in two stages: first, it segments the background, pith, and ring boundaries. In the second stage, the image is transformed into polar coordinates, and ring boundaries are iteratively segmented from the pith to the bark. Both stages are based on the U-Net architecture. The method achieves an F-Score of 77.5, a mAR of 0.540, and an ARAND of 0.205 on the evaluation set. The code for the experiments is available at this https URL.
https://arxiv.org/abs/2408.14343
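The second INBD stage described above transforms the image into polar coordinates centered on the pith, so that roughly circular rings become roughly horizontal bands. A minimal nearest-neighbor sketch of such a transform follows. The function name, the sampling resolution, and the zero fill for out-of-range samples are assumptions; the actual INBD code may differ.

```python
import math


def to_polar(image, center, n_angles, n_radii):
    """Resample a 2D image (list of rows) onto an (angle, radius) grid.

    center is (row, col) of the pith. Each output row is one ray from the
    center outward; samples outside the image are filled with 0.
    """
    cy, cx = center
    h, w = len(image), len(image[0])
    polar = []
    for a in range(n_angles):
        theta = 2 * math.pi * a / n_angles
        ray = []
        for r in range(n_radii):
            y = int(round(cy + r * math.sin(theta)))
            x = int(round(cx + r * math.cos(theta)))
            ray.append(image[y][x] if 0 <= y < h and 0 <= x < w else 0)
        polar.append(ray)
    return polar
```

In the polar image, walking along a row moves from pith toward bark, which matches the order in which the abstract says ring boundaries are segmented.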
Generic event boundary detection (GEBD), inspired by the human visual cognitive behavior of consistently segmenting videos into meaningful temporal chunks, finds utility in various applications such as video editing. In this paper, we demonstrate that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios. We contribute to addressing this challenge by experimentally reexamining the architecture of GEBD models and uncovering several surprising findings. Firstly, we reveal that a concise GEBD baseline model already achieves promising performance without any sophisticated design. Secondly, we find that the widely applied image-domain backbones in GEBD models can contain plenty of architecture redundancy, motivating us to gradually "modernize" each component to enhance efficiency. Thirdly, we show that GEBD models that use image-domain backbones and conduct spatiotemporal learning in a spatial-then-temporal greedy manner can suffer from a distraction issue, which might be the culprit behind the inefficiency of GEBD models. Using a video-domain backbone to jointly conduct spatiotemporal modeling is an effective solution for this issue. The outcome of our exploration is a family of GEBD models, named EfficientGEBD, which significantly outperforms the previous SOTA methods by up to 1.7% performance gain and 280% speedup under the same backbone. Our research prompts the community to design modern GEBD methods with the consideration of model complexity, particularly in resource-aware applications. The code is available at this https URL.
https://arxiv.org/abs/2407.12622
Generic event boundary detection (GEBD) aims at pinpointing event boundaries naturally perceived by humans, playing a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, spanning different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect various boundaries by the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more intelligent and reasonable way is to adaptively detect boundaries by considering their special properties. In light of this, we propose a novel dynamic pipeline for generic event boundaries named DyBDet. By introducing a multi-exit network architecture, DyBDet automatically learns the subnet allocation to different video snippets, enabling fine-grained detection for various boundaries. Besides, a multi-order difference detector is also proposed to ensure generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks, leading to obvious improvements in both performance and efficiency compared to the current state-of-the-art.
https://arxiv.org/abs/2407.04274
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best method in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at this https URL.
https://arxiv.org/abs/2407.02228
Moment retrieval aims to locate the most relevant moment in an untrimmed video based on a given natural language query. Existing solutions can be roughly categorized into moment-based and clip-based methods. The former often involves heavy computation, while the latter, because it overlooks coarse-grained information, typically underperforms moment-based models. Hence, this paper proposes a novel 2-Dimensional Pointer-based Machine Reading Comprehension for Moment Retrieval Choice (2DP-2MRC) model to address the issue of imprecise localization in clip-based methods while maintaining lower computational complexity than moment-based methods. Specifically, we introduce an AV-Encoder to capture coarse-grained information at the moment and video levels. Additionally, a 2D pointer encoder module is introduced to further enhance boundary detection for the target moment. Extensive experiments on the HiREST dataset demonstrate that 2DP-2MRC significantly outperforms existing baseline models.
https://arxiv.org/abs/2406.06201
Computational morphology handles language processing at the word level. It is one of the foundational tasks in the NLP pipeline for the development of higher-level NLP applications, and it mainly deals with the processing of words and word forms. Computational morphology addresses various subproblems such as morpheme boundary detection, lemmatization, morphological feature tagging, and morphological reinflection. In this paper, we present an exhaustive survey of methods for developing computational morphology tools. We survey the literature in chronological order, from conventional methods to the recent evolution of deep neural network based approaches. We also review the existing datasets available for this task across languages. We discuss the effectiveness of neural models compared with traditional models and present some unique challenges associated with building computational morphology tools. We conclude by discussing some recent and open research issues in this field.
https://arxiv.org/abs/2406.05424
A key challenge for the widespread application of learning-based models for robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to speed up deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by exploiting the groundwork laid by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance the performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by applying PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations. The code of our work is publicly available at this http URL.
https://arxiv.org/abs/2405.19035
How to extract instance-level masks without instance-level supervision is the main challenge of weakly supervised instance segmentation (WSIS). Popular WSIS methods estimate a displacement field (DF) by learning inter-pixel relations and perform clustering to identify instances. However, the resulting instance centroids are inherently unstable and vary significantly across different clustering algorithms. In this paper, we propose Boundary-Assisted Instance Segmentation (BAISeg), a novel paradigm for WSIS that realizes instance segmentation with pixel-level annotations. BAISeg comprises an instance-aware boundary detection (IABD) branch and a semantic segmentation branch. The IABD branch identifies instances by predicting class-agnostic instance boundaries rather than instance centroids, which distinguishes it from previous DF-based approaches. In particular, we propose the Cascade Fusion Module (CFM) and the Deep Mutual Attention (DMA) in the IABD branch to obtain rich contextual information and capture instance boundaries with weak responses. During the training phase, we employ Pixel-to-Pixel Contrast to enhance the discriminative capacity of the IABD branch, which further strengthens the continuity and closedness of the instance boundaries. Extensive experiments on PASCAL VOC 2012 and MS COCO demonstrate the effectiveness of our approach, and we achieve considerable performance with only pixel-level annotations. The code will be available at this https URL.
https://arxiv.org/abs/2406.18558
This paper explores capabilities of Vision Language Models on spreadsheet comprehension. We propose three self-supervised challenges with corresponding evaluation metrics to comprehensively evaluate VLMs on Optical Character Recognition (OCR), spatial perception, and visual format recognition. Additionally, we utilize the spreadsheet table detection task to assess the overall performance of VLMs by integrating these challenges. To probe VLMs more finely, we propose three spreadsheet-to-image settings: column width adjustment, style change, and address augmentation. We propose variants of prompts to address the above tasks in different settings. Notably, to leverage the strengths of VLMs in understanding text rather than two-dimensional positioning, we propose to decode cell values on the four boundaries of the table in spreadsheet boundary detection. Our findings reveal that VLMs demonstrate promising OCR capabilities but produce unsatisfactory results due to cell omission and misalignment, and they notably exhibit insufficient spatial and format recognition skills, motivating future work to enhance VLMs' spreadsheet data comprehension capabilities using our methods to generate extensive spreadsheet-image pairs in various settings.
https://arxiv.org/abs/2405.16234
The Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection shared task in the SemEval-2024 competition aims to tackle the problem of misusing collaborative human-AI writing. Although many detectors of AI content exist, they are often designed to give a binary answer and thus may not be suitable for the more nuanced problem of finding the boundaries between human-written and machine-generated texts, even as hybrid human-AI writing becomes more and more popular. In this paper, we address the boundary detection problem. In particular, we present a pipeline for augmenting data for supervised fine-tuning of DeBERTaV3. With this pipeline, we achieve a new best MAE score according to the competition leaderboard.
https://arxiv.org/abs/2405.10629
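The SemEval-2024 boundary detection abstract above reports its result as an MAE score. A minimal sketch of mean absolute error over predicted and reference boundary positions is shown below. Treating each boundary as a single integer word index per text is an assumption of this example, as is the function name.

```python
def mean_absolute_error(predicted, reference):
    """Mean absolute difference between predicted and reference boundary indices."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)
```

Lower is better: an MAE of 2.0 would mean the predicted human-to-machine switch point is off by two positions on average.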
Relational triple extraction is crucial for the automatic construction of knowledge graphs. Existing methods only construct shallow representations at the token or token-pair level. Moreover, previous works ignore the local spatial dependencies of relational triples, which weakens entity pair boundary detection. To tackle this problem, we propose a novel Region-based Table Filling method (RTF). We devise a novel region-based tagging scheme and bi-directional decoding strategy, which regard each relational triple as a region on the relation-specific table and identify triples by determining the two endpoints of each region. We also introduce convolution to construct region-level table representations from a spatial perspective, which makes triples easier to capture. In addition, we share partial tagging scores among different relations to improve the learning efficiency of the relation classifier. Experimental results show that our method achieves state-of-the-art performance with better generalization capability on three variants of two widely used benchmark datasets.
https://arxiv.org/abs/2404.19154
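The RTF abstract above treats each relational triple as a region on a relation-specific table and identifies it by the region's two endpoints. The sketch below shows how two such endpoints could be decoded back into a subject and an object. The convention that rows index subject tokens and columns index object tokens, along with the function name, is an assumption made for illustration and is not confirmed by the abstract.

```python
def decode_region(upper_left, bottom_right, tokens):
    """Recover (subject, object) text from the two corner cells of a region.

    Assumed convention: a cell is (subject_token_index, object_token_index),
    so the region spans the subject's rows and the object's columns.
    """
    (subj_start, obj_start), (subj_end, obj_end) = upper_left, bottom_right
    subject = " ".join(tokens[subj_start:subj_end + 1])
    obj = " ".join(tokens[obj_start:obj_end + 1])
    return subject, obj
```

For example, with the tokens of "Barack Obama was born in Hawaii", the corners `(0, 5)` and `(1, 5)` on a hypothetical born-in table decode to the pair "Barack Obama" and "Hawaii".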
With the increasing prevalence of text generated by large language models (LLMs), there is a growing concern about distinguishing between LLM-generated and human-written texts in order to prevent the misuse of LLMs, such as the dissemination of misleading information and academic dishonesty. Previous research has primarily focused on classifying text as either entirely human-written or LLM-generated, neglecting the detection of mixed texts that contain both types of content. This paper explores LLMs' ability to identify boundaries in human-written and machine-generated mixed texts. We approach this task by transforming it into a token classification problem and regard the label turning point as the boundary. Notably, our ensemble model of LLMs achieved first place in the 'Human-Machine Mixed Text Detection' sub-task of the SemEval'24 Competition Task 8. Additionally, we investigate factors that influence the capability of LLMs in detecting boundaries within mixed texts, including the incorporation of extra layers on top of LLMs, combination of segmentation loss, and the impact of pretraining. Our findings aim to provide valuable insights for future research in this area.
https://arxiv.org/abs/2404.00899
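The mixed-text abstract above frames boundary detection as token classification and takes the label turning point as the boundary. A minimal sketch of that final decoding step follows. The 0/1 label encoding and the function name are assumptions, and the classifier that produces the labels is out of scope here.

```python
def find_boundary(labels):
    """Return the index of the first token whose label differs from its predecessor.

    labels is a per-token sequence (assumed 0 = human-written,
    1 = machine-generated). Returns None when the label never changes.
    """
    for i in range(1, len(labels)):
        if labels[i] != labels[i - 1]:
            return i
    return None
```

Noisy per-token predictions can flip more than once, so a practical system would likely smooth the labels before applying a rule like this; the abstract does not say how that case is handled.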
Sign Language Recognition (SLR) has garnered significant attention from researchers in recent years, particularly the intricate domain of Continuous Sign Language Recognition (CSLR), which presents heightened complexity compared to Isolated Sign Language Recognition (ISLR). One of the prominent challenges in CSLR pertains to accurately detecting the boundaries of isolated signs within a continuous video stream. Additionally, the reliance on handcrafted features in existing models poses a challenge to achieving optimal accuracy. To surmount these challenges, we propose a novel approach utilizing a Transformer-based model. Unlike traditional models, our approach focuses on enhancing accuracy while eliminating the need for handcrafted features. The Transformer model is employed for both ISLR and CSLR. The training process involves using isolated sign videos, where hand keypoint features extracted from the input video are enriched using the Transformer model. Subsequently, these enriched features are forwarded to the final classification layer. The trained model, coupled with a post-processing method, is then applied to detect isolated sign boundaries within continuous sign videos. The evaluation of our model, conducted on two distinct datasets that include both continuous signs and their corresponding isolated signs, demonstrates promising results.
https://arxiv.org/abs/2402.14720