Efficient use of cultivated areas is a necessary factor for the sustainable development of agriculture and for ensuring food security. Alongside the rapid development of satellite technologies in developed countries, new methods are being sought for the accurate and timely identification of cultivated areas. In this context, identifying cropland boundaries through spectral analysis of satellite imagery is considered one of the most practical and accurate methods in modern agriculture. This article proposes a new approach to determining the suitability and green index of cultivated areas using satellite data obtained through the Google Earth Engine (GEE) platform. The approach combines two powerful algorithms: SNIC (Simple Non-Iterative Clustering) superpixels and the Canny edge detection method. The SNIC algorithm groups pixels in a satellite image into larger regions (superpixels) with similar characteristics, thereby enabling better image analysis. The Canny edge detector finds sharp changes (edges) in the image to determine the precise boundaries of agricultural fields. This study, carried out using high-resolution multispectral data from the Sentinel-2 satellite and the Google Earth Engine JavaScript API, shows that the proposed method classifies randomly selected agricultural fields accurately and reliably. The combined use of these two tools allows field boundaries to be determined more accurately by minimizing the effect of outliers in satellite images. As a result, more accurate and reliable maps can be created for agricultural monitoring and resource management over large areas, expanding the application of cloud-based platforms and artificial-intelligence methods in agriculture.
https://arxiv.org/abs/2502.04529
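A minimal sketch of the SNIC + Canny pipeline described in the abstract above, written against the Earth Engine Python API (the paper uses the JavaScript API); the region, date range, band choices, and SNIC/Canny parameters are illustrative assumptions, not the paper's settings:

```python
import ee

ee.Initialize()  # assumes an authenticated Earth Engine account

# Illustrative region of interest and growing-season date range.
roi = ee.Geometry.Rectangle([69.1, 41.2, 69.4, 41.4])

# Cloud-filtered median Sentinel-2 surface-reflectance composite.
s2 = (ee.ImageCollection('COPERNICUS/S2_SR_HARMONIZED')
      .filterBounds(roi)
      .filterDate('2023-05-01', '2023-09-01')
      .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10))
      .median()
      .clip(roi))

# Green index: NDVI from the NIR (B8) and red (B4) bands.
ndvi = s2.normalizedDifference(['B8', 'B4']).rename('NDVI')

# SNIC superpixels over the visible + NIR bands.
snic = ee.Algorithms.Image.Segmentation.SNIC(
    image=s2.select(['B2', 'B3', 'B4', 'B8']),
    size=30, compactness=1, connectivity=8)

# Canny edges on NDVI as candidate field boundaries.
edges = ee.Algorithms.CannyEdgeDetector(image=ndvi, threshold=0.4, sigma=1)
```

Combining the SNIC cluster map with the Canny edge mask (e.g., keeping superpixel borders that coincide with strong edges) is one way to realize the boundary refinement the abstract describes.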
Aspect Sentiment Triplet Extraction (ASTE) is a thriving research area with impressive outcomes being achieved on high-resource languages. However, the application of cross-lingual transfer to the ASTE task has been relatively unexplored, and current code-switching methods still suffer from term boundary detection issues and out-of-dictionary problems. In this study, we introduce a novel Test-Time Code-SWitching (TT-CSW) framework, which bridges the gap between the bilingual training phase and the monolingual test-time prediction. During training, a generative model is developed based on bilingual code-switched training data and can produce bilingual ASTE triplets for bilingual inputs. In the testing stage, we employ an alignment-based code-switching technique for test-time augmentation. Extensive experiments on cross-lingual ASTE datasets validate the effectiveness of our proposed method. We achieve an average improvement of 3.7% in weighted-average F1 across four datasets in different languages. Additionally, we set a benchmark using ChatGPT and GPT-4, and demonstrate that even smaller generative models fine-tuned with our proposed TT-CSW framework surpass ChatGPT and GPT-4 by 14.2% and 5.0%, respectively.
https://arxiv.org/abs/2501.14144
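A toy illustration of the test-time code-switching idea: aligned source-language terms are swapped for target-language counterparts before prediction. The hand-written lexicon below is purely hypothetical; the paper derives term alignments automatically rather than from a dictionary:

```python
# Hypothetical English->Chinese term lexicon (illustration only).
lexicon = {'battery': '电池', 'excellent': '出色'}

def code_switch(tokens):
    """Swap aligned terms into the target language, leaving the rest."""
    return [lexicon.get(tok.lower(), tok) for tok in tokens]

print(code_switch('The battery life is excellent'.split()))
# ['The', '电池', 'life', 'is', '出色']
```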
In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at this https URL.
https://arxiv.org/abs/2501.03711
Multi-class semantic segmentation remains a cornerstone challenge in computer vision. Yet, dataset creation remains excessively demanding in time and effort, especially for specialized domains. Active Learning (AL) mitigates this challenge by selecting data points for annotation strategically. However, existing patch-based AL methods often overlook the critical information carried by boundary pixels, which is essential for accurate segmentation. We present OREAL, a novel patch-based AL method designed for multi-class semantic segmentation. OREAL enhances boundary detection by employing maximum aggregation of pixel-wise uncertainty scores. Additionally, we introduce one-vs-rest entropy, a novel uncertainty score function that computes class-wise uncertainties while achieving implicit class balancing during dataset creation. Comprehensive experiments across diverse datasets and model architectures validate our hypothesis.
https://arxiv.org/abs/2412.06470
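A small sketch of the two scoring ideas named above, under one plausible reading of the abstract (not the paper's exact formulas): one-vs-rest entropy treats each class as a binary class-vs-rest problem, and a patch's score is the maximum over its pixels so that a few highly uncertain boundary pixels dominate:

```python
import numpy as np

def one_vs_rest_entropy(probs):
    """Binary class-vs-rest entropy per class and pixel.
    probs: (C, H, W) softmax probabilities -> (C, H, W) uncertainties."""
    p = np.clip(probs, 1e-8, 1 - 1e-8)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def patch_scores(probs):
    """Max-aggregate pixel uncertainties into per-class patch scores."""
    return one_vs_rest_entropy(probs).max(axis=(1, 2))  # (C,)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 16, 16))                   # 4 classes
probs = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
print(patch_scores(probs))
```

Keeping the scores class-wise (rather than collapsing them immediately) is what would allow the implicit class balancing mentioned in the abstract, e.g., by sampling patches per class.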
Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
https://arxiv.org/abs/2411.19772
Discontinuous Named Entity Recognition (DNER) presents a challenging problem where entities may be scattered across multiple non-adjacent tokens, making traditional sequence labelling approaches inadequate. Existing methods predominantly rely on custom tagging schemes to handle these discontinuous entities, resulting in models tightly coupled to specific tagging strategies and lacking generalisability across diverse datasets. To address these challenges, we propose TriG-NER, a novel Triplet-Grid Framework that introduces a generalisable approach to learning robust token-level representations for discontinuous entity extraction. Our framework applies triplet loss at the token level, where similarity is defined by word pairs existing within the same entity, effectively pulling similar pairs together and pushing dissimilar ones apart. This approach enhances entity boundary detection and reduces the dependency on specific tagging schemes by focusing on word-pair relationships within a flexible grid structure. We evaluate TriG-NER on three benchmark DNER datasets and demonstrate significant improvements over existing grid-based architectures. These results underscore our framework's effectiveness in capturing complex entity structures and its adaptability to various tagging schemes, setting a new benchmark for discontinuous entity extraction.
https://arxiv.org/abs/2411.01839
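A schematic sketch of a token-level triplet loss of the kind described above, where positives are token pairs inside the same entity; this is a generic hard-mining formulation, not the paper's exact grid-based candidate selection:

```python
import torch
import torch.nn.functional as F

def token_triplet_loss(emb, entity_ids, margin=0.5):
    """emb: (T, D) token embeddings; entity_ids: (T,), -1 = not in any entity.
    Pulls same-entity tokens together, pushes different-entity tokens apart."""
    dist = torch.cdist(emb, emb)                                   # (T, T)
    same = entity_ids.unsqueeze(0) == entity_ids.unsqueeze(1)
    valid = (entity_ids.unsqueeze(0) >= 0) & (entity_ids.unsqueeze(1) >= 0)
    pos = same & valid & ~torch.eye(len(emb), dtype=torch.bool)
    neg = ~same & valid
    losses = []
    for a in range(len(emb)):                                      # anchors
        if pos[a].any() and neg[a].any():
            hardest_pos = dist[a][pos[a]].max()    # furthest positive
            hardest_neg = dist[a][neg[a]].min()    # closest negative
            losses.append(F.relu(hardest_pos - hardest_neg + margin))
    return torch.stack(losses).mean() if losses else emb.new_zeros(())

emb = torch.randn(12, 64, requires_grad=True)
# Entity 2 is discontinuous (positions 8 and 10), the DNER case.
entity_ids = torch.tensor([0, 0, -1, 1, 1, 1, -1, -1, 2, -1, 2, -1])
print(token_triplet_loss(emb, entity_ids))
```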
With the increasing use of generative models for text generation and the widespread adoption of machine-generated text across domains, distinguishing human-written from machine-generated text is a significant challenge. While existing models and proprietary systems focus on identifying whether a given text is entirely human-written or entirely machine-generated, only a few systems provide sentence- or paragraph-level estimates of the likelihood of machine generation, and they do so with unreliable accuracy, working well only for a limited set of domains and generators. This paper introduces several reliable approaches for the novel task of identifying which parts of a given text are machine-generated at the word level, comparing results across different approaches and methods. We present a comparison with proprietary systems and report our model's performance on texts from unseen domains and generators. The findings reveal significant improvements in detection accuracy, along with comparisons on other aspects of detection capability. Finally, we discuss potential avenues for improvement and the implications of our work. The proposed model is also well suited to detecting which parts of a text are machine-generated in the outputs of Instruct variants of many LLMs.
https://arxiv.org/abs/2410.16659
Achieving precise medical image segmentation is vital for effective treatment planning and accurate disease diagnosis. Traditional fully-supervised deep learning methods, though highly precise, are heavily reliant on large volumes of labeled data, which are often difficult to obtain due to the expertise required for medical annotations. This has led to the rise of semi-supervised learning approaches that utilize both labeled and unlabeled data to mitigate the label scarcity issue. In this paper, we introduce the Manifold-Aware Local Feature Modeling Network (MANet), which enhances the U-Net architecture by incorporating manifold supervision signals. This approach focuses on improving boundary accuracy, which is crucial for reliable medical diagnosis. To further extend the versatility of our method, we propose two variants: MA-Sobel and MA-Canny. The MA-Sobel variant employs the Sobel operator, which is effective for both 2D and 3D data, while the MA-Canny variant utilizes the Canny operator, specifically designed for 2D images, to refine boundary detection. These variants allow our method to adapt to various medical image modalities and dimensionalities, ensuring broader applicability. Our extensive experiments on datasets such as ACDC, LA, and Pancreas-NIH demonstrate that MANet consistently surpasses state-of-the-art methods in performance metrics like Dice and Jaccard scores. The proposed method also shows improved generalization across various semi-supervised segmentation networks, highlighting its robustness and effectiveness. Visual analysis of segmentation results confirms that MANet offers clearer and more accurate class boundaries, underscoring the value of manifold information in medical image segmentation.
https://arxiv.org/abs/2410.10287
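A minimal sketch of Sobel-based boundary extraction of the kind the MA-Sobel variant relies on: convolving a (predicted or ground-truth) mask with Sobel kernels and taking the gradient magnitude yields a boundary map that can carry an auxiliary supervision signal. This is the generic operator, not the paper's exact head:

```python
import torch
import torch.nn.functional as F

KX = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])  # d/dx
KY = KX.t()                                                        # d/dy

def sobel_boundary(mask):
    """mask: (B, 1, H, W) soft mask -> (B, 1, H, W) boundary magnitude."""
    w = torch.stack([KX, KY]).unsqueeze(1)           # (2, 1, 3, 3)
    g = F.conv2d(mask, w, padding=1)                 # (B, 2, H, W) gradients
    return g.pow(2).sum(1, keepdim=True).sqrt()      # gradient magnitude

mask = torch.zeros(1, 1, 8, 8)
mask[..., 2:6, 2:6] = 1.0                            # a square "organ"
print(sobel_boundary(mask)[0, 0])                    # nonzero along the edges
```

Because the same construction extends to volumes with a third Sobel kernel, it illustrates why Sobel suits both 2D and 3D data, as the abstract notes, while Canny is designed for 2D images.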
With the development of video understanding, there is a proliferation of tasks for clip-level temporal video analysis, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance in each task, there is still no unified framework capable of simultaneously addressing multiple tasks, which is a promising direction for the next generation of AI. To this end, in this paper, we propose a single unified framework, coined Temporal2Seq, to formulate the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing the datasets from the TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of the three tasks, demonstrating that Temporal2Seq produces reasonable results on various tasks and holds advantages over single-task training within this framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, where it yields superior performance to the task-specific models.
https://arxiv.org/abs/2409.18478
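A toy sketch of the discrete-token output format described above: timestamps are quantized into time bins and interleaved with class tokens, so TAD/TAS/GEBD outputs all become one flat sequence. The vocabulary layout is an illustrative assumption:

```python
def to_tokens(segments, duration, num_bins=100):
    """segments: list of (start_sec, end_sec, label) -> flat token list."""
    tokens = []
    for start, end, label in segments:
        tokens += [f'<t{int(start / duration * (num_bins - 1))}>',
                   f'<t{int(end / duration * (num_bins - 1))}>',
                   f'<cls_{label}>']
    return tokens

print(to_tokens([(2.0, 5.5, 'jump'), (7.1, 9.0, 'run')], duration=10.0))
# ['<t19>', '<t54>', '<cls_jump>', '<t70>', '<t89>', '<cls_run>']
```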
We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster.
https://arxiv.org/abs/2409.14486
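The boundary-prediction strategy described above is simple enough to state in a few lines: score each frame transition by the cosine dissimilarity between adjacent self-supervised features (e.g., HuBERT frames) and pick peaks. The peak-picking parameters below are assumptions:

```python
import numpy as np
from scipy.signal import find_peaks

def predict_boundaries(feats, prominence=0.1):
    """feats: (T, D) frame features -> indices of predicted word boundaries."""
    a, b = feats[:-1], feats[1:]
    cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1)
                             * np.linalg.norm(b, axis=-1) + 1e-8)
    dissim = 1.0 - cos                      # high where content changes
    peaks, _ = find_peaks(dissim, prominence=prominence)
    return peaks + 1                        # boundary falls before frame t+1

feats = np.random.default_rng(0).normal(size=(50, 768))
print(predict_boundaries(feats))
```

Clustering the mean feature of each resulting segment (e.g., with k-means) would then give the lexicon-building step.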
Owing to the diverse geographical environments, intricate landscapes, and high-density settlements, the automatic identification of urban village boundaries using remote sensing images is a highly challenging task. This paper proposes a novel and efficient neural network model called UV-Mamba for accurate boundary detection in high-resolution remote sensing images. UV-Mamba mitigates the memory loss problem in long-sequence modeling, which arises in state space models (SSMs) as image size increases, by incorporating deformable convolutions (DCN). Its encoder-decoder architecture comprises an encoder with four deformable state space augmentation (DSSA) blocks for efficient multi-level semantic extraction and a decoder that integrates the extracted semantic information. We conducted experiments on the Beijing and Xi'an datasets, and the results show that UV-Mamba achieves state-of-the-art performance. Specifically, our model achieves 73.3% and 78.1% IoU on the Beijing and Xi'an datasets, respectively, representing improvements of 1.2% and 3.4% IoU over the previous best model, while also being 6x faster in inference speed and 40x smaller in parameter count. Source code and pre-trained models are available in the supplementary material.
https://arxiv.org/abs/2409.03431
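A minimal deformable-convolution block of the kind UV-Mamba incorporates; this is the generic DCN pattern via torchvision, not the paper's DSSA block:

```python
import torch
from torch import nn
from torchvision.ops import DeformConv2d

class DeformBlock(nn.Module):
    """Predicts per-tap sampling offsets, then applies deformable conv."""
    def __init__(self, channels, k=3):
        super().__init__()
        # Two offsets (x, y) per kernel tap.
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.dcn = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x):
        return self.dcn(x, self.offset(x))   # offsets: (B, 2*k*k, H, W)

x = torch.randn(1, 64, 32, 32)
print(DeformBlock(64)(x).shape)               # torch.Size([1, 64, 32, 32])
```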
This work presents the INBD network proposed by Gillert et al. at CVPR 2023 and studies its application to delineating tree rings in RGB images of Pinus taeda cross sections captured with a smartphone (the UruDendro dataset), images whose characteristics differ from those used to train the method. The INBD network operates in two stages: first, it segments the background, pith, and ring boundaries. In the second stage, the image is transformed into polar coordinates, and ring boundaries are iteratively segmented from the pith to the bark. Both stages are based on the U-Net architecture. The method achieves an F-Score of 77.5, a mAR of 0.540, and an ARAND of 0.205 on the evaluation set. The code for the experiments is available at this https URL.
https://arxiv.org/abs/2408.14343
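The polar transform at the heart of INBD's second stage can be reproduced with OpenCV; the synthetic disc below stands in for a cross-section photo, and the pith coordinates are assumed known (stage one estimates them):

```python
import cv2
import numpy as np

# Synthetic disc with concentric rings as a stand-in for a cross section.
img = np.zeros((400, 400), np.uint8)
for r in range(40, 200, 40):
    cv2.circle(img, (200, 200), r, 255, 2)

pith = (200, 200)                            # (x, y) pith estimate
polar = cv2.warpPolar(img, (256, 256), pith, 200, cv2.WARP_POLAR_LINEAR)
# Rows now sweep the angle around the pith; columns run from pith (left)
# to bark (right), so ring boundaries become near-vertical curves that
# can be segmented iteratively outwards.
```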
Generic event boundary detection (GEBD), inspired by humans' visual cognitive behavior of consistently segmenting videos into meaningful temporal chunks, finds utility in applications such as video editing. In this paper, we demonstrate that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios. We contribute to addressing this challenge by experimentally reexamining the architecture of GEBD models and uncovering several surprising findings. First, we reveal that a concise GEBD baseline model already achieves promising performance without any sophisticated design. Second, we find that the image-domain backbones widely applied in GEBD models can contain substantial architectural redundancy, motivating us to gradually ``modernize'' each component to enhance efficiency. Third, we show that GEBD models using image-domain backbones that conduct spatiotemporal learning in a spatial-then-temporal greedy manner can suffer from a distraction issue, which may be the culprit behind GEBD's inefficiency. Using a video-domain backbone to jointly conduct spatiotemporal modeling effectively resolves this issue. The outcome of our exploration is a family of GEBD models, named EfficientGEBD, which significantly outperforms previous SOTA methods with up to 1.7\% higher performance and a 280\% speedup under the same backbone. Our research prompts the community to design modern GEBD methods with model complexity in mind, particularly in resource-aware applications. The code is available at \url{this https URL}.
https://arxiv.org/abs/2407.12622
Generic event boundary detection (GEBD) aims at pinpointing event boundaries naturally perceived by humans, playing a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, spanning different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect various boundaries by the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more intelligent and reasonable way is to adaptively detect boundaries by considering their special properties. In light of this, we propose a novel dynamic pipeline for generic event boundaries named DyBDet. By introducing a multi-exit network architecture, DyBDet automatically learns the subnet allocation to different video snippets, enabling fine-grained detection for various boundaries. Besides, a multi-order difference detector is also proposed to ensure generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks, leading to obvious improvements in both performance and efficiency compared to the current state-of-the-art.
https://arxiv.org/abs/2407.04274
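A plausible reading of the multi-order difference detector named above: stack first- and higher-order temporal differences of frame features, so that progressively sharper changes (candidate boundaries) light up in higher orders. A sketch, not the paper's exact module:

```python
import torch

def multi_order_differences(feats, orders=(1, 2)):
    """feats: (T, D) frame features -> (len(orders), T, D), zero-padded."""
    outs = []
    for n in orders:
        d = feats
        for _ in range(n):
            d = torch.diff(d, dim=0)        # one more difference order
        pad = feats.new_zeros(n, feats.shape[1])
        outs.append(torch.cat([d, pad], dim=0))
    return torch.stack(outs)

feats = torch.randn(16, 256)
print(multi_order_differences(feats).shape)  # torch.Size([2, 16, 256])
```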
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best method in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at \url{this https URL}.
https://arxiv.org/abs/2407.02228
Moment retrieval aims to locate the most relevant moment in an untrimmed video based on a given natural language query. Existing solutions can be roughly categorized into moment-based and clip-based methods. The former often involves heavy computations, while the latter, due to overlooking coarse-grained information, typically underperforms compared to moment-based models. Hence, this paper proposes a novel 2-Dimensional Pointer-based Machine Reading Comprehension for Moment Retrieval Choice (2DP-2MRC) model to address the issue of imprecise localization in clip-based methods while maintaining lower computational complexity than moment-based methods. Specifically, we introduce an AV-Encoder to capture coarse-grained information at the moment and video levels. Additionally, a 2D pointer encoder module is introduced to further enhance boundary detection for the target moment. Extensive experiments on the HiREST dataset demonstrate that 2DP-2MRC significantly outperforms existing baseline models.
https://arxiv.org/abs/2406.06201
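A sketch of a 2-D pointer for boundary detection in moment retrieval: start and end logits over clips are combined into a score for every (start, end) pair, with invalid pairs masked. This is a generic MRC-style construction under stated assumptions, not the paper's exact module:

```python
import torch
from torch import nn

class PointerHead2D(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.start = nn.Linear(dim, 1)
        self.end = nn.Linear(dim, 1)

    def forward(self, clip_feats):                 # (B, T, D)
        s = self.start(clip_feats).squeeze(-1)     # (B, T) start logits
        e = self.end(clip_feats).squeeze(-1)       # (B, T) end logits
        scores = s.unsqueeze(2) + e.unsqueeze(1)   # (B, T, T) pair scores
        mask = torch.ones_like(scores).triu()      # keep end >= start
        return scores.masked_fill(mask == 0, float('-inf'))

feats = torch.randn(2, 20, 256)
print(PointerHead2D(256)(feats).shape)             # torch.Size([2, 20, 20])
```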
Computational morphology handles language processing at the word level. It is one of the foundational tasks in the NLP pipeline for developing higher-level NLP applications. It mainly deals with the processing of words and word forms. Computational morphology addresses various subproblems such as morpheme boundary detection, lemmatization, morphological feature tagging, and morphological reinflection. In this paper, we present an exhaustive survey of methods for developing computational morphology tools. We survey the literature in chronological order, from conventional methods to the recent evolution of deep neural network based approaches. We also review the existing datasets available for this task across languages. We discuss the effectiveness of neural models compared with traditional models and present some unique challenges associated with building computational morphology tools. We conclude by discussing some recent and open research issues in this field.
https://arxiv.org/abs/2406.05424
A key challenge for the widespread application of learning-based models for robotic perception is to significantly reduce the required amount of annotated training data while achieving accurate predictions. This is essential not only to decrease operating costs but also to speed up deployment time. In this work, we address this challenge for PAnoptic SegmenTation with fEw Labels (PASTEL) by exploiting the groundwork paved by visual foundation models. We leverage descriptive image features from such a model to train two lightweight network heads for semantic segmentation and object boundary detection, using very few annotated training samples. We then merge their predictions via a novel fusion module that yields panoptic maps based on normalized cut. To further enhance the performance, we utilize self-training on unlabeled images selected by a feature-driven similarity scheme. We underline the relevance of our approach by employing PASTEL to important robot perception use cases from autonomous driving and agricultural robotics. In extensive experiments, we demonstrate that PASTEL significantly outperforms previous methods for label-efficient segmentation even when using fewer annotations. The code of our work is publicly available at this http URL.
https://arxiv.org/abs/2405.19035
How to extract instance-level masks without instance-level supervision is the main challenge of weakly supervised instance segmentation (WSIS). Popular WSIS methods estimate a displacement field (DF) via learning inter-pixel relations and perform clustering to identify instances. However, the resulting instance centroids are inherently unstable and vary significantly across different clustering algorithms. In this paper, we propose Boundary-Assisted Instance Segmentation (BAISeg), a novel paradigm for WSIS that realizes instance segmentation with pixel-level annotations. BAISeg comprises an instance-aware boundary detection (IABD) branch and a semantic segmentation branch. The IABD branch identifies instances by predicting class-agnostic instance boundaries rather than instance centroids, and it therefore differs from previous DF-based approaches. In particular, we propose the Cascade Fusion Module (CFM) and Deep Mutual Attention (DMA) in the IABD branch to obtain rich contextual information and capture instance boundaries with weak responses. During the training phase, we employ Pixel-to-Pixel Contrast to enhance the discriminative capacity of the IABD branch, which further strengthens the continuity and closedness of the instance boundaries. Extensive experiments on PASCAL VOC 2012 and MS COCO demonstrate the effectiveness of our approach, which achieves considerable performance with only pixel-level annotations. The code will be available at this https URL.
https://arxiv.org/abs/2406.18558
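A generic sketch of the Pixel-to-Pixel Contrast signal mentioned above, in the usual supervised-InfoNCE form over sampled pixel embeddings (not BAISeg's exact formulation):

```python
import torch
import torch.nn.functional as F

def pixel_contrast(emb, labels, tau=0.1):
    """emb: (N, D) sampled pixel embeddings; labels: (N,) class ids.
    Same-label pixels are positives; all other pixels are negatives."""
    z = F.normalize(emb, dim=1)
    sim = z @ z.t() / tau                                    # (N, N)
    eye = torch.eye(len(z), dtype=torch.bool)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(eye, float('-inf')), dim=1, keepdim=True)
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss[pos.any(1)].mean()           # skip pixels with no positive

emb = torch.randn(64, 32)
labels = torch.randint(0, 4, (64,))
print(pixel_contrast(emb, labels))
```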
This paper explores the capabilities of Vision Language Models (VLMs) on spreadsheet comprehension. We propose three self-supervised challenges with corresponding evaluation metrics to comprehensively evaluate VLMs on Optical Character Recognition (OCR), spatial perception, and visual format recognition. Additionally, we utilize the spreadsheet table detection task to assess the overall performance of VLMs by integrating these challenges. To probe VLMs more finely, we propose three spreadsheet-to-image settings: column width adjustment, style change, and address augmentation. We propose variants of prompts to address the above tasks in different settings. Notably, to leverage the strengths of VLMs in understanding text rather than two-dimensional positioning, we propose to decode cell values on the four boundaries of the table in spreadsheet boundary detection. Our findings reveal that VLMs demonstrate promising OCR capabilities but produce unsatisfactory results due to cell omission and misalignment, and they exhibit notably insufficient spatial and format recognition skills. This motivates future work to enhance VLMs' spreadsheet data comprehension using our methods to generate extensive spreadsheet-image pairs in various settings.
https://arxiv.org/abs/2405.16234