Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long and unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and their corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks. Our method can leverage diverse YouTube videos to train instruction-following agents. The project page can be found at this https URL.
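As a rough illustration of the idea, the sketch below (Python, not from the paper) flags a boundary wherever a per-frame prediction error rises well above a local baseline; the window size and z-score threshold are assumed hyperparameters, and the errors themselves would come from the pretrained unconditional action-prediction model.

```python
# A minimal sketch of prediction-error-based boundary detection in the
# spirit of SBD: it assumes a per-frame prediction error from a pretrained
# unconditional action-prediction model and flags a boundary wherever the
# error rises sharply above a local baseline. Window and threshold are
# illustrative choices, not the paper's exact rule.
import numpy as np

def detect_skill_boundaries(errors: np.ndarray,
                            window: int = 16,
                            z_threshold: float = 3.0) -> list[int]:
    """Return frame indices where the prediction error spikes."""
    boundaries = []
    for t in range(window, len(errors)):
        baseline = errors[t - window:t]
        mu, sigma = baseline.mean(), baseline.std() + 1e-8
        if (errors[t] - mu) / sigma > z_threshold:
            boundaries.append(t)
    return boundaries

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic error curve: low error within a skill, a spike at frame 200.
    errors = rng.normal(0.1, 0.02, size=400)
    errors[200] += 0.5
    print(detect_skill_boundaries(errors))  # expect a boundary near frame 200
```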
https://arxiv.org/abs/2503.10684
Compared to conventional wheeled transportation systems designed for flat surfaces, soft robots exhibit exceptional adaptability to various terrains, enabling stable movement in complex environments. However, due to the risk of collision with obstacles and barriers, most soft robots rely on sensors for navigation in unstructured environments with uncertain boundaries. In this work, we present the WHERE-Bot, a wheel-less everting soft robot capable of omnidirectional locomotion. Our WHERE-Bot can navigate through unstructured environments by leveraging its structural and motion advantages rather than relying on sensors for boundary detection. By configuring a spring toy ``Slinky'' into a loop shape, the WHERE-Bot performs multiple rotational motions: spiral-rotating along the hub circumference, self-rotating around the hub's center, and orbiting around a certain point. The robot's trajectories can be reprogrammed by actively altering its mass distribution. The WHERE-Bot shows significant potential for boundary exploration in unstructured environments.
https://arxiv.org/abs/2503.07245
Temporal localization in untrimmed videos, which aims to identify specific timestamps, is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. Existing methods in each subfield are typically designed for specific tasks and lack generalizability across domains. In this paper, we propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks. First, our approach employs a simple yet effective one-stage localization model that supports text queries as input and multiple actions as output. Second, we jointly train the video encoder and localization model in an end-to-end manner. To efficiently process long videos, we introduce temporal chunking, enabling the handling of videos with over 30k frames. Third, we find that fine-tuning pre-trained text encoders with a multi-stage training strategy further enhances text-conditioned localization. TimeLoc achieves state-of-the-art results across multiple benchmarks: +1.3% and +1.9% mAP over previous best methods on THUMOS14 and EPIC-Kitchens-100, +1.1% on Kinetics-GEBD, +2.94% mAP on QVHighlights, and significant improvements in temporal video grounding (+11.5% on TACoS and +6.7% on Charades-STA under R1@0.5). Our code and checkpoints will be released at this https URL.
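The temporal chunking idea can be pictured with a small sketch: the frame range is split into overlapping chunks that fit in memory, and chunk-local predictions are shifted back to video-level frame indices. The chunk length and stride below are placeholder values, not TimeLoc's actual configuration.

```python
# A toy illustration of temporal chunking for long-video processing: the
# frame sequence is covered by fixed-size, overlapping chunks, and
# chunk-local (onset, offset) predictions are shifted back to global
# frame indices. Values are placeholders, not TimeLoc's settings.
def make_chunks(num_frames: int, chunk_len: int = 2048, stride: int = 1536):
    """Yield (start, end) frame ranges covering the whole video."""
    start = 0
    while start < num_frames:
        yield start, min(start + chunk_len, num_frames)
        if start + chunk_len >= num_frames:
            break
        start += stride

def to_global(chunk_start: int, local_segments):
    """Shift (onset, offset) predictions from chunk-local to video-level."""
    return [(chunk_start + s, chunk_start + e) for s, e in local_segments]

if __name__ == "__main__":
    chunks = list(make_chunks(30_000))          # a 30k-frame video
    print(len(chunks), chunks[:2], chunks[-1])
    print(to_global(chunks[3][0], [(10, 55)]))  # map one local prediction back
```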
https://arxiv.org/abs/2503.06526
Sound event detection (SED) has significantly benefited from self-supervised learning (SSL) approaches, particularly masked audio transformer for SED (MAT-SED), which leverages masked block prediction to reconstruct missing audio segments. However, while effective in capturing global dependencies, masked block prediction disrupts transient sound events and lacks explicit enforcement of temporal order, making it less suitable for fine-grained event boundary detection. To address these limitations, we propose JiTTER (Jigsaw Temporal Transformer for Event Reconstruction), an SSL framework designed to enhance temporal modeling in transformer-based SED. JiTTER introduces a hierarchical temporal shuffle reconstruction strategy, where audio sequences are randomly shuffled at both the block-level and frame-level, forcing the model to reconstruct the correct temporal order. This pretraining objective encourages the model to learn both global event structures and fine-grained transient details, improving its ability to detect events with sharp onset-offset characteristics. Additionally, we incorporate noise injection during block shuffle, providing a subtle perturbation mechanism that further regularizes feature learning and enhances model robustness. Experimental results on the DESED dataset demonstrate that JiTTER outperforms MAT-SED, achieving a 5.89% improvement in PSDS, highlighting the effectiveness of explicit temporal reasoning in SSL-based SED. Our findings suggest that structured temporal reconstruction tasks, rather than simple masked prediction, offer a more effective pretraining paradigm for sound event representation learning.
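A hedged sketch of what the hierarchical temporal shuffle pretext task might look like on a feature sequence: block order is permuted, frames inside some blocks are permuted as well, and light Gaussian noise is injected into the shuffled blocks. The block length, shuffle probability, and noise scale are illustrative assumptions rather than JiTTER's settings.

```python
# A rough sketch of hierarchical temporal shuffle as a pretext task:
# features are split into blocks, block order and (with some probability)
# within-block frame order are permuted, and Gaussian noise is injected
# into the shuffled blocks. Hyperparameters are illustrative assumptions.
import numpy as np

def hierarchical_shuffle(feats: np.ndarray,
                         block_len: int = 8,
                         frame_shuffle_prob: float = 0.5,
                         noise_std: float = 0.01,
                         seed: int = 0):
    """Return (shuffled_feats, block_permutation) for a (T, D) feature array."""
    rng = np.random.default_rng(seed)
    T, _ = feats.shape
    n_blocks = T // block_len
    blocks = [feats[i * block_len:(i + 1) * block_len].copy()
              for i in range(n_blocks)]
    perm = rng.permutation(n_blocks)               # block-level shuffle target
    shuffled = []
    for b in perm:
        block = blocks[b]
        if rng.random() < frame_shuffle_prob:      # frame-level shuffle
            block = block[rng.permutation(block_len)]
        block = block + rng.normal(0.0, noise_std, block.shape)  # noise injection
        shuffled.append(block)
    return np.concatenate(shuffled, axis=0), perm

if __name__ == "__main__":
    feats = np.arange(64, dtype=float).reshape(32, 2)  # 32 frames, 2-dim features
    shuffled, perm = hierarchical_shuffle(feats)
    print(shuffled.shape, perm)  # model would be trained to undo this ordering
```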
https://arxiv.org/abs/2502.20857
Existing supervised action segmentation methods depend on the quality of frame-wise classification using attention mechanisms or temporal convolutions to capture temporal dependencies. Even boundary detection-based methods primarily depend on the accuracy of an initial frame-wise classification, which can overlook precise identification of segments and boundaries in case of low-quality prediction. To address this problem, this paper proposes ASESM (Action Segmentation via Explicit Similarity Measurement) to enhance the segmentation accuracy by incorporating explicit similarity evaluation across frames and predictions. Our supervised learning architecture uses frame-level multi-resolution features as input to multiple Transformer encoders. The resulting multiple frame-wise predictions are used for similarity voting to obtain high quality initial prediction. We apply a newly proposed boundary correction algorithm that operates based on feature similarity between consecutive frames to adjust the boundary locations iteratively through the learning process. The corrected prediction is then further refined through multiple stages of temporal convolutions. As post-processing, we optionally apply boundary correction again followed by a segment smoothing method that removes outlier classes within segments using similarity measurement between consecutive predictions. Additionally, we propose a fully unsupervised boundary detection-correction algorithm that identifies segment boundaries based solely on feature similarity without any training. Experiments on 50Salads, GTEA, and Breakfast datasets show the effectiveness of both the supervised and unsupervised algorithms. Code and models are made available on Github.
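To make the feature-similarity idea concrete, here is a minimal, assumed version of boundary correction: each initial boundary is moved, within a small search window, to the frame pair with the lowest cosine similarity. The window size is a made-up hyperparameter, and the paper's algorithm applies this iteratively during training.

```python
# A simplified take on similarity-based boundary correction: each predicted
# boundary is moved, within a small window, to the position where the
# cosine similarity between consecutive frame features is lowest (i.e. the
# sharpest change). The window size is an assumed hyperparameter.
import numpy as np

def correct_boundaries(features: np.ndarray, boundaries: list[int],
                       window: int = 5) -> list[int]:
    """features: (T, D) frame features; boundaries: initial frame indices."""
    # Cosine similarity between each frame and the next one.
    a, b = features[:-1], features[1:]
    sim = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    corrected = []
    for t in boundaries:
        lo, hi = max(0, t - window), min(len(sim), t + window + 1)
        # The boundary between frames (k, k+1) corresponds to sim index k.
        corrected.append(lo + int(np.argmin(sim[lo:hi])))
    return corrected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = np.concatenate([np.tile(rng.normal(size=16), (40, 1)),
                            np.tile(rng.normal(size=16), (40, 1))])
    print(correct_boundaries(feats, boundaries=[37]))  # snaps to the change near frame 39/40
```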
https://arxiv.org/abs/2502.10713
Efficient use of cultivated areas is a necessary factor for sustainable development of agriculture and ensuring food security. Along with the rapid development of satellite technologies in developed countries, new methods are being searched for accurate and operational identification of cultivated areas. In this context, identification of cropland boundaries based on spectral analysis of data obtained from satellite images is considered one of the most optimal and accurate methods in modern agriculture. This article proposes a new approach to determine the suitability and green index of cultivated areas using satellite data obtained through the "Google Earth Engine" (GEE) platform. In this approach, two powerful algorithms, "SNIC (Simple Non-Iterative Clustering) Super Pixels" and "Canny Edge Detection Method", are combined. The SNIC algorithm combines pixels in a satellite image into larger regions (super pixels) with similar characteristics, thereby providing better image analysis. The Canny Edge Detection Method detects sharp changes (edges) in the image to determine the precise boundaries of agricultural fields. This study, carried out using high-resolution multispectral data from the Sentinel-2 satellite and the Google Earth Engine JavaScript API, has shown that the proposed method is effective in accurately and reliably classifying randomly selected agricultural fields. The combined use of these two tools allows for more accurate determination of the boundaries of agricultural fields by minimizing the effects of outliers in satellite images. As a result, more accurate and reliable maps can be created for agricultural monitoring and resource management over large areas based on the obtained data. This expands the application capabilities of cloud-based platforms and artificial intelligence methods in the agricultural field.
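A self-contained analogue of the pipeline can be sketched locally: superpixel segmentation followed by Canny edge detection on the region-averaged image. Note the substitutions: the study runs SNIC and Canny inside Google Earth Engine, whereas this sketch uses scikit-image's SLIC superpixels as a stand-in for SNIC, with illustrative parameters.

```python
# A local, self-contained analogue of the paper's pipeline: superpixel
# segmentation followed by Canny edge detection to outline field-like
# regions. SLIC superpixels stand in for SNIC here, and all parameters
# are illustrative rather than the study's Earth Engine settings.
import numpy as np
from skimage.segmentation import slic
from skimage.feature import canny
from skimage.color import rgb2gray

def field_boundaries(rgb_image: np.ndarray,
                     n_segments: int = 200,
                     compactness: float = 10.0,
                     sigma: float = 2.0):
    """Return (superpixel_labels, edge_mask) for an RGB image in [0, 1]."""
    labels = slic(rgb_image, n_segments=n_segments,
                  compactness=compactness, start_label=1)
    # Average each superpixel's intensity, then detect sharp transitions.
    gray = rgb2gray(rgb_image)
    mean_map = np.zeros_like(gray)
    for lab in np.unique(labels):
        mask = labels == lab
        mean_map[mask] = gray[mask].mean()
    edges = canny(mean_map, sigma=sigma)
    return labels, edges

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.random((128, 128, 3))
    img[:, 64:] *= 0.3          # synthetic "field" with a vertical boundary
    labels, edges = field_boundaries(img)
    print(labels.max(), edges.sum())
```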
https://arxiv.org/abs/2502.04529
Aspect Sentiment Triplet Extraction (ASTE) is a thriving research area with impressive outcomes being achieved on high-resource languages. However, the application of cross-lingual transfer to the ASTE task has been relatively unexplored, and current code-switching methods still suffer from term boundary detection issues and out-of-dictionary problems. In this study, we introduce a novel Test-Time Code-SWitching (TT-CSW) framework, which bridges the gap between the bilingual training phase and the monolingual test-time prediction. During training, a generative model is developed based on bilingual code-switched training data and can produce bilingual ASTE triplets for bilingual inputs. In the testing stage, we employ an alignment-based code-switching technique for test-time augmentation. Extensive experiments on cross-lingual ASTE datasets validate the effectiveness of our proposed method. We achieve an average improvement of 3.7% in terms of weighted-averaged F1 in four datasets with different languages. Additionally, we set a benchmark using ChatGPT and GPT-4, and demonstrate that even smaller generative models fine-tuned with our proposed TT-CSW framework surpass ChatGPT and GPT-4 by 14.2% and 5.0% respectively.
https://arxiv.org/abs/2501.14144
In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at this https URL.
https://arxiv.org/abs/2501.03711
Multi-class semantic segmentation remains a cornerstone challenge in computer vision. Yet, dataset creation remains excessively demanding in time and effort, especially for specialized domains. Active Learning (AL) mitigates this challenge by selecting data points for annotation strategically. However, existing patch-based AL methods often overlook the critical information carried by boundary pixels, which is essential for accurate segmentation. We present OREAL, a novel patch-based AL method designed for multi-class semantic segmentation. OREAL enhances boundary detection by employing maximum aggregation of pixel-wise uncertainty scores. Additionally, we introduce one-vs-rest entropy, a novel uncertainty score function that computes class-wise uncertainties while achieving implicit class balancing during dataset creation. Comprehensive experiments across diverse datasets and model architectures validate our hypothesis.
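One plausible reading of the scoring scheme is sketched below: a one-vs-rest (binary) entropy is computed per pixel and class from the softmax output, and pixel scores are max-aggregated within each patch so that patches containing uncertain boundary pixels rank highest. The exact score function and patch size used by OREAL may differ.

```python
# A hedged sketch of patch scoring: per-pixel, per-class one-vs-rest
# (binary) entropy from softmax probabilities, max-aggregated within each
# patch. The exact form of OREAL's score and the patch size are assumptions.
import numpy as np

def one_vs_rest_entropy(probs: np.ndarray) -> np.ndarray:
    """probs: (C, H, W) softmax output -> (C, H, W) binary entropies."""
    p = np.clip(probs, 1e-8, 1 - 1e-8)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def patch_scores(probs: np.ndarray, patch: int = 32) -> np.ndarray:
    """Max-aggregate pixel-wise uncertainty into (H//patch, W//patch) scores."""
    ent = one_vs_rest_entropy(probs).max(axis=0)      # most uncertain class per pixel
    H, W = ent.shape
    scores = ent[:H // patch * patch, :W // patch * patch]
    scores = scores.reshape(H // patch, patch, W // patch, patch)
    return scores.max(axis=(1, 3))                     # max aggregation per patch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(5, 128, 128))
    probs = np.exp(logits) / np.exp(logits).sum(0, keepdims=True)
    print(patch_scores(probs).shape)   # (4, 4) patch-level scores
```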
https://arxiv.org/abs/2412.06470
Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
https://arxiv.org/abs/2411.19772
Discontinuous Named Entity Recognition (DNER) presents a challenging problem where entities may be scattered across multiple non-adjacent tokens, making traditional sequence labelling approaches inadequate. Existing methods predominantly rely on custom tagging schemes to handle these discontinuous entities, resulting in models tightly coupled to specific tagging strategies and lacking generalisability across diverse datasets. To address these challenges, we propose TriG-NER, a novel Triplet-Grid Framework that introduces a generalisable approach to learning robust token-level representations for discontinuous entity extraction. Our framework applies triplet loss at the token level, where similarity is defined by word pairs existing within the same entity, effectively pulling together similar and pushing apart dissimilar ones. This approach enhances entity boundary detection and reduces the dependency on specific tagging schemes by focusing on word-pair relationships within a flexible grid structure. We evaluate TriG-NER on three benchmark DNER datasets and demonstrate significant improvements over existing grid-based architectures. These results underscore our framework's effectiveness in capturing complex entity structures and its adaptability to various tagging schemes, setting a new benchmark for discontinuous entity extraction.
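The token-level triplet objective can be illustrated with a tiny sketch: anchors and positives are tokens from the same entity, negatives are tokens outside it, and a margin loss pulls the former together. The margin value and the use of cosine distance are assumptions; the framework additionally mines these pairs from a grid of word-pair relations.

```python
# A small sketch of a token-level triplet objective: a positive is another
# token from the same entity as the anchor, a negative is a token outside
# it, and a margin loss pulls same-entity pairs together. Margin and
# distance choice are assumptions, not TriG-NER's exact configuration.
import numpy as np

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def token_triplet_loss(anchor, positive, negative, margin: float = 0.3) -> float:
    """Standard margin-based triplet loss on token representations."""
    return max(0.0, cosine_dist(anchor, positive) - cosine_dist(anchor, negative) + margin)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    entity_vec = rng.normal(size=64)
    anchor   = entity_vec + 0.05 * rng.normal(size=64)   # token inside the entity
    positive = entity_vec + 0.05 * rng.normal(size=64)   # another token of the same entity
    negative = rng.normal(size=64)                        # token outside the entity
    print(round(token_triplet_loss(anchor, positive, negative), 4))
```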
https://arxiv.org/abs/2411.01839
With the increasing use of generative models for text generation and the widespread presence of machine-generated texts across domains, distinguishing between human-written and machine-generated text is a significant challenge. While existing models and proprietary systems focus on identifying whether a given text is entirely human written or entirely machine generated, only a few systems provide sentence- or paragraph-level estimates of the likelihood of machine generation, and these are unreliable and work well only for a limited set of domains and generators. This paper introduces several reliable approaches for the novel task of identifying, at the word level, which parts of a given text are machine generated, and compares results from different approaches and methods. We present a comparison with proprietary systems and report our model's performance on texts from unseen domains and generators. The findings reveal significant improvements in detection accuracy, along with comparisons on other aspects of detection capability. Finally, we discuss potential avenues for improvement and the implications of our work. The proposed model is also well suited for detecting which parts of a text are machine generated in outputs of Instruct variants of many LLMs.
https://arxiv.org/abs/2410.16659
Achieving precise medical image segmentation is vital for effective treatment planning and accurate disease diagnosis. Traditional fully-supervised deep learning methods, though highly precise, are heavily reliant on large volumes of labeled data, which are often difficult to obtain due to the expertise required for medical annotations. This has led to the rise of semi-supervised learning approaches that utilize both labeled and unlabeled data to mitigate the label scarcity issue. In this paper, we introduce the Manifold-Aware Local Feature Modeling Network (MANet), which enhances the U-Net architecture by incorporating manifold supervision signals. This approach focuses on improving boundary accuracy, which is crucial for reliable medical diagnosis. To further extend the versatility of our method, we propose two variants: MA-Sobel and MA-Canny. The MA-Sobel variant employs the Sobel operator, which is effective for both 2D and 3D data, while the MA-Canny variant utilizes the Canny operator, specifically designed for 2D images, to refine boundary detection. These variants allow our method to adapt to various medical image modalities and dimensionalities, ensuring broader applicability. Our extensive experiments on datasets such as ACDC, LA, and Pancreas-NIH demonstrate that MANet consistently surpasses state-of-the-art methods in performance metrics like Dice and Jaccard scores. The proposed method also shows improved generalization across various semi-supervised segmentation networks, highlighting its robustness and effectiveness. Visual analysis of segmentation results confirms that MANet offers clearer and more accurate class boundaries, underscoring the value of manifold information in medical image segmentation.
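The MA-Sobel supervision signal can be pictured with a short sketch: the Sobel gradient magnitude of a (soft) segmentation mask highlights class boundaries. How the resulting map enters the loss is not shown here, and the normalization is an illustrative choice.

```python
# A minimal sketch of extracting a boundary map with the Sobel operator,
# in the spirit of the MA-Sobel supervision signal: the gradient magnitude
# of a (soft) segmentation mask highlights class boundaries. The
# normalization is an illustrative choice, not the paper's exact recipe.
import numpy as np
from scipy.ndimage import sobel

def sobel_boundary_map(mask: np.ndarray) -> np.ndarray:
    """mask: (H, W) probability map -> (H, W) boundary strength in [0, 1]."""
    gx = sobel(mask, axis=0)
    gy = sobel(mask, axis=1)
    mag = np.hypot(gx, gy)
    return mag / (mag.max() + 1e-8)

if __name__ == "__main__":
    mask = np.zeros((64, 64))
    mask[16:48, 16:48] = 1.0          # a square "organ"
    boundary = sobel_boundary_map(mask)
    print(boundary.shape, float(boundary.max()))
```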
https://arxiv.org/abs/2410.10287
With the development of video understanding, there is a proliferation of tasks for clip-level temporal video analysis, including temporal action detection (TAD), temporal action segmentation (TAS), and generic event boundary detection (GEBD). While task-specific video understanding models have exhibited outstanding performance in each task, there remains a dearth of a unified framework capable of simultaneously addressing multiple tasks, which is a promising direction for the next generation of AI. To this end, in this paper, we propose a single unified framework, coined as Temporal2Seq, to formulate the output of these temporal video understanding tasks as a sequence of discrete tokens. With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks. In the absence of multi-task learning (MTL) benchmarks, we compile a comprehensive co-training dataset by borrowing the datasets from TAD, TAS, and GEBD tasks. We evaluate our Temporal2Seq generalist model on the corresponding test sets of three tasks, demonstrating that Temporal2Seq can produce reasonable results on various tasks and achieve advantages compared with single-task training on this framework. We also investigate the generalization performance of our generalist model on new datasets from different tasks, which yields superior performance to the specific model.
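One possible way to picture the discrete-token formulation (an assumption about the scheme, not Temporal2Seq's actual vocabulary) is to quantize each segment's start and end into time bins and append a class token:

```python
# A toy illustration of casting temporal outputs as discrete tokens: each
# segment becomes (start_bin, end_bin, class_token) after quantizing
# timestamps into time bins. Bin count and token layout are assumptions.
def segments_to_tokens(segments, duration: float,
                       num_bins: int = 1000, num_classes: int = 20):
    """segments: list of (start_sec, end_sec, class_id) -> flat token list."""
    tokens = []
    for start, end, cls in segments:
        start_bin = min(int(start / duration * num_bins), num_bins - 1)
        end_bin = min(int(end / duration * num_bins), num_bins - 1)
        tokens += [start_bin, end_bin, num_bins + cls]  # class ids placed after time bins
    return tokens

if __name__ == "__main__":
    segs = [(3.2, 7.9, 4), (15.0, 21.5, 11)]
    print(segments_to_tokens(segs, duration=60.0))
```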
https://arxiv.org/abs/2409.18478
We look at the long-standing problem of segmenting unlabeled speech into word-like segments and clustering these into a lexicon. Several previous methods use a scoring model coupled with dynamic programming to find an optimal segmentation. Here we propose a much simpler strategy: we predict word boundaries using the dissimilarity between adjacent self-supervised features, then we cluster the predicted segments to construct a lexicon. For a fair comparison, we update the older ES-KMeans dynamic programming method with better features and boundary constraints. On the five-language ZeroSpeech benchmarks, our simple approach gives similar state-of-the-art results compared to the new ES-KMeans+ method, while being almost five times faster.
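The recipe is simple enough to sketch end to end, assuming precomputed self-supervised frame features: boundaries are placed at peaks of the adjacent-frame dissimilarity, segments are mean-pooled, and the pooled vectors are clustered into a lexicon. Peak-picking parameters and the number of clusters below are illustrative, not the paper's tuned values.

```python
# A compact sketch of the two-step recipe: (1) predict word boundaries at
# peaks of the dissimilarity between adjacent self-supervised features,
# (2) mean-pool each segment and cluster segments into a lexicon.
# Peak-picking parameters and cluster count are illustrative.
import numpy as np
from scipy.signal import find_peaks
from sklearn.cluster import KMeans

def segment_and_cluster(feats: np.ndarray, n_words: int = 2,
                        prominence: float = 0.1):
    """feats: (T, D) SSL features -> (boundaries, cluster id per segment)."""
    a, b = feats[:-1], feats[1:]
    dissim = 1.0 - (a * b).sum(1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    peaks, _ = find_peaks(dissim, prominence=prominence)
    bounds = [0, *(int(p) + 1 for p in peaks), len(feats)]
    segments = [feats[s:e].mean(0) for s, e in zip(bounds[:-1], bounds[1:])]
    labels = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit_predict(
        np.stack(segments))
    return bounds, labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B = rng.normal(size=16), rng.normal(size=16)
    feats = np.concatenate([np.tile(v, (30, 1)) for v in (A, B, A, B)])
    feats = feats + 0.01 * rng.normal(size=feats.shape)
    bounds, labels = segment_and_cluster(feats, n_words=2)
    print(bounds, labels)   # expect alternating cluster ids, e.g. [0, 1, 0, 1]
```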
https://arxiv.org/abs/2409.14486
Owing to the diverse geographical environments, intricate landscapes, and high-density settlements, the automatic identification of urban village boundaries using remote sensing images is a highly challenging task. This paper proposes a novel and efficient neural network model called UV-Mamba for accurate boundary detection in high-resolution remote sensing images. UV-Mamba mitigates the memory loss problem in long sequence modeling, which arises in state space model (SSM) with increasing image size, by incorporating deformable convolutions (DCN). Its architecture utilizes an encoder-decoder framework, includes an encoder with four deformable state space augmentation (DSSA) blocks for efficient multi-level semantic extraction and a decoder to integrate the extracted semantic information. We conducted experiments on the Beijing and Xi'an datasets, and the results show that UV-Mamba achieves state-of-the-art performance. Specifically, our model achieves 73.3% and 78.1% IoU on the Beijing and Xi'an datasets, respectively, representing improvements of 1.2% and 3.4% IoU over the previous best model, while also being 6x faster in inference speed and 40x smaller in parameter count. Source code and pre-trained models are available in the supplementary material.
https://arxiv.org/abs/2409.03431
This work presents the INBD network proposed by Gillert et al. in CVPR-2023 and studies its application for delineating tree rings in RGB images of Pinus taeda cross sections captured by a smartphone (UruDendro dataset), which are images with different characteristics from the ones used to train the method. The INBD network operates in two stages: first, it segments the background, pith, and ring boundaries. In the second stage, the image is transformed into polar coordinates, and ring boundaries are iteratively segmented from the pith to the bark. Both stages are based on the U-Net architecture. The method achieves an F-Score of 77.5, a mAR of 0.540, and an ARAND of 0.205 on the evaluation set. The code for the experiments is available at this https URL.
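The polar-coordinate step of the second stage can be illustrated with scikit-image (an assumed stand-in for the paper's implementation): once the pith is located, the cross section is unwrapped so that ring boundaries become curves running along the angle axis.

```python
# A brief sketch of the second-stage preprocessing: once the pith is
# located, the cross-section image is unwrapped into polar coordinates so
# that ring boundaries can be traced outward from the pith. The center and
# radius below are placeholders, not values from the paper.
import numpy as np
from skimage.transform import warp_polar

def unwrap_rings(gray: np.ndarray, center, radius: float) -> np.ndarray:
    """gray: (H, W) cross-section image -> polar image (angle x radius) around the pith."""
    # center follows scikit-image's (row, col) convention.
    return warp_polar(gray, center=center, radius=radius)

if __name__ == "__main__":
    yy, xx = np.mgrid[0:256, 0:256]
    rings = np.sin(0.3 * np.hypot(yy - 128, xx - 128))  # synthetic concentric rings
    polar = unwrap_rings(rings, center=(128, 128), radius=120)
    print(polar.shape)  # ring boundaries now run along the angle axis
```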
https://arxiv.org/abs/2408.14343
Generic event boundary detection (GEBD), inspired by human visual cognitive behaviors of consistently segmenting videos into meaningful temporal chunks, finds utility in various applications such as video editing. In this paper, we demonstrate that SOTA GEBD models often prioritize final performance over model complexity, resulting in low inference speed and hindering efficient deployment in real-world scenarios. We contribute to addressing this challenge by experimentally reexamining the architecture of GEBD models and uncovering several surprising findings. Firstly, we reveal that a concise GEBD baseline model already achieves promising performance without any sophisticated design. Secondly, we find that the widely applied image-domain backbones in GEBD models can contain plenty of architecture redundancy, motivating us to gradually ``modernize'' each component to enhance efficiency. Thirdly, we show that GEBD models that use image-domain backbones and conduct spatiotemporal learning in a spatial-then-temporal greedy manner can suffer from a distraction issue, which may be the main source of inefficiency in GEBD. Using a video-domain backbone to jointly conduct spatiotemporal modeling is an effective solution for this issue. The outcome of our exploration is a family of GEBD models, named EfficientGEBD, which significantly outperforms the previous SOTA methods by up to a 1.7\% performance gain and a 280\% speedup under the same backbone. Our research prompts the community to design modern GEBD methods with the consideration of model complexity, particularly in resource-aware applications. The code is available at \url{this https URL}.
https://arxiv.org/abs/2407.12622
Generic event boundary detection (GEBD) aims at pinpointing event boundaries naturally perceived by humans, playing a crucial role in understanding long-form videos. Given the diverse nature of generic boundaries, spanning different video appearances, objects, and actions, this task remains challenging. Existing methods usually detect various boundaries by the same protocol, regardless of their distinctive characteristics and detection difficulties, resulting in suboptimal performance. Intuitively, a more intelligent and reasonable way is to adaptively detect boundaries by considering their special properties. In light of this, we propose a novel dynamic pipeline for generic event boundaries named DyBDet. By introducing a multi-exit network architecture, DyBDet automatically learns the subnet allocation to different video snippets, enabling fine-grained detection for various boundaries. Besides, a multi-order difference detector is also proposed to ensure generic boundaries can be effectively identified and adaptively processed. Extensive experiments on the challenging Kinetics-GEBD and TAPOS datasets demonstrate that adopting the dynamic strategy significantly benefits GEBD tasks, leading to obvious improvements in both performance and efficiency compared to the current state-of-the-art.
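The term "multi-order difference detector" admits a simple interpretation, sketched below as an assumption rather than DyBDet's exact module: first- and second-order temporal differences of frame features capture gradual and abrupt changes, and their magnitudes can feed a boundary head.

```python
# A small sketch of one reading of "multi-order temporal differences":
# first- and second-order differences of frame features capture gradual
# and abrupt changes; their magnitudes could feed a boundary head. This is
# an interpretation of the term, not DyBDet's actual detector.
import numpy as np

def multi_order_differences(feats: np.ndarray, orders: int = 2):
    """feats: (T, D) -> list of per-frame difference magnitudes, one per order."""
    signals, cur = [], feats
    for _ in range(orders):
        cur = np.diff(cur, axis=0)                 # next-order temporal difference
        mag = np.linalg.norm(cur, axis=1)
        signals.append(np.pad(mag, (0, feats.shape[0] - mag.shape[0])))
    return signals

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = np.concatenate([np.zeros((20, 8)), np.ones((20, 8))])
    feats = feats + 0.01 * rng.normal(size=feats.shape)
    first, second = multi_order_differences(feats)
    print(int(first.argmax()), int(second.argmax()))  # both point near the change at frame 19/20
```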
https://arxiv.org/abs/2407.04274
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best method in the tasks of semantic segmentation, human parsing, and object boundary detection, respectively. The code is available at \url{this https URL}.
https://arxiv.org/abs/2407.02228