This paper presents a novel method for enhancing the adaptability of Proportional-Integral-Derivative (PID) controllers in industrial systems using event-based dynamic game theory, which enables the PID controllers to self-learn, optimize, and fine-tune themselves. In contrast to conventional self-learning approaches, our proposed framework offers an event-driven control strategy and game-theoretic learning algorithms. The players collaborate with the PID controllers to dynamically adjust their gains in response to set-point changes and disturbances. We provide a theoretical analysis showing sound convergence guarantees for the game given suitable stability ranges of the PID-controlled loop. We further introduce an automatic boundary detection mechanism, which helps the players find an optimal initialization of action spaces and significantly reduces the exploration time. The efficacy of this novel methodology is validated through its implementation in the temperature control loop of a printing press machine. The outcomes of the proposed intelligent self-tuning PID controllers are highly promising, particularly in terms of reducing overshoot and settling time.
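A minimal sketch of the event-driven tuning loop, in Python. Everything here is an illustrative assumption: the paper's game-theoretic learners are replaced by a bounded random search over the gains, the plant is a made-up first-order lag, and the hard-coded stability bounds stand in for the output of the automatic boundary detection mechanism.

```python
import random

class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_err = 0.0

    def step(self, err):
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv

def run_episode(gains, setpoint=1.0, steps=400, dt=0.05):
    """Simulate a toy first-order plant and return a squared-error cost."""
    pid, y, cost = PID(*gains, dt), 0.0, 0.0
    for _ in range(steps):
        u = pid.step(setpoint - y)
        y += dt * (-y + u)          # toy plant: dy/dt = -y + u
        cost += (setpoint - y) ** 2
    return cost

bounds = [(0.1, 5.0), (0.0, 2.0), (0.0, 1.0)]  # assumed stability ranges per gain
gains = [1.0, 0.1, 0.01]
best = run_episode(gains)
for _event in range(20):            # each set-point event triggers a learning step
    cand = [min(hi, max(lo, g + random.uniform(-0.2, 0.2)))
            for g, (lo, hi) in zip(gains, bounds)]
    cost = run_episode(cand)
    if cost < best:
        gains, best = cand, cost
print("tuned gains:", [round(g, 3) for g in gains], "cost:", round(best, 3))
```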
https://arxiv.org/abs/2506.13164
While reasoning-augmented large language models (RLLMs) significantly enhance complex task performance through extended reasoning chains, they inevitably introduce substantial unnecessary token consumption, particularly for simpler problems where Short Chain-of-Thought (Short CoT) suffices. This overthinking phenomenon leads to inefficient resource usage without proportional accuracy gains. To address this issue, we propose Self-Route, a dynamic reasoning framework that automatically selects between general and reasoning modes based on model capability estimation. Our approach introduces a lightweight pre-inference stage to extract capability-aware embeddings from hidden layer representations, enabling real-time evaluation of the model's ability to solve problems. We further construct Gradient-10K, a model difficulty estimation-based dataset with dense complexity sampling, to train the router for precise capability boundary detection. Extensive experiments demonstrate that Self-Route achieves comparable accuracy to reasoning models while reducing token consumption by 30-55% across diverse benchmarks. The proposed framework demonstrates consistent effectiveness across models with different parameter scales and reasoning paradigms, highlighting its general applicability and practical value.
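As a rough, hedged illustration of the routing idea (not the authors' implementation): a lightweight classifier over capability-aware embeddings picks the decoding mode. The 4096-dimensional embeddings and the solvability labels standing in for Gradient-10K are mocked with random data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 4096                           # assumed hidden-state embedding size
X = rng.normal(size=(1000, dim))     # stand-in capability-aware embeddings
y = rng.integers(0, 2, size=1000)    # 1 = Short CoT suffices (mock labels)

router = LogisticRegression(max_iter=1000).fit(X, y)

def route(embedding):
    """Pick a decoding mode for one query's pre-inference embedding."""
    return "general" if router.predict(embedding[None, :])[0] == 1 else "reasoning"

print(route(rng.normal(size=dim)))
```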
https://arxiv.org/abs/2505.20664
This paper proposes a tropical geometry-based edge detection framework that reformulates convolution and gradient computations using min-plus and max-plus algebra. The tropical formulation emphasizes dominant intensity variations, contributing to sharper and more continuous edge representations. Three variants are explored: an adaptive threshold-based method, a multi-kernel min-plus method, and a max-plus method emphasizing structural continuity. The framework integrates multi-scale processing, Hessian filtering, and wavelet shrinkage to enhance edge transitions while maintaining computational efficiency. Experiments on MATLAB's built-in grayscale and color images suggest that tropical formulations integrated with classical operators, such as Canny and LoG, can improve boundary detection in low-contrast and textured regions. Quantitative evaluation using standard edge metrics indicates favorable edge clarity and structural coherence. These results highlight the potential of tropical algebra as a scalable and noise-aware formulation for edge detection in practical image analysis tasks.
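A minimal sketch of the tropical ingredient, assuming the simplest all-zeros (tropical-unit) kernel: min-plus and max-plus "convolutions" then reduce to grayscale erosion and dilation, and their difference highlights dominant intensity transitions. The paper's kernels and multi-scale machinery are richer.

```python
import numpy as np

def tropical_conv(img, k=3, mode="min"):
    """Min-plus (erosion-like) or max-plus (dilation-like) filter with a zero kernel."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    op = np.min if mode == "min" else np.max
    out = np.empty_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = op(padded[i:i + k, j:j + k])
    return out

img = np.zeros((8, 8))
img[:, 4:] = 1.0                                    # vertical step edge
edges = tropical_conv(img, mode="max") - tropical_conv(img, mode="min")
print(edges.round(1))                               # nonzero band along the step
```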
https://arxiv.org/abs/2505.18625
Medical image segmentation is a pivotal task within the realms of medical image analysis and computer vision. While current methods have shown promise in accurately segmenting major regions of interest, the precise segmentation of boundary areas remains challenging. In this study, we propose a novel network architecture named CTO, which combines Convolutional Neural Networks (CNNs), Vision Transformer (ViT) models, and explicit edge detection operators to tackle this challenge. CTO surpasses existing methods in terms of segmentation accuracy and strikes a better balance between accuracy and efficiency, without the need for additional data inputs or label injections. Specifically, CTO adheres to the canonical encoder-decoder network paradigm, with a dual-stream encoder network comprising a mainstream CNN stream for capturing local features and an auxiliary StitchViT stream for integrating long-range dependencies. Furthermore, to enhance the model's ability to learn boundary areas, we introduce a boundary-guided decoder network that employs binary boundary masks generated by dedicated edge detection operators to provide explicit guidance during the decoding process. We validate the performance of CTO through extensive experiments conducted on six challenging medical image segmentation datasets, namely ISIC 2016, PH2, ISIC 2018, CoNIC, LiTS17, and BTCV. Our experimental results unequivocally demonstrate that CTO achieves state-of-the-art accuracy on these datasets while maintaining competitive model complexity. The code has been released at: this https URL.
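The boundary-guidance ingredient can be sketched independently of the network: an explicit edge operator turns a label mask into the binary boundary mask consumed by the decoder. Sobel is used below as a representative operator; the paper's exact operator configuration may differ.

```python
import numpy as np
from scipy import ndimage

def boundary_mask(label_mask, threshold=0.5):
    """Binary boundary mask from a segmentation mask via Sobel gradients."""
    m = label_mask.astype(float)
    gx = ndimage.sobel(m, axis=0)
    gy = ndimage.sobel(m, axis=1)
    return (np.hypot(gx, gy) > threshold).astype(np.uint8)

mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1                      # toy organ mask
print(boundary_mask(mask).sum(), "boundary pixels")
```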
https://arxiv.org/abs/2505.04652
Understanding actions within surgical workflows is essential for evaluating post-operative outcomes. However, capturing long sequences of actions performed in surgical settings poses challenges, as individual surgeons have their unique approaches shaped by their expertise, leading to significant variability. To tackle this complex problem, we focused on segmentation with precise boundaries, a demanding task due to the inherent variability in action durations and the subtle transitions often observed in untrimmed videos. These transitions, marked by ambiguous starting and ending points, complicate the segmentation process. Traditional models, such as MS-TCN, which depend on large receptive fields, frequently face challenges of over-segmentation (resulting in fragmented segments) or under-segmentation (merging distinct actions). Both of these issues negatively impact the quality of segmentation. To overcome these challenges, we present the Multi-Stage Boundary-Aware Transformer Network (MSBATN) with hierarchical sliding window attention, designed to enhance action segmentation. Our proposed approach incorporates a novel unified loss function that treats action classification and boundary detection as distinct yet interdependent tasks. Unlike traditional binary boundary detection methods, our boundary voting mechanism accurately identifies start and end points by leveraging contextual information. Extensive experiments using three challenging surgical datasets demonstrate the superior performance of the proposed method, achieving state-of-the-art results in F1 scores at thresholds of 25% and 50%, while also delivering comparable performance in other metrics.
https://arxiv.org/abs/2504.18756
The recent surge in open-source text-to-video generation models has significantly energized the research community, yet their dependence on proprietary training datasets remains a key constraint. While existing open datasets like Koala-36M employ algorithmic filtering of web-scraped videos from early platforms, they still lack the quality required for fine-tuning advanced video generation models. We present Tiger200K, a manually curated high visual quality video dataset sourced from User-Generated Content (UGC) platforms. By prioritizing visual fidelity and aesthetic quality, Tiger200K underscores the critical role of human expertise in data curation, and provides high-quality, temporally consistent video-text pairs for fine-tuning and optimizing video generation architectures through a simple but effective pipeline that includes shot boundary detection, OCR, border detection, motion filtering, and fine-grained bilingual captioning. The dataset will undergo ongoing expansion and be released as an open-source initiative to advance research and applications in video generative models. Project page: this https URL
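As one hedged example of a curation stage, the sketch below flags shot boundaries by thresholding the histogram distance between consecutive frames; the frames are mocked arrays, and the detector actually used for Tiger200K may differ.

```python
import numpy as np

def hist(frame, bins=32):
    h, _ = np.histogram(frame, bins=bins, range=(0, 255))
    return h / h.sum()

def shot_boundaries(frames, thresh=0.5):
    cuts, prev = [], hist(frames[0])
    for i, f in enumerate(frames[1:], start=1):
        cur = hist(f)
        if 0.5 * np.abs(cur - prev).sum() > thresh:  # total-variation distance
            cuts.append(i)
        prev = cur
    return cuts

rng = np.random.default_rng(1)
frames = [rng.integers(0, 60, (64, 64)) for _ in range(5)] + \
         [rng.integers(180, 255, (64, 64)) for _ in range(5)]  # hard cut at frame 5
print(shot_boundaries(frames))  # -> [5]
```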
https://arxiv.org/abs/2504.15182
Video Anomaly Detection (VAD) focuses on identifying anomalies within videos. Supervised methods require a substantial amount of in-domain training data and often struggle to generalize to unseen anomalies. In contrast, training-free methods leverage the intrinsic world knowledge of large language models (LLMs) to detect anomalies but face challenges in localizing fine-grained visual transitions and diverse events. Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs (MLLMs) through temporal-event reasoning. Specifically, EventVAD first employs dynamic spatiotemporal graph modeling with time-decay constraints to capture event-aware video features. Then, it performs adaptive noise filtering and uses signal-ratio thresholding to detect event boundaries via unsupervised statistical features. The statistical boundary detection module reduces the complexity of processing long videos for MLLMs and improves their temporal reasoning through event consistency. Finally, it utilizes a hierarchical prompting strategy to guide MLLMs in performing reasoning before determining final decisions. We conducted extensive experiments on the UCF-Crime and XD-Violence datasets. The results demonstrate that EventVAD with a 7B MLLM achieves state-of-the-art (SOTA) results in training-free settings, outperforming strong baselines that use 7B or larger MLLMs.
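The unsupervised boundary module can be approximated in a few lines: smooth consecutive-frame feature distances (a crude stand-in for the adaptive noise filtering) and mark boundaries where the signal-to-median ratio crosses a threshold. The features, window, and ratio below are illustrative assumptions.

```python
import numpy as np

def event_boundaries(features, ratio=3.0, win=5):
    diffs = np.linalg.norm(np.diff(features, axis=0), axis=1)
    smooth = np.convolve(diffs, np.ones(win) / win, mode="same")
    baseline = np.median(smooth) + 1e-8
    return np.where(smooth / baseline > ratio)[0] + 1   # frames after a jump

rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(0, 0.1, (50, 16)),
                        rng.normal(5, 0.1, (50, 16))])  # event change at t=50
print(event_boundaries(feats))   # indices clustered around 50
```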
https://arxiv.org/abs/2504.13092
We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning that enables detailed descriptions of user-selected objects through time. CAT-V integrates three key components: a Segmenter based on SAMURAI for precise object segmentation across frames, a Temporal Analyzer powered by TRACE-Uni for accurate event boundary detection and temporal analysis, and a Captioner using InternVL-2.5 for generating detailed object-centric descriptions. Through spatiotemporal visual prompts and chain-of-thought reasoning, our framework generates detailed, temporally-aware descriptions of objects' attributes, actions, statuses, interactions, and environmental contexts without requiring additional training data. CAT-V supports flexible user interactions through various visual prompts (points, bounding boxes, and irregular regions) and maintains temporal sensitivity by tracking object states and interactions across different time segments. Our approach addresses limitations of existing video captioning methods, which either produce overly abstract descriptions or lack object-level precision, enabling fine-grained, object-specific descriptions while maintaining temporal coherence and spatial accuracy. The GitHub repository for this project is available at this https URL
https://arxiv.org/abs/2504.05541
We present NUPunkt and CharBoundary, two sentence boundary detection libraries optimized for high-precision, high-throughput processing of legal text in large-scale applications such as due diligence, e-discovery, and legal research. These libraries address the critical challenges posed by legal documents containing specialized citations, abbreviations, and complex sentence structures that confound general-purpose sentence boundary detectors. Our experimental evaluation on five diverse legal datasets comprising over 25,000 documents and 197,000 annotated sentence boundaries demonstrates that NUPunkt achieves 91.1% precision while processing 10 million characters per second with modest memory requirements (432 MB). CharBoundary models offer balanced and adjustable precision-recall tradeoffs, with the large model achieving the highest F1 score (0.782) among all tested methods. Notably, NUPunkt provides a 29-32% precision improvement over general-purpose tools while maintaining exceptional throughput, processing multi-million document collections in minutes rather than hours. Both libraries run efficiently on standard CPU hardware without requiring specialized accelerators. NUPunkt is implemented in pure Python with zero external dependencies, while CharBoundary relies only on scikit-learn and optional ONNX runtime integration for optimized performance. Both libraries are available under the MIT license, can be installed via PyPI, and can be interactively tested at this https URL. These libraries address critical precision issues in retrieval-augmented generation systems by preserving coherent legal concepts across sentences. Each percentage-point improvement in precision yields exponentially greater reductions in context fragmentation, creating cascading benefits throughout retrieval pipelines and significantly enhancing downstream reasoning quality.
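To convey the shape of a CharBoundary-style model without reproducing its real feature set or API, here is a toy character-window classifier on top of scikit-learn; the training snippet, hand labels, and window size are all invented for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

def window(text, i, w=3):
    """Characters around a candidate boundary, mapped to ordinals (padded)."""
    chunk = text[max(0, i - w):i + w + 1].ljust(2 * w + 1)
    return [ord(c) for c in chunk]

train = "See Smith v. Jones, 5 U.S. 137 (1803). The court agreed. No. 12 is cited."
candidates = [i for i, c in enumerate(train) if c == "."]
labels = [0, 0, 0, 1, 1, 0, 1]   # hand labels: 1 = true sentence boundary

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit([window(train, i) for i in candidates], labels)

test = "Id. at 140. That ended the matter."
for i in (j for j, c in enumerate(test) if c == "."):
    print(i, clf.predict([window(test, i)])[0])
```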
https://arxiv.org/abs/2504.04131
This memoir explores two fundamental aspects of Natural Language Processing (NLP): the creation of linguistic resources and the evaluation of NLP system performance. Over the past decade, my work has focused on developing a morpheme-based annotation scheme for the Korean language that captures linguistic properties from morphology to semantics. This approach has achieved state-of-the-art results in various NLP tasks, including part-of-speech tagging, dependency parsing, and named entity recognition. Additionally, this work provides a comprehensive analysis of segmentation granularity and its critical impact on NLP system performance. In parallel with linguistic resource development, I have proposed a novel evaluation framework, the jp-algorithm, which introduces an alignment-based method to address challenges in preprocessing tasks like tokenization and sentence boundary detection (SBD). Traditional evaluation methods assume identical tokenization and sentence lengths between gold standards and system outputs, limiting their applicability to real-world data. The jp-algorithm overcomes these limitations, enabling robust end-to-end evaluations across a variety of NLP tasks. It enhances accuracy and flexibility by incorporating linear-time alignment while preserving the complexity of traditional evaluation metrics. This memoir provides key insights into the processing of morphologically rich languages, such as Korean, while offering a generalizable framework for evaluating diverse end-to-end NLP systems. My contributions lay the foundation for future developments, with broader implications for multilingual resource development and system evaluation.
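A hedged sketch of the alignment idea (not the jp-algorithm itself): aligning system tokens to gold tokens, here with Python's difflib, lets agreement be scored even when tokenization differs, which is exactly the case traditional identical-tokenization metrics cannot handle.

```python
from difflib import SequenceMatcher

gold = ["The", "court", "agreed", ".", "Case", "closed", "."]
sys_out = ["The", "court", "agreed.", "Case", "closed", "."]  # different tokenization

matcher = SequenceMatcher(a=gold, b=sys_out, autojunk=False)
aligned = sum(size for _, _, size in matcher.get_matching_blocks())
print(f"aligned {aligned}/{len(gold)} gold tokens despite mismatched tokenization")
```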
https://arxiv.org/abs/2504.01342
Semi-supervised semantic segmentation (SS-SS) aims to mitigate the heavy annotation burden of dense pixel labeling by leveraging abundant unlabeled images alongside a small labeled set. While current teacher-student consistency regularization methods achieve strong results, they often overlook a critical challenge: the precise delineation of object boundaries. In this paper, we propose BoundMatch, a novel multi-task SS-SS framework that explicitly integrates semantic boundary detection into the consistency regularization pipeline. Our core mechanism, Boundary Consistency Regularized Multi-Task Learning (BCRM), enforces prediction agreement between teacher and student models on both segmentation masks and detailed semantic boundaries. To further enhance performance and sharpen contours, BoundMatch incorporates two lightweight fusion modules: Boundary-Semantic Fusion (BSF) injects learned boundary cues into the segmentation decoder, while Spatial Gradient Fusion (SGF) refines boundary predictions using mask gradients, leading to higher-quality boundary pseudo-labels. This framework is built upon SAMTH, a strong teacher-student baseline featuring a Harmonious Batch Normalization (HBN) update strategy for improved stability. Extensive experiments on diverse datasets including Cityscapes, BDD100K, SYNTHIA, ADE20K, and Pascal VOC show that BoundMatch achieves competitive performance against state-of-the-art methods while significantly improving boundary-specific evaluation metrics. We also demonstrate its effectiveness in realistic large-scale unlabeled data scenarios and on lightweight architectures designed for mobile deployment.
https://arxiv.org/abs/2503.23519
Learning skills in open-world environments is essential for developing agents capable of handling a variety of tasks by combining basic skills. Online demonstration videos are typically long but unsegmented, making them difficult to segment and label with skill identifiers. Unlike existing methods that rely on sequence sampling or human labeling, we have developed a self-supervised learning-based approach to segment these long videos into a series of semantic-aware and skill-consistent segments. Drawing inspiration from human cognitive event segmentation theory, we introduce Skill Boundary Detection (SBD), an annotation-free temporal video segmentation algorithm. SBD detects skill boundaries in a video by leveraging prediction errors from a pretrained unconditional action-prediction model. This approach is based on the assumption that a significant increase in prediction error indicates a shift in the skill being executed. We evaluated our method in Minecraft, a rich open-world simulator with extensive gameplay videos available online. Our SBD-generated segments improved the average performance of conditioned policies by 63.7% and 52.1% on short-term atomic skill tasks, and their corresponding hierarchical agents by 11.3% and 20.8% on long-horizon tasks. Our method can leverage the diverse YouTube videos to train instruction-following agents. The project page can be found in this https URL.
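The core rule admits a very small sketch: mark a skill boundary wherever the action-prediction error jumps well above its running statistics. The error series and the 3-sigma threshold below are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def skill_boundaries(errors, k=3.0):
    mu, sigma = errors.mean(), errors.std() + 1e-8
    return np.where(errors > mu + k * sigma)[0]

rng = np.random.default_rng(0)
errors = rng.normal(1.0, 0.1, 300)   # mock prediction-error series
errors[[80, 190]] += 5.0             # two skill switches
print(skill_boundaries(errors))      # -> [ 80 190]
```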
https://arxiv.org/abs/2503.10684
Compared to conventional wheeled transportation systems designed for flat surfaces, soft robots exhibit exceptional adaptability to various terrains, enabling stable movement in complex environments. However, due to the risk of collision with obstacles and barriers, most soft robots rely on sensors for navigation in unstructured environments with uncertain boundaries. In this work, we present the WHERE-Bot, a wheel-less everting soft robot capable of omnidirectional locomotion. Our WHERE-Bot can navigate through unstructured environments by leveraging its structural and motion advantages rather than relying on sensors for boundary detection. By configuring a spring toy "Slinky" into a loop shape, the WHERE-Bot performs multiple rotational motions: spiral-rotating along the hub circumference, self-rotating around the hub's center, and orbiting around a certain point. The robot's trajectories can be reprogrammed by actively altering its mass distribution. The WHERE-Bot shows significant potential for boundary exploration in unstructured environments.
https://arxiv.org/abs/2503.07245
Temporal localization in untrimmed videos, which aims to identify specific timestamps, is crucial for video understanding but remains challenging. This task encompasses several subtasks, including temporal action localization, temporal video grounding, moment retrieval, and generic event boundary detection. Existing methods in each subfield are typically designed for specific tasks and lack generalizability across domains. In this paper, we propose TimeLoc, a unified end-to-end framework for timestamp localization that can handle multiple tasks. First, our approach employs a simple yet effective one-stage localization model that supports text queries as input and multiple actions as output. Second, we jointly train the video encoder and localization model in an end-to-end manner. To efficiently process long videos, we introduce temporal chunking, enabling the handling of videos with over 30k frames. Third, we find that fine-tuning pre-trained text encoders with a multi-stage training strategy further enhances text-conditioned localization. TimeLoc achieves state-of-the-art results across multiple benchmarks: +1.3% and +1.9% mAP over previous best methods on THUMOS14 and EPIC-Kitchens-100, +1.1% on Kinetics-GEBD, +2.94% mAP on QVHighlights, and significant improvements in temporal video grounding (+11.5% on TACoS and +6.7% on Charades-STA under R1@0.5). Our code and checkpoints will be released at this https URL.
https://arxiv.org/abs/2503.06526
Sound event detection (SED) has significantly benefited from self-supervised learning (SSL) approaches, particularly masked audio transformer for SED (MAT-SED), which leverages masked block prediction to reconstruct missing audio segments. However, while effective in capturing global dependencies, masked block prediction disrupts transient sound events and lacks explicit enforcement of temporal order, making it less suitable for fine-grained event boundary detection. To address these limitations, we propose JiTTER (Jigsaw Temporal Transformer for Event Reconstruction), an SSL framework designed to enhance temporal modeling in transformer-based SED. JiTTER introduces a hierarchical temporal shuffle reconstruction strategy, where audio sequences are randomly shuffled at both the block-level and frame-level, forcing the model to reconstruct the correct temporal order. This pretraining objective encourages the model to learn both global event structures and fine-grained transient details, improving its ability to detect events with sharp onset-offset characteristics. Additionally, we incorporate noise injection during block shuffle, providing a subtle perturbation mechanism that further regularizes feature learning and enhances model robustness. Experimental results on the DESED dataset demonstrate that JiTTER outperforms MAT-SED, achieving a 5.89% improvement in PSDS, highlighting the effectiveness of explicit temporal reasoning in SSL-based SED. Our findings suggest that structured temporal reconstruction tasks, rather than simple masked prediction, offer a more effective pretraining paradigm for sound event representation learning.
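The pretraining corruption can be sketched under assumed block size, shuffle probability, and noise scale: shuffle block order, shuffle frames inside some blocks, and inject small noise; the transformer (not shown) would then be trained to restore the original order.

```python
import numpy as np

def hierarchical_shuffle(x, block=10, frame_prob=0.5, noise=0.01, rng=None):
    rng = rng or np.random.default_rng()
    blocks = [x[i:i + block].copy() for i in range(0, len(x), block)]
    rng.shuffle(blocks)                        # block-level shuffle
    for b in blocks:
        if rng.random() < frame_prob:
            rng.shuffle(b)                     # frame-level shuffle
        b += rng.normal(0, noise, b.shape)     # noise injection during shuffle
    return np.concatenate(blocks)

frames = np.arange(40, dtype=float)[:, None]   # toy 40-frame feature sequence
shuffled = hierarchical_shuffle(frames, rng=np.random.default_rng(0))
print(shuffled.ravel().round(2))
```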
https://arxiv.org/abs/2502.20857
Existing supervised action segmentation methods depend on the quality of frame-wise classification, using attention mechanisms or temporal convolutions to capture temporal dependencies. Even boundary detection-based methods primarily depend on the accuracy of an initial frame-wise classification and can miss precise segments and boundaries when that initial prediction is of low quality. To address this problem, this paper proposes ASESM (Action Segmentation via Explicit Similarity Measurement) to enhance segmentation accuracy by incorporating explicit similarity evaluation across frames and predictions. Our supervised learning architecture uses frame-level multi-resolution features as input to multiple Transformer encoders. The resulting multiple frame-wise predictions are used for similarity voting to obtain a high-quality initial prediction. We apply a newly proposed boundary correction algorithm that operates on feature similarity between consecutive frames to adjust boundary locations iteratively throughout the learning process. The corrected prediction is then further refined through multiple stages of temporal convolutions. As post-processing, we optionally apply boundary correction again, followed by a segment smoothing method that removes outlier classes within segments using similarity measurement between consecutive predictions. Additionally, we propose a fully unsupervised boundary detection-correction algorithm that identifies segment boundaries based solely on feature similarity, without any training. Experiments on the 50Salads, GTEA, and Breakfast datasets show the effectiveness of both the supervised and unsupervised algorithms. Code and models are made available on GitHub.
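The boundary-correction step has a compact core: nudge each predicted boundary to the nearby frame pair with the lowest cosine similarity, on the assumption that true action changes sit at similarity dips. The paper interleaves this with training; the search radius and features below are illustrative.

```python
import numpy as np

def correct_boundary(features, b, radius=5):
    """Move boundary b to the similarity minimum within +/- radius frames."""
    lo, hi = max(1, b - radius), min(len(features) - 1, b + radius)
    sims = [np.dot(features[i - 1], features[i]) /
            (np.linalg.norm(features[i - 1]) * np.linalg.norm(features[i]) + 1e-8)
            for i in range(lo, hi + 1)]
    return lo + int(np.argmin(sims))

rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 0.05, (30, 8)) + 1,
                   rng.normal(0, 0.05, (30, 8)) - 1])  # true change at frame 30
print(correct_boundary(feats, b=27))                   # -> 30
```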
https://arxiv.org/abs/2502.10713
Efficient use of cultivated areas is a necessary factor for the sustainable development of agriculture and for ensuring food security. Along with the rapid development of satellite technologies in developed countries, new methods are being sought for accurate and timely identification of cultivated areas. In this context, identification of cropland boundaries based on spectral analysis of data obtained from satellite images is considered one of the most optimal and accurate methods in modern agriculture. This article proposes a new approach to determine the suitability and green index of cultivated areas using satellite data obtained through the "Google Earth Engine" (GEE) platform. In this approach, two powerful algorithms, "SNIC (Simple Non-Iterative Clustering) Super Pixels" and the "Canny Edge Detection Method", are combined. The SNIC algorithm groups the pixels of a satellite image into larger regions (superpixels) with similar characteristics, thereby enabling better image analysis. The Canny Edge Detection Method detects sharp changes (edges) in the image to determine the precise boundaries of agricultural fields. This study, carried out using high-resolution multispectral data from the Sentinel-2 satellite and the Google Earth Engine JavaScript API, has shown that the proposed method is effective in accurately and reliably classifying randomly selected agricultural fields. The combined use of these two tools allows for more accurate determination of the boundaries of agricultural fields by minimizing the effects of outliers in satellite images. As a result, more accurate and reliable maps can be created for agricultural monitoring and resource management over large areas, expanding the application capabilities of cloud-based platforms and artificial intelligence methods in agriculture.
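A local, hedged stand-in for the pipeline (the paper runs the actual SNIC and Canny operators on Sentinel-2 imagery inside Google Earth Engine): scikit-image's SLIC substitutes for SNIC below, since skimage ships SLIC but not SNIC, and the input band is mocked.

```python
import numpy as np
from skimage.segmentation import slic
from skimage.feature import canny

rng = np.random.default_rng(0)
band = rng.normal(0.2, 0.02, (64, 64))   # mock vegetation-index-like band
band[:, 32:] += 0.5                      # two adjacent "fields"

superpixels = slic(band, n_segments=20, channel_axis=None)  # SNIC stand-in
edges = canny(band, sigma=1.0)                              # sharp field boundaries
print(len(np.unique(superpixels)), "regions;", int(edges.sum()), "edge pixels")
```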
https://arxiv.org/abs/2502.04529
Aspect Sentiment Triplet Extraction (ASTE) is a thriving research area with impressive outcomes being achieved on high-resource languages. However, the application of cross-lingual transfer to the ASTE task has been relatively unexplored, and current code-switching methods still suffer from term boundary detection issues and out-of-dictionary problems. In this study, we introduce a novel Test-Time Code-SWitching (TT-CSW) framework, which bridges the gap between the bilingual training phase and the monolingual test-time prediction. During training, a generative model is developed based on bilingual code-switched training data and can produce bilingual ASTE triplets for bilingual inputs. In the testing stage, we employ an alignment-based code-switching technique for test-time augmentation. Extensive experiments on cross-lingual ASTE datasets validate the effectiveness of our proposed method. We achieve an average improvement of 3.7% in terms of weighted-averaged F1 in four datasets with different languages. Additionally, we set a benchmark using ChatGPT and GPT-4, and demonstrate that even smaller generative models fine-tuned with our proposed TT-CSW framework surpass ChatGPT and GPT-4 by 14.2% and 5.0% respectively.
https://arxiv.org/abs/2501.14144
In this paper, we introduce an unsupervised approach for Speech Segmentation, which builds on previously researched approaches, e.g., Speaker Diarization, while being applicable to an inclusive set of acoustic-semantic distinctions, paving a path towards a general Unsupervised Speech Segmentation approach. Unlike traditional speech and audio segmentation, which mainly focuses on spectral changes in the input signal, e.g., phone segmentation, our approach tries to segment the spoken utterance into chunks with differing acoustic-semantic styles, focusing on acoustic-semantic information that does not translate well into text, e.g., emotion or speaker. While most Speech Segmentation tasks only handle one style change, e.g., emotion diarization, our approach tries to handle multiple acoustic-semantic style changes. Leveraging recent advances in Speech Language Models (SLMs), we propose a simple unsupervised method to segment a given speech utterance. We empirically demonstrate the effectiveness of the proposed approach by considering several setups. Results suggest that the proposed method is superior to the evaluated baselines on boundary detection, segment purity, and over-segmentation. Code is available at this https URL.
https://arxiv.org/abs/2501.03711
Multi-class semantic segmentation remains a cornerstone challenge in computer vision. Yet, dataset creation remains excessively demanding in time and effort, especially for specialized domains. Active Learning (AL) mitigates this challenge by strategically selecting data points for annotation. However, existing patch-based AL methods often overlook the critical information carried by boundary pixels, which is essential for accurate segmentation. We present OREAL, a novel patch-based AL method designed for multi-class semantic segmentation. OREAL enhances boundary detection by employing maximum aggregation of pixel-wise uncertainty scores. Additionally, we introduce one-vs-rest entropy, a novel uncertainty score function that computes class-wise uncertainties while achieving implicit class balancing during dataset creation. Comprehensive experiments across diverse datasets and model architectures validate our hypothesis.
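The one-vs-rest entropy score with maximum aggregation can be written down directly; the reduction below (per-class binary entropy per pixel, then a max over the patch) follows the abstract's description, with exact details possibly differing in the paper.

```python
import numpy as np

def ovr_entropy(probs):
    """probs: (H, W, C) softmax output -> (C,) patch-level class uncertainties."""
    p = np.clip(probs, 1e-8, 1 - 1e-8)
    ent = -(p * np.log(p) + (1 - p) * np.log(1 - p))     # one-vs-rest entropy
    return ent.reshape(-1, probs.shape[-1]).max(axis=0)  # max aggregation over pixels

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 8, 4))                      # mock patch logits, 4 classes
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
print("class-wise uncertainty:", ovr_entropy(probs).round(3))
```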
https://arxiv.org/abs/2412.06470