Microanastomosis is a critical surgical skill in neurosurgery, where the ability to manipulate fine instruments precisely is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the need for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging a self-similarity matrix for action boundary detection together with unsupervised clustering; and (3) a supervised classification module that evaluates surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations. These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.
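The self-similarity step in component (2) is the easiest piece to make concrete. Below is a minimal sketch, assuming per-frame tip features from the tracking module: it builds a cosine self-similarity matrix and scores boundary candidates with a Gaussian-tapered checkerboard kernel slid along the diagonal (Foote-style novelty). The feature choice, kernel size, and peak-picking rule are illustrative assumptions, not the paper's exact design.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def novelty_from_ssm(features, kernel_size=16):
    """Checkerboard-novelty boundary scoring on a self-similarity matrix.

    features: (T, D) array of per-frame descriptors, e.g. instrument-tip
    coordinates and velocities from the YOLO/DeepSORT tracking module.
    """
    # Cosine self-similarity matrix (T x T).
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    ssm = f @ f.T

    # Gaussian-tapered checkerboard kernel: + on same-segment blocks,
    # - on cross-segment blocks, so boundaries show up as novelty peaks.
    half = kernel_size // 2
    idx = np.arange(-half, half)
    taper = np.exp(-0.5 * (idx / (half / 2.0)) ** 2)
    kernel = np.outer(taper, taper) * np.sign(np.outer(idx, idx))

    T = len(features)
    novelty = np.zeros(T)
    padded = np.pad(ssm, half, mode="constant")
    for t in range(T):
        patch = padded[t:t + kernel_size, t:t + kernel_size]
        novelty[t] = np.sum(patch * kernel)

    novelty = gaussian_filter1d(np.maximum(novelty, 0), sigma=2)
    # Candidate action boundaries = local maxima of the novelty curve,
    # which then feed the unsupervised clustering stage.
    peaks = [t for t in range(1, T - 1)
             if novelty[t] > novelty[t - 1] and novelty[t] >= novelty[t + 1]]
    return novelty, peaks
```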
https://arxiv.org/abs/2512.23942
Dialogue topic segmentation supports summarization, retrieval, memory management, and conversational continuity. Despite decades of prior work, evaluation practice in dialogue topic segmentation remains dominated by strict boundary matching and F1-based metrics, even as modern LLM-based conversational systems increasingly rely on segmentation to manage conversation history beyond the model's fixed context window, where unstructured context accumulation degrades efficiency and coherence. This paper introduces an evaluation objective for dialogue topic segmentation that treats boundary density and segment coherence as primary criteria, alongside window-tolerant F1 (W-F1). Through extensive cross-dataset empirical evaluation, we show that reported performance differences across dialogue segmentation benchmarks are driven not by model quality but by annotation granularity mismatches and sparse boundary labels. This indicates that many reported improvements arise from evaluation artifacts rather than improved boundary detection. We evaluated multiple structurally distinct dialogue segmentation strategies across eight dialogue datasets spanning task-oriented, open-domain, meeting-style, and synthetic interactions. Across these settings, we observe high segment coherence combined with extreme oversegmentation relative to sparse labels, producing misleadingly low exact-match F1 scores. We show that topic segmentation is best understood as selecting an appropriate granularity rather than predicting a single correct boundary set, and we operationalize this view by explicitly separating boundary scoring from boundary selection.
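To make the W-F1 criterion concrete, here is one common formulation of window-tolerant F1, in which a predicted boundary counts as a hit if an unmatched reference boundary lies within a tolerance window; the paper's exact matching rule and window size may differ.

```python
def window_f1(pred, gold, window=2):
    """Window-tolerant F1 (W-F1) over boundary indices.

    pred, gold: sorted lists of boundary positions (a boundary sits after
    utterance i). A prediction is correct if an unmatched reference
    boundary lies within +/- `window` utterances.
    """
    matched, tp = set(), 0
    for p in pred:
        # Greedily match to the nearest unmatched reference in range.
        candidates = [g for g in gold if abs(g - p) <= window and g not in matched]
        if candidates:
            matched.add(min(candidates, key=lambda g: abs(g - p)))
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# Dense predictions against sparse labels: exact-match F1 would score the
# near misses at 3 and 21 as zero, while W-F1 credits them.
print(window_f1(pred=[3, 9, 15, 21], gold=[4, 20]))  # ~0.667
```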
https://arxiv.org/abs/2512.17083
Annotating medical images demands significant time and expertise, often requiring pathologists to invest hundreds of hours in labeling mammary epithelial nuclei datasets. We address this critical challenge by achieving a 95.5% Dice score using just 599 training images for breast cell segmentation, where only 4% of pixels represent breast tissue and 60% of images contain no breast regions. Our framework uses quantum-inspired edge enhancement: multi-scale Gabor filters create a fourth input channel that sharpens boundary detection where inter-annotator variation reaches +/- 3 pixels. We present a stabilized multi-component loss function that integrates adaptive Dice loss with boundary-aware terms and automatic positive weighting to address severe class imbalance, where mammary epithelial cell regions comprise only 0.1%-20% of the total image area. Additionally, a complexity-based weighted sampling strategy is introduced to prioritize the challenging mammary epithelial cell regions. The model employs an EfficientNet-B7/UNet++ architecture with a 4-to-3 channel projection, enabling the use of pretrained weights despite limited medical imaging data. Finally, robust validation is achieved through exponential moving averaging and statistical outlier detection, ensuring reliable performance estimates on a small validation set (129 images). Our framework achieves a Dice score of 95.5% +/- 0.3% and an IoU of 91.2% +/- 0.4%. Notably, quantum-based enhancement contributes a 2.1% improvement in boundary accuracy, while weighted sampling increases small lesion detection by 3.8%. By achieving this level of performance with limited annotations, our approach significantly reduces the medical expert time required for dataset creation, addressing a fundamental bottleneck in clinical perception AI development.
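The fourth-channel construction is straightforward to prototype. The sketch below builds a multi-scale Gabor edge-energy map with OpenCV and stacks it onto the RGB input ahead of the 4-to-3 channel projection; the wavelengths, orientations, and max-aggregation are illustrative choices rather than the paper's exact configuration.

```python
import cv2
import numpy as np

def add_gabor_channel(rgb):
    """Append a multi-scale Gabor edge-energy map as a 4th input channel.

    rgb: (H, W, 3) uint8 image. Returns (H, W, 4) float32 in [0, 1].
    """
    gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY).astype(np.float32) / 255.0
    energy = np.zeros_like(gray)
    for lambd in (4.0, 8.0, 16.0):                    # wavelengths in pixels
        for theta in np.arange(0, np.pi, np.pi / 4):  # 4 orientations
            kernel = cv2.getGaborKernel(
                ksize=(31, 31), sigma=lambd / 2.0, theta=theta,
                lambd=lambd, gamma=0.5, psi=0)
            response = cv2.filter2D(gray, cv2.CV_32F, kernel)
            energy = np.maximum(energy, np.abs(response))  # strongest edge cue
    energy /= energy.max() + 1e-8
    rgb01 = rgb.astype(np.float32) / 255.0
    return np.dstack([rgb01, energy])
```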
https://arxiv.org/abs/2512.02302
Generic Event Boundary Detection (GEBD) aims to identify moments in videos that humans perceive as event boundaries. This paper proposes a novel method for the task, called Structured Context Learning, which introduces the Structured Partition of Sequence (SPoS) to provide a structured context for learning temporal information. Our approach is end-to-end trainable and flexible, not restricted to specific temporal models such as GRUs, LSTMs, or Transformers. This flexibility enables our method to achieve a better speed-accuracy trade-off. Specifically, we apply SPoS to partition the input frame sequence and provide a structured context for the subsequent temporal model. Notably, the overall computational complexity of SPoS is linear in the video length. We then compute group similarities to capture differences between frames, and a lightweight fully convolutional network determines the event boundaries from the grouped similarity maps. To remedy the ambiguity of boundary annotations, we apply a Gaussian kernel to soften the ground-truth event boundaries. Our method has been extensively evaluated on the challenging Kinetics-GEBD, TAPOS, and shot transition detection datasets, demonstrating its superiority over existing state-of-the-art methods.
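The Gaussian preprocessing of the ground truth amounts to label softening: hard 0/1 boundary frames become a smooth target so near-boundary frames are not treated as hard negatives. A minimal sketch (the sigma and renormalization are assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def soften_boundaries(num_frames, boundary_frames, sigma=1.0):
    """Smooth hard boundary annotations with a Gaussian kernel so that
    frames adjacent to an annotated boundary receive partial credit."""
    target = np.zeros(num_frames, dtype=np.float32)
    target[boundary_frames] = 1.0
    target = gaussian_filter1d(target, sigma=sigma)
    return target / (target.max() + 1e-8)  # peaks renormalized to 1

# A boundary annotated at frame 5 now also softly supervises frames 3-7.
print(np.round(soften_boundaries(10, [5]), 3))
```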
https://arxiv.org/abs/2512.00475
This paper presents a multimodal approach for continuous sign recognition that first uses machine learning to detect the start and end frames of signs in videos of American Sign Language (ASL) sentences, and then recognizes the segmented signs. For improved robustness, we use 3D skeletal features extracted from sign language videos to capture the convergence of sign properties and their dynamics, which tend to cluster at sign boundaries. Another focus of this work is the incorporation of information from 3D handshape for boundary detection. To detect handshapes normally expected at the beginning and end of signs, we pretrain a handshape classifier for 87 linguistically defined canonical handshape categories using a dataset that we created by integrating and normalizing several existing datasets. A multimodal fusion module is then used to unify the pretrained sign video segmentation framework and the handshape classification models. Finally, the estimated boundaries are used for sign recognition, where the recognition model is trained on a large database containing both citation-form isolated signs and signs pre-segmented (based on manual annotations) from continuous signing, as such signs often differ in certain respects. We evaluate our method on the ASLLRP corpus and demonstrate significant improvements over previous work.
https://arxiv.org/abs/2511.19907
Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution-agnostic approach for large-scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large-scale Field Boundary Instance Segmentation-22M (FBIS-22M) dataset, with a structured post-processing, merging, and vectorization sequence to generate topologically consistent vector boundaries. FBIS-22M, the largest dataset of its kind, contains 672,909 multi-resolution image patches (0.25-10 m) and 22.9 million validated field instances. The DelAny model delivers state-of-the-art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero-shot generalization and supports national-scale applications: using Sentinel-2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000 km2) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25-1 ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5 m resolution and 5.15M at 2.5 m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national-scale vector outputs, and dataset is available at this https URL.
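As a rough illustration of the post-processing and vectorization stage, the sketch below polygonizes a merged instance mask with rasterio and shapely, repairs invalid rings, and drops slivers below 0.25 ha; DelAnyFlow's actual merging and topology handling is more involved, and the threshold and repair strategy here are assumptions.

```python
import numpy as np
import rasterio.features
from shapely.geometry import shape

def vectorize_fields(instance_mask, transform, min_area=2500.0):
    """Turn an instance mask into vector field polygons.

    instance_mask: (H, W) integer array, 0 = background.
    transform: affine mapping pixel -> projected coordinates (meters).
    min_area: minimum polygon area in m^2 (2500 m^2 = 0.25 ha).
    """
    polygons = []
    for geom, value in rasterio.features.shapes(
            instance_mask.astype(np.int32), transform=transform):
        if value == 0:
            continue
        poly = shape(geom).buffer(0)  # fix self-intersections
        if poly.area >= min_area:
            polygons.append((int(value), poly))
    return polygons
```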
https://arxiv.org/abs/2511.13417
Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
https://arxiv.org/abs/2511.11124
Temporal action localization requires precise boundary detection; however, current methods apply uniform computation despite significant variations in difficulty across boundaries. We present two complementary contributions. First, Boundary Distance Regression (BDR) provides information-theoretically optimal localization through signed-distance regression rather than classification, achieving 43% sharper boundary peaks. BDR retrofits to existing methods with approximately 50 lines of code, yielding consistent 1.8 to 3.1% mAP@0.7 improvements across diverse architectures. Second, Adaptive Temporal Refinement (ATR) allocates computation via continuous depth selection τ ∈ [0, 1], enabling end-to-end differentiable optimization without reinforcement learning. On THUMOS14, ATR achieves 56.5% mAP@0.7 at 162G FLOPs, compared to 53.6% at 198G FLOPs for uniform processing, a 2.9% improvement with 18% less compute. Gains scale with boundary heterogeneity, showing a 4.2% improvement on short actions. Training cost is mitigated via knowledge distillation, with lightweight students retaining 99% of performance at baseline cost. Results are validated across four benchmarks with rigorous statistical testing.
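The core of BDR is the regression target, which is simple enough to show inline. A minimal construction, under the assumption that the paper's exact parameterization may differ: each frame regresses its signed distance to the nearest boundary, clipped so distant frames saturate, and boundaries are read off at inference as zero crossings of the predicted field.

```python
import numpy as np

def signed_distance_targets(num_frames, boundaries, clip=16):
    """Per-frame target: signed distance (in frames) to the nearest action
    boundary, clipped to [-clip, clip]; negative before a boundary,
    positive after. Train with a regression loss (e.g., Huber)."""
    boundaries = np.asarray(sorted(boundaries), dtype=np.float32)
    t = np.arange(num_frames, dtype=np.float32)
    diffs = t[:, None] - boundaries[None, :]            # (T, B)
    nearest = np.argmin(np.abs(diffs), axis=1)
    signed = diffs[np.arange(num_frames), nearest]
    return np.clip(signed, -clip, clip)

# Zero exactly at the annotated boundaries (frames 4 and 9), linear nearby.
print(signed_distance_targets(num_frames=12, boundaries=[4, 9]))
```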
https://arxiv.org/abs/2511.03943
Boundary Vector Cells (BVCs) are a class of neurons in the brains of vertebrates that encode environmental boundaries at specific distances and allocentric directions, playing a central role in forming place fields in the hippocampus. Most computational BVC models are restricted to two-dimensional (2D) environments, making them prone to spatial ambiguities in the presence of horizontal symmetries in the environment. To address this limitation, we incorporate vertical angular sensitivity into the BVC framework, thereby enabling robust boundary detection in three dimensions, and leading to significantly more accurate spatial localization in a biologically-inspired robot model. The proposed model processes LiDAR data to capture vertical contours, thereby disambiguating locations that would be indistinguishable under a purely 2D representation. Experimental results show that in environments with minimal vertical variation, the proposed 3D model matches the performance of a 2D baseline; yet, as 3D complexity increases, it yields substantially more distinct place fields and markedly reduces spatial aliasing. These findings show that adding a vertical dimension to BVC-based localization can significantly enhance navigation and mapping in real-world 3D spaces while retaining performance parity in simpler, near-planar scenarios.
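Standard BVC models give each cell a Gaussian tuning over boundary distance and allocentric horizontal direction; the 3D extension adds a third tuning over vertical angle. The sketch below follows that textbook form, summing the product of tunings over LiDAR returns; the tuning widths, the distance-proportional width, and the coordinate conventions are assumptions rather than the paper's exact model.

```python
import numpy as np

def bvc_rate_3d(points, d_pref, phi_pref, psi_pref,
                sigma_d=0.2, sigma_phi=0.2, sigma_psi=0.2):
    """Firing rate of one 3D boundary vector cell for a LiDAR scan.

    points: (N, 3) boundary points (x, y, z), already rotated into an
    allocentric frame. The cell prefers horizontal distance d_pref,
    horizontal direction phi_pref, and vertical angle psi_pref.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    d_xy = np.hypot(x, y)                      # horizontal distance
    phi = np.arctan2(y, x)                     # allocentric direction
    psi = np.arctan2(z, d_xy)                  # vertical angle (the 3D term)
    wrap = lambda a: np.angle(np.exp(1j * a))  # wrap to [-pi, pi]
    rate = (np.exp(-0.5 * ((d_xy - d_pref) / (sigma_d * d_pref + 1e-8)) ** 2)
            * np.exp(-0.5 * (wrap(phi - phi_pref) / sigma_phi) ** 2)
            * np.exp(-0.5 * (wrap(psi - psi_pref) / sigma_psi) ** 2))
    # Two rooms with identical 2D outlines but different vertical contours
    # now drive this cell differently, reducing spatial aliasing.
    return rate.sum()
```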
https://arxiv.org/abs/2510.24029
Video anomaly detection (VAD) focuses on identifying anomalies in videos. Supervised methods demand substantial in-domain training data and fail to deliver clear explanations for anomalies. In contrast, training-free methods leverage the knowledge reserves and language interactivity of large pre-trained models to detect anomalies. However, current fixed-length temporal window sampling approaches struggle to accurately capture anomalies with varying temporal spans. Therefore, we propose VADTree, which utilizes a Hierarchical Granularity-aware Tree (HGTree) structure for flexible sampling in VAD. VADTree leverages the knowledge embedded in a pre-trained Generic Event Boundary Detection (GEBD) model to characterize potential anomaly event boundaries. Specifically, VADTree decomposes the video into generic event nodes based on boundary confidence, and performs adaptive coarse-fine hierarchical structuring and redundancy removal to construct the HGTree. Then, multi-dimensional priors are injected into the visual language models (VLMs) to enhance node-wise anomaly perception, and anomaly reasoning for generic event nodes is achieved via large language models (LLMs). Finally, an inter-cluster node correlation method integrates the multi-granularity anomaly scores. Extensive experiments on three challenging datasets demonstrate that VADTree achieves state-of-the-art performance in training-free settings while drastically reducing the number of sampled video segments. The code will be available at this https URL.
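A simplified stand-in for the HGTree construction, assuming per-frame GEBD boundary confidences: coarse event nodes split at high-confidence boundaries, and each coarse node gains finer children at medium-confidence boundaries inside it, giving the VLM/LLM stages nodes at two granularities. The thresholds and fixed two-level depth are illustrative only.

```python
def event_nodes(confidences, coarse=0.7, fine=0.4):
    """Build two-granularity event nodes from boundary confidences.

    confidences: per-frame boundary confidence from a GEBD model.
    Returns a list of coarse nodes, each with finer child spans.
    """
    def split(lo, hi, thresh):
        cuts = [t for t in range(lo + 1, hi) if confidences[t] >= thresh]
        edges = [lo] + cuts + [hi]
        return list(zip(edges[:-1], edges[1:]))

    tree = []
    for lo, hi in split(0, len(confidences), coarse):
        # Inside a coarse node, remaining cuts are medium-confidence only.
        tree.append({"span": (lo, hi), "children": split(lo, hi, fine)})
    return tree
```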
https://arxiv.org/abs/2510.22693
Accurate segmentation of 3D medical images is critical for clinical applications like disease assessment and treatment planning. While the Segment Anything Model 2 (SAM2) has shown remarkable success in video object segmentation by leveraging temporal cues, its direct application to 3D medical images faces two fundamental domain gaps: 1) the bidirectional anatomical continuity between slices contrasts sharply with the unidirectional temporal flow in videos, and 2) precise boundary delineation, crucial for morphological analysis, is often underexplored in video tasks. To bridge these gaps, we propose SAM2-3dMed, an adaptation of SAM2 for 3D medical imaging. Our framework introduces two key innovations: 1) a Slice Relative Position Prediction (SRPP) module explicitly models bidirectional inter-slice dependencies by guiding SAM2 to predict the relative positions of different slices in a self-supervised manner; 2) a Boundary Detection (BD) module enhances segmentation accuracy along critical organ and tissue boundaries. Extensive experiments on three diverse medical datasets (the Lung, Spleen, and Pancreas in the Medical Segmentation Decathlon (MSD) dataset) demonstrate that SAM2-3dMed significantly outperforms state-of-the-art methods, achieving superior performance in segmentation overlap and boundary precision. Our approach not only advances 3D medical image segmentation performance but also offers a general paradigm for adapting video-centric foundation models to spatial volumetric data.
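The SRPP pretext task reduces to pair sampling: draw two slices from a volume and supervise the model with their signed, normalized offset, which is what forces a bidirectional notion of inter-slice order. The offset range and target scaling below are assumptions; the actual SRPP head in the paper may be defined differently.

```python
import random

def srpp_pairs(num_slices, num_pairs=8, max_offset=8):
    """Self-supervised (slice_i, slice_j, target) triples for relative
    slice-position prediction; the sign of the target encodes direction."""
    pairs = []
    for _ in range(num_pairs):
        i = random.randrange(num_slices)
        j = min(max(i + random.randint(-max_offset, max_offset), 0),
                num_slices - 1)
        pairs.append((i, j, (j - i) / max_offset))  # target in [-1, 1]
    return pairs

print(srpp_pairs(num_slices=64, num_pairs=4))
```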
https://arxiv.org/abs/2510.08967
Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods require processing complete video frames to make predictions, unlike humans, who process data online and in real time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), which aims to detect boundaries of generic events immediately in streaming videos. This task faces the unique challenge of identifying subtle, taxonomy-free event changes in real time, without access to future frames. To tackle this challenge, we propose a novel On-GEBD framework, Estimator, inspired by Event Segmentation Theory (EST), which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA) and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame reflecting current event dynamics based solely on prior frames. The OBD then measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. Experimental results demonstrate that Estimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline GEBD methods on the Kinetics-GEBD and TAPOS datasets.
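The OBD component reduces to outlier testing on a stream of prediction errors. A minimal sketch using a z-score rule over a sliding error history, with the statistics frozen at detected boundaries so a spike does not inflate its own threshold; the paper's actual statistical test and hyperparameters may differ.

```python
import numpy as np
from collections import deque

class OnlineBoundaryDiscriminator:
    """Flag an event boundary when the anticipator's prediction error is a
    statistical outlier relative to recent within-event errors."""

    def __init__(self, history=64, z_thresh=3.0, warmup=8):
        self.errors = deque(maxlen=history)
        self.z_thresh = z_thresh
        self.warmup = warmup

    def step(self, predicted_frame, actual_frame):
        err = float(np.mean((predicted_frame - actual_frame) ** 2))
        is_boundary = False
        if len(self.errors) >= self.warmup:
            mu = np.mean(self.errors)
            sd = np.std(self.errors) + 1e-8
            is_boundary = (err - mu) / sd > self.z_thresh
        if not is_boundary:        # adapt only on within-event frames
            self.errors.append(err)
        return is_boundary
```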
https://arxiv.org/abs/2510.06855
Automatic chord recognition (ACR) via deep learning models has gradually achieved promising recognition accuracy, yet two key challenges remain. First, prior work has focused primarily on audio-domain ACR, while symbolic-music (e.g., score) ACR has received limited attention due to data scarcity. Second, existing methods still overlook strategies aligned with human music-analysis practice. To address these challenges, we make two contributions: (1) we introduce POP909-CL, an enhanced version of the POP909 dataset with tempo-aligned content and human-corrected labels of chords, beats, keys, and time signatures; and (2) we propose BACHI, a symbolic chord recognition model that decomposes the task into distinct decision steps, namely boundary detection and iterative ranking of chord root, quality, and bass (inversion). This mechanism mirrors human ear-training practice. Experiments demonstrate that BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks, with ablation studies validating the effectiveness of each module.
https://arxiv.org/abs/2510.06528
Current methods for Music Structure Analysis (MSA) focus primarily on audio data. While symbolic music can be synthesized into audio and analyzed using existing MSA techniques, such an approach does not exploit symbolic music's rich explicit representation of pitch, timing, and instrumentation. A key subproblem of MSA is section boundary detection: determining whether a given point in time marks the transition between musical sections. In this paper, we study automatic section boundary detection for symbolic music. First, we introduce a human-annotated MIDI dataset for section boundary detection, consisting of metadata from 6134 MIDI files that we manually curated from the Lakh MIDI dataset. Second, we train a deep learning model to classify the presence of section boundaries within a fixed-length musical window. Our data representation involves a novel encoding scheme based on synthesized overtones to encode arbitrary MIDI instrumentations into 3-channel piano rolls. Our model achieves an F1 score of 0.77, improving over the analogous audio-based supervised learning approach and the unsupervised block-matching segmentation (CBM) audio approach by 0.22 and 0.31, respectively. We release our dataset, code, and models.
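The overtone encoding can be illustrated independently of MIDI parsing: each note also activates its harmonic series, mapped to the nearest semitone, with decaying weight, so arbitrary instrumentations land on a fixed pitch-by-time grid. The harmonic count, decay, and frame rate here are assumptions, and the paper spreads this representation over 3 channels.

```python
import numpy as np

def overtone_roll(notes, num_frames, fps=8, num_harmonics=4, decay=0.6):
    """Piano roll where each (pitch, start_sec, end_sec, velocity) note also
    lights up its overtones: harmonic k sits 12*log2(k) semitones up."""
    roll = np.zeros((128, num_frames), dtype=np.float32)
    offsets = [int(round(12 * np.log2(k))) for k in range(1, num_harmonics + 1)]
    for pitch, start, end, velocity in notes:
        f0, f1 = int(start * fps), min(int(end * fps), num_frames)
        for h, off in enumerate(offsets):
            p = pitch + off
            if p < 128:
                roll[p, f0:f1] = np.maximum(roll[p, f0:f1],
                                            (velocity / 127.0) * decay ** h)
    return roll

# A C4 (pitch 60) also faintly activates C5 (72), G5 (79), and C6 (84).
print(np.nonzero(overtone_roll([(60, 0.0, 1.0, 100)], num_frames=8)[:, 0])[0])
```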
https://arxiv.org/abs/2509.16566
Federated Retrieval (FR) routes queries across multiple external knowledge sources to mitigate LLM hallucinations when the necessary external knowledge is distributed across those sources. However, existing methods struggle to retrieve high-quality, relevant documents for ambiguous queries, especially in cross-domain scenarios, which significantly limits their effectiveness in supporting downstream generation tasks. Inspired by dynamic information flow (DIF), we propose DFAMS, a novel framework that leverages DIF to identify latent query intents and construct semantically aligned knowledge partitions for accurate retrieval across heterogeneous sources. Specifically, DFAMS probes the DIF in LLMs by leveraging gradient signals from a few annotated queries and employing Shapley value-based attribution to trace neuron activation paths associated with intent recognition and subdomain boundary detection. DFAMS then uses DIF to train an alignment module via multi-prototype contrastive learning, enabling fine-grained intra-source modeling and inter-source semantic alignment across knowledge bases. Experimental results across five benchmarks show that DFAMS outperforms advanced FR methods by up to 14.37% in knowledge classification accuracy, 5.38% in retrieval recall, and 6.45% in downstream QA accuracy, demonstrating its effectiveness in complex FR scenarios.
https://arxiv.org/abs/2508.20353
Non-destructive 3D imaging of large multi-particulate samples is essential for quantifying particle-level properties, such as size, shape, and spatial distribution, across applications in mining, materials science, and geology. However, accurate instance segmentation of particles in tomographic data remains challenging due to high morphological variability and frequent particle contact, which limit the effectiveness of classical methods like watershed algorithms. While supervised deep learning approaches offer improved performance, they rely on extensive annotated datasets that are labor-intensive, error-prone, and difficult to scale. In this work, we propose self-validated learning, a novel self-training framework for particle instance segmentation that eliminates the need for manual annotations. Our method leverages implicit boundary detection and iteratively refines the training set by identifying particles that can be consistently matched across reshuffled scans of the same sample. This self-validation mechanism mitigates the impact of noisy pseudo-labels, enabling robust learning from unlabeled data. After just three iterations, our approach accurately segments over 97% of the total particle volume and identifies more than 54,000 individual particles in tomographic scans of quartz fragments. Importantly, the framework also enables fully autonomous model evaluation without the need for ground truth annotations, as confirmed through comparisons with state-of-the-art instance segmentation techniques. The method is integrated into the Biomedisa image analysis platform (this https URL).
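The self-validation step can be approximated as one-to-one overlap matching between instance maps of the same sample from two reshuffled scans (registered to a common frame): only consistently matched particles survive into the next round's training labels. The IoU threshold and dominant-overlap matching rule here are assumptions.

```python
import numpy as np

def stable_particles(labels_a, labels_b, iou_thresh=0.9):
    """Return (label_a, label_b, iou) for particles that match consistently
    across two instance segmentations of the same registered volume."""
    stable = []
    for la in np.unique(labels_a):
        if la == 0:                          # 0 = background
            continue
        mask_a = labels_a == la
        overlap = labels_b[mask_a]
        overlap = overlap[overlap != 0]
        if overlap.size == 0:
            continue
        lb = np.bincount(overlap).argmax()   # dominant counterpart instance
        mask_b = labels_b == lb
        iou = (np.logical_and(mask_a, mask_b).sum()
               / np.logical_or(mask_a, mask_b).sum())
        if iou >= iou_thresh:
            stable.append((int(la), int(lb), float(iou)))
    return stable
```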
https://arxiv.org/abs/2508.16224
Generic event boundary detection (GEBD) aims to identify natural boundaries in a video, segmenting it into distinct and meaningful chunks. Despite the inherent subjectivity of event boundaries, previous methods have focused on deterministic predictions, overlooking the diversity of plausible solutions. In this paper, we introduce a novel diffusion-based boundary detection model, dubbed DiffGEBD, that tackles GEBD from a generative perspective. The proposed model encodes relevant changes across adjacent frames via temporal self-similarity and then iteratively decodes random noise into plausible event boundaries, conditioned on the encoded features. Classifier-free guidance allows the degree of diversity to be controlled during denoising. In addition, we introduce a new evaluation metric that assesses the quality of predictions in terms of both diversity and fidelity. Experiments show that our method achieves strong performance on two standard benchmarks, Kinetics-GEBD and TAPOS, generating diverse and plausible event boundaries.
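The diversity control is standard classifier-free guidance. A sketch of one denoising step, where `model(x, t, cond)` is a hypothetical denoiser interface rather than the paper's actual API: the conditional and unconditional noise estimates are blended, and the guidance weight trades diversity against fidelity to the encoded self-similarity features.

```python
import torch

@torch.no_grad()
def cfg_denoise_step(model, noisy_boundaries, t, video_features, guidance=2.0):
    """One classifier-free-guidance step for boundary generation.

    guidance = 0 gives the unconditional (most diverse) estimate; larger
    values pull samples toward the video-conditioned prediction.
    """
    eps_cond = model(noisy_boundaries, t, cond=video_features)
    eps_uncond = model(noisy_boundaries, t, cond=None)
    eps = eps_uncond + guidance * (eps_cond - eps_uncond)
    return eps  # passed to the sampler's update rule (e.g., DDIM)
```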
https://arxiv.org/abs/2508.12084
With ever-increasing data volumes, it is essential to develop automated approaches for identifying nanoscale defects in transmission electron microscopy (TEM) images. However, compared to features in conventional photographs, nanoscale defects in TEM images exhibit far greater variation due to complex contrast mechanisms and intricate defect structures. These challenges often result in much less labeled data and higher rates of annotation error, posing significant obstacles to improving machine learning model performance for TEM image analysis. To address these limitations, we examined transfer learning, leveraging large models pre-trained on natural images. We demonstrated that with a pre-trained encoder and L2 regularization, semantically complex features are ignored in favor of simpler, more reliable cues, substantially improving model performance. However, this improvement cannot be captured by conventional evaluation metrics such as the F1-score, which can be skewed by human annotation errors treated as ground truth. Instead, we introduced novel evaluation metrics that are independent of annotation accuracy. Using grain boundary detection in UO2 TEM images as a case study, we found that our approach led to a 57% improvement in defect detection rate, a robust and holistic measure of model performance on the TEM dataset used in this work. Finally, we showed that model self-confidence is achieved only through transfer learning and fine-tuning of very deep layers.
https://arxiv.org/abs/2507.16779
Polyp segmentation is vital for early colorectal cancer detection, yet traditional fully supervised methods struggle with morphological variability and domain shifts, requiring frequent retraining. Additionally, reliance on large-scale annotations is a major bottleneck due to the time-consuming and error-prone nature of polyp boundary labeling. Recently, vision foundation models like Segment Anything Model (SAM) have demonstrated strong generalizability and fine-grained boundary detection with sparse prompts, effectively addressing key polyp segmentation challenges. However, SAM's prompt-dependent nature limits automation in medical applications, since manually inputting prompts for each image is labor-intensive and time-consuming. We propose OP-SAM, a One-shot Polyp segmentation framework based on SAM that automatically generates prompts from a single annotated image, ensuring accurate and generalizable segmentation without additional annotation burdens. Our method introduces Correlation-based Prior Generation (CPG) for semantic label transfer and Scale-cascaded Prior Fusion (SPF) to adapt to polyp size variations as well as filter out noisy transfers. Instead of dumping all prompts at once, we devise Euclidean Prompt Evolution (EPE) for iterative prompt refinement, progressively enhancing segmentation quality. Extensive evaluations across five datasets validate OP-SAM's effectiveness. Notably, on Kvasir, it achieves 76.93% IoU, surpassing the state-of-the-art by 11.44%.
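A loose reading of the Euclidean Prompt Evolution step, with the error map and spacing rule as assumptions: each iteration places the next point prompt at the most erroneous pixel that is sufficiently far, in Euclidean distance, from every existing prompt, so successive prompts correct new regions instead of re-querying the same spot.

```python
import numpy as np

def next_prompt(error_map, prompts, min_dist=20):
    """Pick the next SAM point prompt from a per-pixel error map.

    error_map: (H, W) disagreement between the current mask and the prior
    (e.g., from CPG/SPF). prompts: list of (y, x) already-placed prompts.
    """
    h, w = error_map.shape
    order = np.argsort(error_map, axis=None)[::-1]      # highest error first
    ys, xs = np.unravel_index(order, (h, w))
    for y, x in zip(ys, xs):
        if all(np.hypot(y - py, x - px) >= min_dist for py, px in prompts):
            return (int(y), int(x))
    return None  # every high-error region already has a nearby prompt
```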
https://arxiv.org/abs/2507.16337
Audio-based music structure analysis (MSA) is an essential task in Music Information Retrieval that remains challenging due to the complexity and variability of musical form. Recent advances highlight the potential of fine-tuning pre-trained music foundation models for MSA tasks. However, these models are typically trained with high temporal feature resolution and short audio windows, which limits their efficiency and introduces bias when applied to long-form audio. This paper presents a temporal adaptation approach for fine-tuning music foundation models tailored to MSA. Our method enables efficient analysis of full-length songs in a single forward pass by incorporating two key strategies: (1) audio window extension and (2) low-resolution adaptation. Experiments on the Harmonix Set and RWC-Pop datasets show that our method significantly improves both boundary detection and structural function prediction, while maintaining comparable memory usage and inference speed.
https://arxiv.org/abs/2507.13572