Automated analysis of peripheral blood smears for Acute Lymphoblastic Leukemia (ALL) is hindered by low contrast and substantial variability in cytoplasmic appearance, which complicate conventional membrane-based segmentation. We found that many recent approaches rely on heavy neural architectures and extensive training, yet still struggle to generalize across staining and acquisition variability. To address these limitations, we propose the Perinuclear Ring-based Image Segmentation Method (PRISM), which replaces explicit cytoplasmic delineation with adaptive concentric zones constructed around the nucleus. These perinuclear regions enable the extraction of robust cytoplasmic descriptors by integrating color information with texture statistics derived from grey-level co-occurrence patterns, without requiring accurate cell-boundary detection. A calibrated stacking ensemble of traditional classifiers leverages these descriptors to achieve high performance, with an accuracy of 98.46% and a precision-recall AUC of 0.9937.
https://arxiv.org/abs/2605.12851
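The concentric-zone idea in the PRISM entry above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the 4-connected distance, the rule that scales ring width with nucleus size, and the use of a single mean intensity per ring (in place of the color and co-occurrence texture descriptors) are all assumptions made for illustration.

```python
import math
from collections import deque


def adaptive_ring_width(nucleus, scale=0.5):
    """Hypothetical rule: ring width proportional to the nucleus' equivalent radius."""
    area = sum(1 for row in nucleus for v in row if v)
    radius = math.sqrt(area / math.pi)
    return max(1, round(radius * scale))


def ring_labels(nucleus, ring_width, n_rings):
    """Label pixels: 0 = nucleus, 1..n_rings = perinuclear rings, -1 = outside."""
    h, w = len(nucleus), len(nucleus[0])
    dist = [[None] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            if nucleus[y][x]:
                dist[y][x] = 0
                queue.append((y, x))
    # Multi-source breadth-first search gives the 4-connected distance to the nucleus.
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] is None:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    labels = [[-1] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            d = dist[y][x]
            if d is None:  # mask contained no nucleus pixels
                continue
            ring = 0 if d == 0 else (d - 1) // ring_width + 1
            if ring <= n_rings:
                labels[y][x] = ring
    return labels


def ring_means(image, labels, n_rings):
    """Mean intensity per zone (index 0 is the nucleus); None for empty zones."""
    sums = [0.0] * (n_rings + 1)
    counts = [0] * (n_rings + 1)
    for img_row, lab_row in zip(image, labels):
        for value, ring in zip(img_row, lab_row):
            if ring >= 0:
                sums[ring] += value
                counts[ring] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Because the zones are defined only by distance from the nucleus mask, no cell-membrane boundary is needed, which is the property the abstract emphasizes.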
Accurate boundary detection in high-dimensional data remains a central challenge in unsupervised learning, particularly in the presence of non-linear structures and heterogeneous densities. In this work, we introduce Mean Curvature Boundary Points (MCBP), a novel geometric framework grounded in Geometric Machine Learning that departs from traditional density-based approaches by explicitly modeling the intrinsic curvature of the data manifold. The method relies on a discrete approximation of the shape operator, estimated from local k-nearest neighbor patches, to compute pointwise mean curvature without requiring explicit manifold parametrization. The key insight of MCBP is to use mean curvature as a principled descriptor of boundary structure: high-curvature regions naturally correspond to transitions between clusters, geometric irregularities, and low-density interfaces. This yields a unified geometric interpretation of boundary, outlier, and transition points. We further introduce an adaptive percentile-based thresholding scheme that enables multiscale boundary extraction without relying on ad hoc density parameters. Beyond detection, we propose a curvature-driven data decomposition that separates samples into smooth (low-curvature) and boundary (high-curvature) subsets, effectively acting as a non-linear geometric filtering mechanism. This representation enhances cluster separability and improves the robustness of downstream unsupervised algorithms. Extensive experiments on synthetic and real-world datasets demonstrate that MCBP consistently improves clustering performance, particularly in complex and high-dimensional scenarios. These results position MCBP as a concrete contribution to Geometric Machine Learning, highlighting the potential of curvature-aware analysis as a unifying paradigm bridging differential geometry and data-driven modeling.
https://arxiv.org/abs/2605.04274
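The percentile-based thresholding and the smooth/boundary decomposition described in the MCBP entry above might look roughly like the sketch below. It assumes the pointwise mean curvature values have already been estimated, and it assumes the threshold is applied to their absolute value; neither the function names nor these choices come from the paper.

```python
def percentile(values, p):
    """Linearly interpolated percentile of a non-empty sequence, p in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * p / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def split_by_curvature(curvatures, p=90.0):
    """Split sample indices into smooth and boundary subsets.

    Assumption: a point is a boundary point when the magnitude of its mean
    curvature exceeds the p-th percentile of all magnitudes.
    """
    magnitudes = [abs(c) for c in curvatures]
    threshold = percentile(magnitudes, p)
    smooth = [i for i, m in enumerate(magnitudes) if m <= threshold]
    boundary = [i for i, m in enumerate(magnitudes) if m > threshold]
    return smooth, boundary, threshold


def multiscale_boundaries(curvatures, percentiles=(80.0, 90.0, 95.0)):
    """Boundary index sets at several percentile levels (one per scale)."""
    return {p: split_by_curvature(curvatures, p)[1] for p in percentiles}
```

Because the threshold is a percentile of the data's own curvature distribution, no density parameter has to be tuned; changing `p` moves between coarser and finer boundary sets.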
Traditional Shot Boundary Detection (SBD) inherently struggles with complex transitions by formulating the task around isolated cut points, frequently yielding corrupted video shots. We address this fundamental limitation by formalizing the Shot Transition Detection (STD) task. Rather than searching for ambiguous points, STD explicitly detects the continuous temporal segments of transitions. To tackle this, we propose TransVLM, a Vision-Language Model (VLM) framework for STD. Unlike regular VLMs that predominantly rely on spatial semantics and struggle with fine-grained inter-shot dynamics, our method explicitly injects optical flow as a critical motion prior at the input stage. Through a simple yet effective feature-fusion strategy, TransVLM directly processes concatenated color and motion representations, significantly enhancing its temporal awareness without incurring any additional visual token overhead on the language backbone. To overcome the severe class imbalance in public data, we design a scalable data engine to synthesize diverse transition videos for robust training, alongside a comprehensive benchmark for STD. Extensive experiments demonstrate that TransVLM achieves superior overall performance, outperforming traditional heuristic methods, specialized spatiotemporal networks, and top-tier VLMs. This work has been deployed to production.
https://arxiv.org/abs/2604.27975
Shot Boundary Detection (SBD) aims to automatically identify shot changes and divide a video into coherent shots. While SBD has been widely studied in the literature, existing state-of-the-art methods often produce non-interpretable boundaries on transitions, miss subtle yet harmful discontinuities, and rely on noisy, low-diversity annotations and outdated benchmarks. To alleviate these limitations, we propose OmniShotCut, which formulates SBD as structured relational prediction and jointly estimates shot ranges together with intra-shot and inter-shot relations using a shot query-based dense video Transformer. To avoid imprecise manual labeling, we adopt a fully synthetic transition pipeline that automatically reproduces major transition families with precise boundaries and parameterized variants. We also introduce OmniShotCutBench, a modern wide-domain benchmark enabling holistic and diagnostic evaluation.
https://arxiv.org/abs/2604.24762
Long-form video understanding remains fundamentally challenged by pervasive spatiotemporal redundancy and intricate narrative dependencies that span extended temporal horizons. While recent structured representations compress visual information effectively, they frequently sacrifice temporal coherence, which is critical for causal reasoning. Meanwhile, existing multi-agent frameworks operate through rigid, pre-defined workflows that fail to adapt their reasoning strategies to question-specific demands. In this paper, we introduce HiCrew, a hierarchical multi-agent framework that addresses these limitations through three core contributions. First, we propose a Hybrid Tree structure that leverages shot boundary detection to preserve temporal topology while performing relevance-guided hierarchical clustering within semantically coherent segments. Second, we develop a Question-Aware Captioning mechanism that synthesizes intent-driven visual prompts to generate precision-oriented semantic descriptions. Third, we integrate a Planning Layer that dynamically orchestrates agent collaboration by adaptively selecting roles and execution paths based on question complexity. Extensive experiments on EgoSchema and NExT-QA validate the effectiveness of our approach, demonstrating strong performance across diverse question types with particularly pronounced gains in temporal and causal reasoning tasks that benefit from our hierarchical structure-preserving design.
https://arxiv.org/abs/2604.21444
Camouflaged Object Detection (COD) is challenging due to the high degree of similarity between camouflaged objects and their surrounding backgrounds. Current COD methods mainly rely on edge extraction in the spatial domain and local pixel-level information, neglecting the importance of global structural features. Additionally, they fail to effectively leverage phase spectrum information within frequency domain features. To this end, we propose BASFNet, a COD framework based on boundary-aware frequency domain and spatial domain fusion. Our method uses dual guided integration of frequency domain and spatial domain features. A phase-spectrum-based frequency-enhanced edge exploration module (FEEM) and a spatial core segmentation module (SCSM) are introduced to jointly capture the boundary and object features of camouflaged objects. These features are then integrated through a spatial-frequency fusion interaction module (SFFIM). Furthermore, boundary detection is optimized through a boundary-aware training strategy. BASFNet outperforms existing state-of-the-art methods on three benchmark datasets, validating the effectiveness of fusing frequency and spatial domain information in COD tasks.
https://arxiv.org/abs/2604.17879
Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized, or incorrect responses, which limits their applications when there is no clear understanding of the topics on which their answers can be trusted. In this research, we introduce a novel algorithm, named GMRL-BD, designed to identify the untrustworthy boundaries (in terms of topics) of a given LLM, with black-box access to the LLM and under specific query constraints. Based on a general Knowledge Graph (KG) derived from Wikipedia, our algorithm incorporates multiple reinforcement learning agents to efficiently identify topics (nodes in the KG) where the LLM is likely to generate biased answers. Our experiments demonstrate the efficiency of our algorithm, which can detect the untrustworthy boundary with only a limited number of queries to the LLM. Additionally, we have released a new dataset covering popular LLMs including Llama2, Vicuna, Falcon, Qwen2, Gemma2 and Yi-1.5, along with labels indicating the topics on which each LLM is likely to be biased.
https://arxiv.org/abs/2604.05483
In many practical LLM deployments, a single guardrail is used for both prompt and response moderation. Prompt moderation operates on fully observed text, whereas streaming response moderation requires safety decisions to be made over partial generations. Existing text-based streaming guardrails commonly frame this output-side problem as boundary detection, training models to identify the earliest prefix at which a response has already become unsafe. In this work, we introduce StreamGuard, a unified model-agnostic streaming guardrail that instead formulates moderation as a forecasting problem: given a partial prefix, the model predicts the expected harmfulness of likely future continuations. We supervise this prediction using Monte Carlo rollouts, which enables early intervention without requiring exact token-level boundary annotations. Across standard safety benchmarks, StreamGuard performs strongly both for input moderation and for streaming output moderation. At the 8B scale, StreamGuard improves aggregated input-moderation F1 from 86.7 to 88.2 and aggregated streaming output-moderation F1 from 80.4 to 81.9 relative to Qwen3Guard-Stream-8B-strict. On the QWENGUARDTEST response_loc streaming benchmark, StreamGuard reaches 97.5 F1, 95.1 recall, and 92.6% on-time intervention, compared to 95.9 F1, 92.1 recall, and 89.9% for Qwen3Guard-Stream-8B-strict, while reducing the miss rate from 7.9% to 4.9%. We further show that forecasting-based supervision transfers effectively across tokenizers and model families: with transferred targets, Gemma3-StreamGuard-1B reaches 81.3 response-moderation F1, 98.2 streaming F1, and a 3.5% miss rate. These results show that strong end-to-end streaming moderation can be obtained without exact boundary labels, and that forecasting future risk is an effective supervision strategy for low-latency safety intervention.
https://arxiv.org/abs/2604.03962
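The forecasting target described in the StreamGuard entry above (expected harmfulness of likely continuations, estimated with Monte Carlo rollouts) can be sketched as a plain average over sampled continuations. Everything named here is hypothetical: `sample_continuation` stands in for a generator model, `harm_score` stands in for a judge that returns a value in [0, 1], and the rollout count is arbitrary. The abstract does not specify any of these details.

```python
import random


def rollout_target(prefix, sample_continuation, harm_score, n_rollouts=8, seed=0):
    """Monte Carlo estimate of the expected harmfulness of completions of `prefix`.

    sample_continuation(prefix, rng) -> str   (hypothetical generator stand-in)
    harm_score(full_text) -> float in [0, 1]  (hypothetical judge stand-in)
    """
    if n_rollouts <= 0:
        raise ValueError("n_rollouts must be positive")
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rollouts):
        continuation = sample_continuation(prefix, rng)
        total += harm_score(prefix + continuation)
    return total / n_rollouts


def prefix_targets(tokens, sample_continuation, harm_score, n_rollouts=8):
    """One soft target per prefix length, so no exact unsafe-boundary label is needed."""
    targets = []
    for end in range(1, len(tokens) + 1):
        prefix = " ".join(tokens[:end])
        targets.append(rollout_target(prefix, sample_continuation, harm_score, n_rollouts))
    return targets
```

A guardrail trained to regress these per-prefix targets would learn to raise its score before the unsafe text is actually emitted, which is the early-intervention behavior the abstract describes.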
Partial deepfake speech detection requires identifying manipulated regions that may occur within short temporal portions of an otherwise bona fide utterance, making the task particularly challenging for conventional utterance-level classifiers. We propose a split-and-conquer framework that decomposes the problem into two stages: boundary detection and segment-level classification. A dedicated boundary detector first identifies temporal transition points, allowing the audio signal to be divided into segments that are expected to contain acoustically consistent content. Each resulting segment is then evaluated independently to determine whether it corresponds to bona fide or fake speech. This formulation simplifies the learning objective by explicitly separating temporal localization from authenticity assessment, allowing each component to focus on a well-defined task. To further improve robustness, we introduce a reflection-based multi-length training strategy that converts variable-duration segments into several fixed input lengths, producing diverse feature-space representations. Each stage is trained using multiple configurations with different feature extractors and augmentation strategies, and their complementary predictions are fused to obtain improved final models. Experiments on the PartialSpoof benchmark demonstrate state-of-the-art performance across multiple temporal resolutions as well as at the utterance level, with substantial improvements in the accurate detection and localization of spoofed regions. In addition, the proposed method achieves state-of-the-art performance on the Half-Truth dataset, further confirming the robustness and generalization capability of the framework.
https://arxiv.org/abs/2604.02913
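The reflection-based multi-length strategy in the partial deepfake entry above converts variable-duration segments into several fixed input lengths. One plausible reading, sketched below, mirrors the segment back and forth until the target length is reached and center-crops segments that are already too long. The exact reflection and cropping rules are assumptions, as are the function names and the example lengths.

```python
def reflect_to_length(segment, target_len):
    """Return `segment` as a list of exactly `target_len` samples.

    Short segments are extended by mirroring (endpoints are not repeated);
    long segments are center-cropped. Both rules are illustrative assumptions.
    """
    n = len(segment)
    if n == 0:
        raise ValueError("segment must not be empty")
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    if n >= target_len:
        start = (n - target_len) // 2
        return list(segment[start:start + target_len])
    if n == 1:
        return [segment[0]] * target_len
    period = 2 * (n - 1)  # one forward pass plus one backward pass
    out = []
    for i in range(target_len):
        j = i % period
        out.append(segment[j] if j < n else segment[period - j])
    return out


def multi_length_views(segment, lengths=(4, 8, 16)):
    """One fixed-length view of the same segment per target length."""
    return {length: reflect_to_length(segment, length) for length in lengths}
```

Unlike zero-padding, mirroring keeps every frame of the padded input acoustically consistent with the original segment, which is presumably why a reflection-style extension is attractive for short spoofed regions.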
Turn-taking modeling is fundamental to spoken dialogue systems, yet its evaluation remains fragmented and often limited to binary boundary detection under narrow interaction settings. Such protocols hinder systematic comparison and obscure model weaknesses across conversational conditions. We present CoDeTT, a context-aware decision benchmark for turn-taking evaluation. CoDeTT formulates turn-taking as a structured decision problem and constructs a multi-scenario dataset with fine-grained decision categories and controlled context variations. Under a unified evaluation protocol, we assess representative existing models and observe substantial performance disparities across decision types and interaction scenarios. CoDeTT provides a standardized benchmark for systematic and context-aware evaluation of turn-taking systems. The benchmark dataset and evaluation toolkit are available at this https URL.
https://arxiv.org/abs/2603.25434
For accurate glaucoma diagnosis and monitoring, reliable retinal layer segmentation in OCT images is essential. However, existing 2D segmentation methods often suffer from slice-to-slice inconsistencies due to the lack of contextual information across adjacent B-scans. 3D segmentation methods are better for capturing slice-to-slice context, but they require expensive computational resources. To address these limitations, we propose a 2.5D segmentation framework that incorporates a novel cross-slice feature fusion (CFF) module into a U-Net-like architecture. The CFF module fuses inter-slice features to effectively capture contextual information, enabling consistent boundary detection across slices and improved robustness in noisy regions. The framework was validated on both a clinical dataset and the publicly available DUKE DME dataset. Compared to other segmentation methods without the CFF module, the proposed method achieved an 8.56% reduction in mean absolute distance and a 13.92% reduction in root mean square error, demonstrating improved segmentation accuracy and robustness. Overall, the proposed 2.5D framework balances contextual awareness and computational efficiency, enabling anatomically reliable retinal layer delineation for automated glaucoma evaluation and potential clinical applications.
https://arxiv.org/abs/2603.24115
Retrieval-Augmented Generation (RAG) systems for biomedical literature are typically evaluated using ranking metrics like Mean Reciprocal Rank (MRR), which measure how well the system identifies the single most relevant chunk. We argue that for full-text scientific documents, this paradigm is incomplete: it rewards retrieval precision while ignoring retrieval breadth -- the ability to surface evidence from across a document's structural sections. We propose GraLC-RAG, a framework that unifies late chunking with graph-aware structural intelligence, introducing structure-aware chunk boundary detection, UMLS knowledge graph infusion, and graph-guided hybrid retrieval. We evaluate six strategies on 2,359 IMRaD-filtered PubMed Central articles using 2,033 cross-section questions and two metric families: standard ranking metrics (MRR, Recall@k) and structural coverage metrics (SecCov@k, CS Recall). Our results expose a sharp divergence: content-similarity methods achieve the highest MRR (0.517) but always retrieve from a single section, while structure-aware methods retrieve from up to 15.6x more sections. Generation experiments show that KG-infused retrieval narrows the answer-quality gap to delta-F1 = 0.009 while maintaining 4.6x section diversity. These findings demonstrate that standard metrics systematically undervalue structural retrieval and that closing the multi-section synthesis gap is a key open problem for biomedical RAG.
https://arxiv.org/abs/2603.22633
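The contrast drawn in the GraLC-RAG entry above, between ranking metrics and structural coverage metrics, is easier to see with both computed side by side. MRR below follows its standard definition. The coverage function is only an assumed reading of SecCov@k (the fraction of gold sections that appear among the top-k retrieved chunks); the paper's exact definition is not given in the abstract, and all names are illustrative.

```python
def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Standard MRR: average of 1/rank of the first relevant item per query."""
    if not ranked_lists:
        raise ValueError("need at least one query")
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break  # only the first hit counts; no hit contributes 0
    return total / len(ranked_lists)


def section_coverage_at_k(ranked_chunks, chunk_section, gold_sections, k):
    """Assumed SecCov@k: share of gold sections represented in the top-k chunks."""
    if not gold_sections:
        raise ValueError("gold_sections must not be empty")
    retrieved = {chunk_section[chunk] for chunk in ranked_chunks[:k]}
    return len(retrieved & set(gold_sections)) / len(gold_sections)
```

A retriever can score a perfect reciprocal rank on a query while covering only one of several gold sections, which is the divergence the abstract reports between content-similarity and structure-aware methods.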
Deep learning dominates speech processing but relies on massive datasets and global backpropagation-guided weight updates, and it produces entangled representations. Assembly Calculus (AC), which models sparse neuronal assemblies via Hebbian plasticity and winner-take-all competition, offers a biologically grounded alternative, yet prior work focused on discrete symbolic inputs. We introduce an AC-based speech processing framework that operates directly on continuous speech by combining three key contributions: (i) neural encoding that converts speech into assembly-compatible spike patterns using probabilistic mel binarisation and population-coded MFCCs; (ii) a multi-area architecture organising assemblies across hierarchical timescales and classes; and (iii) cross-area update schemes for downstream tasks. Applied to two core tasks of boundary detection and segment classification, our framework detects phone (F1=0.69) and word (F1=0.61) boundaries without any weight training, and achieves 47.5% and 45.1% accuracy on phone and command recognition. These results show that AC-based dynamical systems are a viable alternative to deep learning for speech processing.
https://arxiv.org/abs/2603.16923
Music structure segmentation is a key task in audio analysis, but existing models perform poorly on Electronic Dance Music (EDM). This problem exists because most approaches rely on lyrical or harmonic similarity, which works well for pop music but not for EDM. EDM structure is instead defined by changes in energy, rhythm, and timbre, with different sections such as buildup, drop, and breakdown. We introduce EDMFormer, a transformer model that combines self-supervised audio embeddings using an EDM-specific dataset and taxonomy. We release this dataset as EDM-98: a group of 98 professionally annotated EDM tracks. EDMFormer improves boundary detection and section labelling compared to existing models, particularly for drops and buildups. The results suggest that combining learned representations with genre-specific data and structural priors is effective for EDM and could be applied to other specialized music genres or broader audio domains.
https://arxiv.org/abs/2603.08759
Intracoronary Optical Coherence Tomography (OCT) enables high-resolution visualization of coronary vessel anatomy but presents challenges due to noise, imaging artifacts, and complex tissue structures. This paper proposes a fully automated pipeline for vessel segmentation and classification in OCT images using machine learning techniques. The proposed method integrates image preprocessing, guidewire artifact removal, polar-to-Cartesian transformation, unsupervised K-means clustering, and local feature extraction. These features are used to train Logistic Regression and Support Vector Machine classifiers for pixel-wise vessel classification. Experimental results demonstrate excellent performance, achieving precision, recall, and F1-score values up to 1.00 and overall classification accuracy of 99.68%. The proposed approach provides accurate vessel boundary detection while maintaining low computational complexity and requiring minimal manual annotation. This method offers a reliable and efficient solution for automated OCT image analysis and has potential applications in clinical decision support and real-time medical image processing.
https://arxiv.org/abs/2602.15579
Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3--12 s clips) and the narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, narrative-task prompts (multi-scene summary, narrative summary, character motivation, and event boundary detection) elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Together, our results position long-form narrative movies as a principled testbed for probing biologically relevant temporal integration and interpretable representations in long-context MLLMs.
https://arxiv.org/abs/2602.07570
Boundary detection of irregular and translucent objects is an important problem with applications in medical imaging, environmental monitoring and manufacturing, many of which are constrained by scarce labeled data and limited in situ computational resources. While recent image segmentation studies focus on segmentation mask alignment with ground truth, the task of boundary detection remains understudied, especially in the low data regime. In this work, we present a lightweight discrete diffusion contour refinement pipeline for robust boundary detection in the low data regime. We use a Convolutional Neural Network (CNN) architecture with self-attention layers as the core of our pipeline, and condition on a segmentation mask, iteratively denoising a sparse contour representation. We introduce multiple novel adaptations for improved low-data efficacy and inference efficiency, including a simplified diffusion process, a customized model architecture, and minimal post-processing, to produce a dense, isolated contour given a dataset of fewer than 500 training images. Our method outperforms several SOTA baselines on the medical imaging dataset KVASIR and is competitive on HAM10K and our custom wildfire dataset, Smoke, while improving inference framerate by 3.5x.
https://arxiv.org/abs/2602.05880
Chunking strategies significantly impact the effectiveness of Retrieval-Augmented Generation (RAG) systems. Existing methods operate within fixed-granularity paradigms that rely on static boundary identification, limiting their adaptability to diverse query requirements. This paper presents FreeChunker, a Cross-Granularity Encoding Framework that fundamentally transforms the traditional chunking paradigm: the framework treats sentences as atomic units and shifts from static chunk segmentation to flexible retrieval supporting arbitrary sentence combinations. This paradigm shift not only avoids the computational overhead required for semantic boundary detection, but also enhances adaptability to complex queries. Experimental evaluation on LongBench V2 demonstrates that FreeChunker has significant advantages in both retrieval performance and time efficiency compared to existing chunking methods. The pre-trained models and code are available at this https URL.
https://arxiv.org/abs/2510.20356
Administrative phone tasks drain roughly 1 trillion USD annually from U.S. healthcare, with over 500 million insurance-benefit verification calls manually handled in 2024. We introduce INSURE-Dial, to our knowledge the first public benchmark for developing and assessing compliance-aware voice agents for phase-aware call auditing with span-based compliance verification. The corpus includes 50 de-identified, AI-initiated calls with live insurance representatives (mean 71 turns/call) and 1,000 synthetically generated calls that mirror the same workflow. All calls are annotated with a phase-structured JSON schema covering IVR navigation, patient identification, coverage status, medication checks (up to two drugs), and agent identification (CRN), and each phase is labeled for Information and Procedural compliance under explicit ask/answer logic. We define two novel evaluation tasks: (1) Phase Boundary Detection (span segmentation under phase-specific acceptance rules) and (2) Compliance Verification (IC/PC decisions given fixed spans). Per-phase scores are strong across small, low-latency baselines, but end-to-end reliability is constrained by span-boundary errors. On real calls, full-call exact segmentation is low, showing a gap between conversational fluency and audit-grade evidence.
https://arxiv.org/abs/2602.18448
Proficiency in microanastomosis is a critical surgical skill in neurosurgery, where the ability to precisely manipulate fine instruments is crucial to successful outcomes. These procedures require sustained attention, coordinated hand movements, and highly refined motor skills, underscoring the need for objective and systematic methods to evaluate and enhance microsurgical training. Conventional assessment approaches typically rely on expert raters supervising the procedures or reviewing surgical videos, which is an inherently subjective process prone to inter-rater variability, inconsistency, and significant time investment. These limitations highlight the necessity for automated and scalable solutions. To address this challenge, we introduce a novel AI-driven framework for automated action segmentation and performance assessment in microanastomosis procedures, designed to operate efficiently on edge computing platforms. The proposed system comprises three main components: (1) an object tip tracking and localization module based on YOLO and DeepSORT; (2) an action segmentation module leveraging self-similarity matrix for action boundary detection and unsupervised clustering; and (3) a supervised classification module designed to evaluate surgical gesture proficiency. Experimental validation on a dataset of 58 expert-rated microanastomosis videos demonstrates the effectiveness of our approach, achieving a frame-level action segmentation accuracy of 92.4% and an overall skill classification accuracy of 85.5% in replicating expert evaluations. These findings demonstrate the potential of the proposed method to provide objective, real-time feedback in microsurgical education, thereby enabling more standardized, data-driven training protocols and advancing competency assessment in high-stakes surgical environments.
https://arxiv.org/abs/2512.23942