3D pose estimation from sparse multi-views is a critical task for numerous applications, including action recognition, sports analysis, and human-robot interaction. Optimization-based methods typically follow a two-stage pipeline, first detecting 2D keypoints in each view and then associating these detections across views to triangulate the 3D pose. Existing methods rely on mere pairwise associations to model this correspondence problem, treating global consistency between views (i.e., cycle consistency) as a soft constraint. Yet, reconciling these constraints for multiple views becomes brittle when spurious associations propagate errors. We thus propose COMPOSE, a novel framework that formulates multi-view pose correspondence matching as a hypergraph partitioning problem rather than through pairwise association. While the complexity of the resulting integer linear program grows exponentially in theory, we introduce an efficient geometric pruning strategy to substantially reduce the search space. COMPOSE achieves improvements of up to 23% in average precision over previous optimization-based methods and up to 11% over self-supervised end-to-end learned methods, offering a promising solution to a widely studied problem.
https://arxiv.org/abs/2601.09698
Identifying individual animals in long-duration videos is essential for behavioral ecology, wildlife monitoring, and livestock management. Traditional methods require extensive manual annotation, while existing self-supervised approaches are computationally demanding and ill-suited for long sequences due to memory constraints and temporal error propagation. We introduce a highly efficient, self-supervised method that reframes animal identification as a global clustering task rather than a sequential tracking problem. Our approach assumes a known, fixed number of individuals within a single video -- a common scenario in practice -- and requires only bounding box detections and the total count. By sampling pairs of frames, using a frozen pre-trained backbone, and employing a self-bootstrapping mechanism with the Hungarian algorithm for in-batch pseudo-label assignment, our method learns discriminative features without identity labels. We adapt a Binary Cross Entropy loss from vision-language models, enabling state-of-the-art accuracy ($>$97\%) while consuming less than 1 GB of GPU memory per batch -- an order of magnitude less than standard contrastive methods. Evaluated on challenging real-world datasets (3D-POP pigeons and 8-calves feeding videos), our framework matches or surpasses supervised baselines trained on over 1,000 labeled frames, effectively removing the manual annotation bottleneck. This work enables practical, high-accuracy animal identification on consumer-grade hardware, with broad applicability in resource-constrained research settings. All code written for this paper is available \href{this https URL}{here}.
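The in-batch pseudo-label step described above can be sketched with SciPy's standard Hungarian solver. This is a minimal illustration, not the authors' implementation: the function name, the cosine-similarity cost, and the `(k, d)` embedding shapes are assumptions based only on the abstract (frozen-backbone features for a known, fixed number `k` of individuals per frame).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_pseudo_labels(emb_a, emb_b):
    """Match detections between two sampled frames by embedding similarity.

    emb_a, emb_b: (k, d) L2-normalised feature arrays for the k individuals
    detected in each frame (hypothetical shapes; the paper only states that
    the number of individuals is known and fixed).
    Returns an index array `perm` such that emb_b[perm[i]] is the
    pseudo-label partner of emb_a[i].
    """
    cost = -emb_a @ emb_b.T              # negate to maximise cosine similarity
    rows, cols = linear_sum_assignment(cost)
    return cols[np.argsort(rows)]        # permutation aligned with emb_a's order
```

The Hungarian step guarantees a one-to-one pairing within the batch, which is what lets the pseudo-labels stay consistent with the fixed individual count.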
https://arxiv.org/abs/2601.09663
Large-scale vision-language models such as CLIP achieve strong zero-shot recognition but struggle with classes that are rarely seen during pretraining, including newly emerging entities and culturally specific categories. We introduce LiteEmbed, a lightweight framework for few-shot personalization of CLIP that enables new classes to be added without retraining its encoders. LiteEmbed performs subspace-guided optimization of text embeddings within CLIP's vocabulary, leveraging a PCA-based decomposition that disentangles coarse semantic directions from fine-grained variations. Two complementary objectives, coarse alignment and fine separation, jointly preserve global semantic consistency while enhancing discriminability among visually similar classes. Once optimized, the embeddings are plug-and-play, seamlessly substituting CLIP's original text features across classification, retrieval, segmentation, and detection tasks. Extensive experiments demonstrate substantial gains over prior methods, establishing LiteEmbed as an effective approach for adapting CLIP to underrepresented, rare, or unseen classes.
https://arxiv.org/abs/2601.09661
Underwater video analysis is particularly challenging due to factors such as low lighting, color distortion, and turbidity, which compromise visual data quality and directly impact the performance of perception modules in robotic applications. This work proposes AquaFeat+, a plug-and-play pipeline designed to enhance features specifically for automated vision tasks, rather than for human perceptual quality. The architecture includes modules for color correction, hierarchical feature enhancement, and an adaptive residual output, which are trained end-to-end and guided directly by the loss function of the final application. Trained and evaluated on the FishTrack23 dataset, AquaFeat+ achieves significant improvements in object detection, classification, and tracking metrics, validating its effectiveness for enhancing perception tasks in underwater robotic applications.
https://arxiv.org/abs/2601.09652
Large Language Models (LLMs), despite their remarkable capabilities across NLP tasks, struggle with phonologically-grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification/analysis and generation. Our approach implements a comprehensive taxonomy of Greek rhyme types, including Pure, Rich, Imperfect, Mosaic, and Identical Pre-rhyme Vowel (IDV) patterns, and employs an agentic generation pipeline with phonological verification. We evaluate multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0 and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant "Reasoning Gap": while native-like models (Claude 3.7) perform intuitively (40\% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54\%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails catastrophically (under 4\% valid poems), while our hybrid verification loop restores performance to 73.1\%. We release our system and a crucial, rigorously cleaned corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
https://arxiv.org/abs/2601.09631
Most Multimodal Sentiment Analysis research has focused on point-wise regression. While straightforward, this approach is sensitive to label noise and neglects whether one sample is more positive than another, resulting in unstable predictions and poor correlation alignment. Pairwise ordinal learning frameworks emerged to address this gap, capturing relative order by learning from comparisons. Yet, they introduce two new trade-offs: First, they assign uniform importance to all comparisons, failing to adaptively focus on hard-to-rank samples. Second, they employ static ranking margins, which fail to reflect the varying semantic distances between sentiment groups. To address this, we propose a Two-Stage Group-wise Ranking and Calibration Framework (GRCF) that adapts the philosophy of Group Relative Policy Optimization (GRPO). Our framework resolves these trade-offs by simultaneously preserving relative ordinal structure, ensuring absolute score calibration, and adaptively focusing on difficult samples. Specifically, Stage 1 introduces a GRPO-inspired Advantage-Weighted Dynamic Margin Ranking Loss to build a fine-grained ordinal structure. Stage 2 then employs an MAE-driven objective to align prediction magnitudes. To validate its generalizability, we extend GRCF to classification tasks, including multimodal humor detection and sarcasm detection. GRCF achieves state-of-the-art performance on core regression benchmarks, while also showing strong generalizability in classification tasks.
https://arxiv.org/abs/2601.09606
We study permutation (jumbled/Abelian) pattern matching over a general alphabet $\Sigma$. Given a pattern $P$ of length $m$ and a text $T$ of length $n$, the classical task is to decide whether $T$ contains a length-$m$ substring whose Parikh vector equals that of $P$. While this existence problem admits a linear-time sliding-window solution, many practical applications require optimization and packing variants beyond mere detection. We present a unified sliding-window framework based on maintaining the Parikh-vector difference between $P$ and the current window of $T$, enabling permutation matching in $O(n + \sigma)$ time and $O(\sigma)$ space, where $\sigma = |\Sigma|$. Building on this foundation, we introduce a combinatorial-optimization variant that we call Maximum Feasible Substring under Pattern Supply (MFSP): find the longest substring $S$ of $T$ whose symbol counts are component-wise bounded by those of $P$. We show that MFSP can also be solved in $O(n + \sigma)$ time via a two-pointer feasibility maintenance algorithm, providing an exact packing interpretation of $P$ as a resource budget. Finally, we address non-overlapping occurrence selection by modeling each permutation match as an equal-length interval and proving that a greedy earliest-finishing strategy yields a maximum-cardinality set of disjoint matches, computable in linear time once all matches are enumerated. Our results provide concise, provably correct algorithms with tight bounds, and connect frequency-based string matching to packing-style optimization primitives.
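The existence variant's sliding window can be written down directly. The sketch below maintains the Parikh-vector difference together with a count of symbols whose difference is non-zero, so each slide is O(1) amortised and the whole scan is O(n + σ). The function name and the use of a `Counter` (rather than a σ-sized array indexed by symbol) are presentation choices of this sketch, not the paper's.

```python
from collections import Counter

def permutation_matches(pattern, text):
    """Return all start indices i such that text[i:i+m] is a permutation
    (a jumbled/Abelian occurrence) of pattern.

    diff holds the Parikh-vector difference between the pattern and the
    current window; `mismatched` counts symbols with a non-zero difference,
    so a window matches exactly when mismatched == 0.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    diff = Counter(pattern)
    for ch in text[:m]:                 # subtract the first window
        diff[ch] -= 1
    mismatched = sum(1 for v in diff.values() if v != 0)
    out = [0] if mismatched == 0 else []
    for i in range(m, n):
        # slide the window: text[i] enters, text[i - m] leaves
        for ch, delta in ((text[i], -1), (text[i - m], +1)):
            before = diff[ch]
            diff[ch] = before + delta
            if before == 0:             # symbol just became mismatched
                mismatched += 1
            elif diff[ch] == 0:         # symbol just became balanced
                mismatched -= 1
        if mismatched == 0:
            out.append(i - m + 1)
    return out
```

For example, `permutation_matches("abc", "cbadabc")` reports the windows starting at 0 (`"cba"`) and 4 (`"abc"`).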
https://arxiv.org/abs/2601.09577
We explore a situation in which the target domain is accessible, but real-time data annotation is not feasible. Instead, we would like to construct an alternative training set from a large-scale data server so that a competitive model can be obtained. For this problem, because the target domain usually exhibits distinct modes (i.e., semantic clusters representing data distribution), if the training set does not contain these target modes, the model performance would be compromised. While prior works iteratively improve the algorithms themselves, our research explores the often-overlooked potential of optimizing the structure of the data server. Inspired by the hierarchical nature of web search engines, we introduce a hierarchical data server, together with a bipartite mode matching algorithm (BMM) to align source and target modes. For each target mode, we look in the server data tree for the best mode match, which might be large or small in size. Through bipartite matching, we aim for all target modes to be optimally matched with source modes in a one-to-one fashion. Compared with existing training set search algorithms, we show that the matched server modes constitute training sets that have consistently smaller domain gaps with the target domain across object re-identification (re-ID) and detection tasks. Consequently, models trained on our searched training sets have higher accuracy than those trained otherwise. BMM allows data-centric unsupervised domain adaptation (UDA) orthogonal to existing model-centric UDA methods. By combining the BMM with existing UDA methods like pseudo-labeling, further improvement is observed.
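The one-to-one mode alignment in BMM is, at its core, a minimum-cost bipartite matching, which the sketch below solves with SciPy's Hungarian solver. Everything concrete here is an assumption for illustration: the abstract does not specify how the domain gap between a target mode and a server mode is scored, so Euclidean distance between mode centroids stands in for that cost.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_modes(target_centroids, source_centroids):
    """One-to-one matching of target modes to server/source modes.

    target_centroids: (t, d) array, one feature centroid per target mode.
    source_centroids: (s, d) array with s >= t, candidate modes found in
    the hierarchical data server. Euclidean distance between centroids is
    used as a stand-in for the domain-gap cost (assumed metric).
    Returns a list sel with sel[i] = index of the source mode assigned
    to target mode i.
    """
    t = target_centroids[:, None, :]          # (t, 1, d)
    s = source_centroids[None, :, :]          # (1, s, d)
    cost = np.linalg.norm(t - s, axis=-1)     # (t, s) pairwise gap proxy
    rows, cols = linear_sum_assignment(cost)  # minimise the total gap
    return cols[np.argsort(rows)].tolist()
```

Because the assignment is globally optimal, no two target modes can be collapsed onto the same server mode, which mirrors the one-to-one requirement stated above.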
https://arxiv.org/abs/2601.09531
Egocentric Human-Object Interaction (EHOI) analysis is crucial for industrial safety, yet the development of robust models is hindered by the scarcity of annotated domain-specific data. We address this challenge by introducing a data generation framework that combines synthetic data with a diffusion-based process to augment real-world images with realistic Personal Protective Equipment (PPE). We present GlovEgo-HOI, a new benchmark dataset for industrial EHOI, and GlovEgo-Net, a model integrating Glove-Head and Keypoint- Head modules to leverage hand pose information for enhanced interaction detection. Extensive experiments demonstrate the effectiveness of the proposed data generation framework and GlovEgo-Net. To foster further research, we release the GlovEgo-HOI dataset, augmentation pipeline, and pre-trained models at: GitHub project.
https://arxiv.org/abs/2601.09528
Object detectors often perform well in-distribution, yet degrade sharply on a different benchmark. We study cross-dataset object detection (CD-OD) through a lens of setting specificity. We group benchmarks into setting-agnostic datasets with diverse everyday scenes and setting-specific datasets tied to a narrow environment, and evaluate a standard detector family across all train--test pairs. This reveals a clear structure in CD-OD: transfer within the same setting type is relatively stable, while transfer across setting types drops substantially and is often asymmetric. The most severe breakdowns occur when transferring from specific sources to agnostic targets, and persist after open-label alignment, indicating that domain shift dominates in the hardest regimes. To disentangle domain shift from label mismatch, we compare closed-label transfer with an open-label protocol that maps predicted classes to the nearest target label using CLIP similarity. Open-label evaluation yields consistent but bounded gains, and many corrected cases correspond to semantic near-misses supported by the image evidence. Overall, we provide a principled characterization of CD-OD under setting specificity and practical guidance for evaluating detectors under distribution shift. Code will be released at \href{this https URL}{this https URL}.
https://arxiv.org/abs/2601.09497
We present PrivLEX, a novel image privacy classifier that grounds its decisions in legally defined personal data concepts. PrivLEX is the first interpretable privacy classifier aligned with legal concepts that leverages the recognition capabilities of Vision-Language Models (VLMs). PrivLEX relies on zero-shot VLM concept detection to provide interpretable classification through a label-free Concept Bottleneck Model, without requiring explicit concept labels during training. We demonstrate PrivLEX's ability to identify personal data concepts that are present in images. We further analyse the sensitivity of such concepts as perceived by human annotators of image privacy datasets.
https://arxiv.org/abs/2601.09449
This work proposes a contextualised detection framework for implicitly hateful speech, implemented as a multi-agent system comprising a central Moderator Agent and dynamically constructed Community Agents representing specific demographic groups. Our approach explicitly integrates socio-cultural context from publicly available knowledge sources, enabling identity-aware moderation that surpasses state-of-the-art prompting methods (zero-shot prompting, few-shot prompting, chain-of-thought prompting) and alternative approaches on a challenging ToxiGen dataset. We enhance the technical rigour of performance evaluation by incorporating balanced accuracy as a central metric of classification fairness that accounts for the trade-off between true positive and true negative rates. We demonstrate that our community-driven consultative framework significantly improves both classification accuracy and fairness across all target groups.
https://arxiv.org/abs/2601.09342
Delineating wildfire affected areas using satellite imagery remains challenging due to irregular and spatially heterogeneous spectral changes across the electromagnetic spectrum. While recent deep learning approaches achieve high accuracy when high-resolution multispectral data are available, their applicability in operational settings, where a quick delineation of the burn scar shortly after a wildfire incident is required, is limited by the trade-off between spatial resolution and temporal revisit frequency of current satellite systems. To address this limitation, we propose a novel deep learning model, namely BAM-MRCD, which employs multi-resolution, multi-source satellite imagery (MODIS and Sentinel-2) for the timely production of detailed burnt area maps with high spatial and temporal resolution. Our model manages to detect even small scale wildfires with high accuracy, surpassing similar change detection models as well as solid baselines. All data and code are available in the GitHub repository: this https URL.
https://arxiv.org/abs/2601.09262
Label assignment is a critical component in object detectors, particularly within DETR-style frameworks where the one-to-one matching strategy, despite its end-to-end elegance, suffers from slow convergence due to sparse supervision. While recent works have explored one-to-many assignments to enrich supervisory signals, they often introduce complex, architecture-specific modifications and typically focus on a single auxiliary strategy, lacking a unified and scalable design. In this paper, we first systematically investigate the effects of ``one-to-many'' supervision and reveal a surprising insight that performance gains are driven not by the sheer quantity of supervision, but by the diversity of the assignment strategies employed. This finding suggests that a more elegant, parameter-efficient approach is attainable. Building on this insight, we propose LoRA-DETR, a flexible and lightweight framework that seamlessly integrates diverse assignment strategies into any DETR-style detector. Our method augments the primary network with multiple Low-Rank Adaptation (LoRA) branches during training, each instantiating a different one-to-many assignment rule. These branches act as auxiliary modules that inject rich, varied supervisory gradients into the main model and are discarded during inference, thus incurring no additional computational cost. This design promotes robust joint optimization while maintaining the architectural simplicity of the original detector. Extensive experiments on different baselines validate the effectiveness of our approach. Our work presents a new paradigm for enhancing detectors, demonstrating that diverse ``one-to-many'' supervision can be integrated to achieve state-of-the-art results without compromising model elegance.
https://arxiv.org/abs/2601.09247
Satellite videos provide continuous observations of surface dynamics but pose significant challenges for multi-object tracking (MOT), especially under unstabilized conditions where platform jitter and the weak appearance of tiny objects jointly degrade tracking performance. To address this problem, we propose DeTracker, a joint detection-and-tracking framework tailored for unstabilized satellite videos. DeTracker introduces a Global--Local Motion Decoupling (GLMD) module that explicitly separates satellite platform motion from true object motion through global alignment and local refinement, leading to improved trajectory stability and motion estimation accuracy. In addition, a Temporal Dependency Feature Pyramid (TDFP) module is developed to perform cross-frame temporal feature fusion, enhancing the continuity and discriminability of tiny-object representations. We further construct a new benchmark dataset, SDM-Car-SU, which simulates multi-directional and multi-speed platform motions to enable systematic evaluation of tracking robustness under varying motion perturbations. Extensive experiments on both simulated and real unstabilized satellite videos demonstrate that DeTracker significantly outperforms existing methods, achieving 61.1% MOTA on SDM-Car-SU and 47.3% MOTA on real satellite video data.
https://arxiv.org/abs/2601.09240
Substation meters play a critical role in monitoring and ensuring the stable operation of power grids, yet their detection of cracks and other physical defects is often hampered by a severe scarcity of annotated samples. To address this few-shot generation challenge, we propose a novel framework that integrates Knowledge Embedding and Hypernetwork-Guided Conditional Control into a Stable Diffusion pipeline, enabling realistic and controllable synthesis of defect images from limited data. First, we bridge the substantial domain gap between natural-image pre-trained models and industrial equipment by fine-tuning a Stable Diffusion backbone using DreamBooth-style knowledge embedding. This process encodes the unique structural and textural priors of substation meters, ensuring generated images retain authentic meter characteristics. Second, we introduce a geometric crack modeling module that parameterizes defect attributes--such as location, length, curvature, and branching pattern--to produce spatially constrained control maps. These maps provide precise, pixel-level guidance during generation. Third, we design a lightweight hypernetwork that dynamically modulates the denoising process of the diffusion model in response to the control maps and high-level defect descriptors, achieving a flexible balance between generation fidelity and controllability. Extensive experiments on a real-world substation meter dataset demonstrate that our method substantially outperforms existing augmentation and generation baselines. It reduces Frechet Inception Distance (FID) by 32.7%, increases diversity metrics, and--most importantly--boosts the mAP of a downstream defect detector by 15.3% when trained on augmented data. The framework offers a practical, high-quality data synthesis solution for industrial inspection systems where defect samples are rare.
https://arxiv.org/abs/2601.09238
Infrared object detection focuses on identifying and locating objects in complex environments (\eg, dark, snow, and rain) where visible imaging cameras are disabled by poor illumination. However, due to low contrast and weak edge information in infrared images, it is challenging to extract discriminative object features for robust detection. To deal with this issue, we propose a novel vision-language representation learning paradigm for infrared object detection. An additional textual supervision with rich semantic information is explored to guide the disentanglement of object and non-object features. Specifically, we propose a Semantic Feature Alignment (SFA) module to align the object features with the corresponding text features. Furthermore, we develop an Object Feature Disentanglement (OFD) module that disentangles text-aligned object features and non-object features by minimizing their correlation. Finally, the disentangled object features are entered into the detection head. In this manner, the detection performance can be remarkably enhanced via more discriminative and less noisy features. Extensive experimental results demonstrate that our approach achieves superior performance on two benchmarks: M\textsuperscript{3}FD (83.7\% mAP), FLIR (86.1\% mAP). Our code will be publicly available once the paper is accepted.
https://arxiv.org/abs/2601.09228
Purpose: Myocardium segmentation in echocardiography videos is a challenging task due to low contrast, noise, and anatomical variability. Traditional deep learning models either process frames independently, ignoring temporal information, or rely on memory-based feature propagation, which accumulates error over time. Methods: We propose Point-Seg, a transformer-based segmentation framework that integrates point tracking as a temporal cue to ensure stable and consistent segmentation of myocardium across frames. Our method leverages a point-tracking module trained on a synthetic echocardiography dataset to track key anatomical landmarks across video sequences. These tracked trajectories provide an explicit motion-aware signal that guides segmentation, reducing drift and eliminating the need for memory-based feature accumulation. Additionally, we incorporate a temporal smoothing loss to further enhance temporal consistency across frames. Results: We evaluate our approach on both public and private echocardiography datasets. Experimental results demonstrate that Point-Seg has statistically similar accuracy in terms of Dice to state-of-the-art segmentation models in high quality echo data, while it achieves better segmentation accuracy in lower quality echo with improved temporal stability. Furthermore, Point-Seg has the key advantage of pixel-level myocardium motion information as opposed to other segmentation methods. Such information is essential in the computation of other downstream tasks such as myocardial strain measurement and regional wall motion abnormality detection. Conclusion: Point-Seg demonstrates that point tracking can serve as an effective temporal cue for consistent video segmentation, offering a reliable and generalizable approach for myocardium segmentation in echocardiography videos. The code is available at this https URL.
https://arxiv.org/abs/2601.09207
In this work, we propose N-EIoU YOLOv9, a lightweight detection framework based on a signal-aware bounding box regression loss derived from non-monotonic gradient focusing and geometric decoupling principles, referred to as N-EIoU (Non-monotonic Efficient Intersection over Union). The proposed loss reshapes localization gradients by combining non-monotonic focusing with decoupled width and height optimization, thereby enhancing weak regression signals for hard samples with low overlap while reducing gradient interference. This design is particularly effective for small and low-contrast targets commonly observed in agricultural disease imagery. The proposed N-EIoU loss is integrated into a lightweight YOLOv9t architecture and evaluated on a self-collected field dataset comprising 5908 rice leaf images across four disease categories and healthy leaves. Experimental results demonstrate consistent performance gains over the standard CIoU loss, achieving a mean Average Precision of 90.3%, a 4.3% improvement over the baseline, with improved localization accuracy under stricter evaluation criteria. For practical validation, the optimized model is deployed on an Android device using TensorFlow Lite with Float16 quantization, achieving an average inference time of 156 milliseconds per frame while maintaining accuracy. These results confirm that the proposed approach effectively balances accuracy, optimization stability, and computational efficiency for edge-based agricultural monitoring systems.
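The ingredients of the loss described above can be illustrated with a standard EIoU-style computation: an IoU term, a normalized center-distance term, and decoupled width and height terms, multiplied by a non-monotonic focusing weight. The paper's exact N-EIoU definition is not reproduced here; the Gaussian-shaped `focus` factor and the `beta` parameter below are hypothetical placeholders for the gradient-focusing idea:

```python
import numpy as np

def neiou_style_loss(box_a, box_b, beta=0.5):
    """Sketch of an EIoU-style box loss with decoupled width/height
    terms and a hypothetical non-monotonic focusing weight.

    Boxes are axis-aligned (x1, y1, x2, y2).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection area and IoU.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Smallest enclosing box, used to normalize the penalty terms.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    # Center-distance term plus decoupled width and height terms (EIoU).
    d2 = ((ax1 + ax2 - bx1 - bx2) / 2) ** 2 + ((ay1 + ay2 - by1 - by2) / 2) ** 2
    dist = d2 / (cw ** 2 + ch ** 2)
    dw = ((ax2 - ax1) - (bx2 - bx1)) ** 2 / cw ** 2
    dh = ((ay2 - ay1) - (by2 - by1)) ** 2 / ch ** 2
    base = 1.0 - iou + dist + dw + dh
    # Non-monotonic focusing: weight peaks near moderate overlap (iou
    # close to beta), boosting weak regression signals for hard samples.
    focus = np.exp(-((iou - beta) ** 2) / 0.1)
    return base * (1.0 + focus)
```

Decoupling width and height (rather than penalizing aspect ratio jointly, as CIoU does) avoids the gradient interference the abstract mentions, since each dimension receives its own penalty term.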
https://arxiv.org/abs/2601.09170
Zero-Shot Anomaly Detection (ZSAD) leverages Vision-Language Models (VLMs) to enable supervision-free industrial inspection. However, existing ZSAD paradigms are constrained by single visual backbones, which struggle to balance global semantic generalization with fine-grained structural discriminability. To bridge this gap, we propose Synergistic Semantic-Visual Prompting (SSVP), which efficiently fuses diverse visual encodings to elevate the model's fine-grained perception. Specifically, SSVP introduces the Hierarchical Semantic-Visual Synergy (HSVS) mechanism, which deeply integrates DINOv3's multi-scale structural priors into the CLIP semantic space. Subsequently, the Vision-Conditioned Prompt Generator (VCPG) employs cross-modal attention to guide dynamic prompt generation, enabling linguistic queries to precisely anchor to specific anomaly patterns. Furthermore, to address the discrepancy between global scoring and local evidence, the Visual-Text Anomaly Mapper (VTAM) establishes a dual-gated calibration paradigm. Extensive evaluations on seven industrial benchmarks validate the robustness of our method; SSVP achieves state-of-the-art performance with 93.0\% Image-AUROC and 92.2\% Pixel-AUROC on MVTec-AD, significantly outperforming existing zero-shot approaches.
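The basic ZSAD scoring principle that SSVP builds on can be sketched generically: compare an image (or patch) embedding against text embeddings for "normal" and "anomalous" prompts and take a temperature-scaled softmax over the cosine similarities. This is a minimal sketch of the standard CLIP-style mechanism only; SSVP's prompt generation (VCPG) and dual-gated calibration (VTAM) are more involved, and the embeddings and `tau` here are assumed inputs:

```python
import numpy as np

def zero_shot_anomaly_score(img_emb, normal_emb, anomal_emb, tau=0.07):
    """CLIP-style zero-shot anomaly probability.

    Returns the softmax weight on the "anomalous" text embedding,
    computed from temperature-scaled cosine similarities.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    s_normal = cos(img_emb, normal_emb) / tau
    s_anomal = cos(img_emb, anomal_emb) / tau
    # Numerically stable two-way softmax.
    m = max(s_normal, s_anomal)
    e_n, e_a = np.exp(s_normal - m), np.exp(s_anomal - m)
    return e_a / (e_n + e_a)
```

Applied per image embedding this yields a global anomaly score; applied per patch embedding it yields a coarse anomaly map, which is the gap between global scoring and local evidence that VTAM's calibration targets.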
https://arxiv.org/abs/2601.09147