We propose to bridge the gap between semi-supervised and unsupervised image recognition with a flexible method that performs well for both generalized category discovery (GCD) and image clustering. Despite the overlap in motivation between these tasks, the methods themselves are restricted to a single task: GCD methods rely on the labeled portion of the data, and deep image clustering methods have no built-in way to leverage the labels efficiently. We connect the two regimes with an innovative approach that Utilizes Neighbor Information for Classification (UNIC) in both the unsupervised (clustering) and the semi-supervised (GCD) setting. State-of-the-art clustering methods already rely heavily on nearest neighbors. We improve on their results substantially in two ways: first, with a sampling and cleaning strategy that identifies accurate positive and negative neighbors, and second, by fine-tuning the backbone with clustering losses computed by sampling both types of neighbors. We then adapt this pipeline to GCD by utilizing the labeled images as ground-truth neighbors. Our method yields state-of-the-art results for both clustering (+3% ImageNet-100, ImageNet-200) and GCD (+0.8% ImageNet-100, +5% CUB, +2% SCars, +4% Aircraft).
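A minimal sketch of the neighbor sampling and cleaning step described above, assuming frozen, L2-normalized backbone features; the k values and the reciprocal-kNN cleaning rule are illustrative assumptions, not the authors' exact criteria.

```python
# Sketch: sample positive and negative neighbors from frozen features, then
# "clean" positives by keeping only mutual (reciprocal) nearest neighbors.
import numpy as np

def sample_neighbors(features, k_pos=5, k_neg=50):
    """features: (N, D) L2-normalized embeddings from a pretrained backbone."""
    sim = features @ features.T                      # cosine similarity
    np.fill_diagonal(sim, -np.inf)                   # exclude self-matches
    order = np.argsort(-sim, axis=1)                 # most similar first
    pos_candidates = order[:, :k_pos]                # nearest neighbors
    negatives = order[:, -k_neg:]                    # most dissimilar samples

    # Cleaning step (assumed criterion): keep only reciprocal neighbors.
    positives = []
    for i in range(features.shape[0]):
        mutual = [j for j in pos_candidates[i] if i in pos_candidates[j]]
        positives.append(np.array(mutual, dtype=int))
    return positives, negatives

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(200, 64)).astype(np.float32)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    pos, neg = sample_neighbors(feats)
    print(len(pos[0]), neg.shape)
```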
https://arxiv.org/abs/2503.14500
Automated feature engineering plays a critical role in improving predictive model performance for tabular learning tasks. Traditional automated feature engineering methods are limited by their reliance on pre-defined transformations within fixed, manually designed search spaces, often neglecting domain knowledge. Recent advances using Large Language Models (LLMs) have enabled the integration of domain knowledge into the feature engineering process. However, existing LLM-based approaches use direct prompting or rely solely on validation scores for feature selection, failing to leverage insights from prior feature discovery experiments or establish meaningful reasoning between feature generation and data-driven performance. To address these challenges, we propose LLM-FE, a novel framework that combines evolutionary search with the domain knowledge and reasoning capabilities of LLMs to automatically discover effective features for tabular learning tasks. LLM-FE formulates feature engineering as a program search problem, where LLMs propose new feature transformation programs iteratively, and data-driven feedback guides the search process. Our results demonstrate that LLM-FE consistently outperforms state-of-the-art baselines, significantly enhancing the performance of tabular prediction models across diverse classification and regression benchmarks.
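As a rough illustration of the search loop described above, here is a minimal sketch, not the LLM-FE implementation; `llm_propose_program` is a hypothetical placeholder for the actual LLM call, and the exemplar-conditioning and scoring details are assumptions.

```python
# Sketch: feature engineering as evolutionary program search with
# validation-score feedback guiding which programs seed the next prompt.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def llm_propose_program(exemplars):
    """Placeholder for an LLM call that returns Python source defining
    `transform(X) -> X_new`, conditioned on the best programs so far."""
    return ("def transform(X):\n"
            "    import numpy as np\n"
            "    return np.hstack([X, X[:, :1] * X[:, 1:2]])\n")

def evaluate_program(source, X, y):
    scope = {}
    exec(source, scope)                              # compile the proposed program
    X_new = scope["transform"](X)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(model, X_new, y, cv=3).mean()

def evolve(X, y, generations=3, population=4):
    pool = []                                        # (score, program) pairs
    for _ in range(generations):
        exemplars = [p for _, p in sorted(pool, reverse=True)[:2]]
        for _ in range(population):
            prog = llm_propose_program(exemplars)
            pool.append((evaluate_program(prog, X, y), prog))
    return max(pool)                                 # best (score, program)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5)); y = (X[:, 0] * X[:, 1] > 0).astype(int)
    score, best = evolve(X, y)
    print(round(score, 3))
```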
https://arxiv.org/abs/2503.14434
Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks like classification, segmentation, and depth estimation. This led us to ask: Could similar success be achieved when the pool of teachers also includes vision models specialized in diverse tasks across both 2D and 3D perception? In this paper, we define and investigate the problem of heterogeneous teacher distillation, or co-distillation, a challenging multi-teacher distillation scenario where teacher models vary significantly in both (a) their design objectives and (b) the data they were trained on. We explore data-sharing strategies and teacher-specific encoding, and introduce DUNE, a single encoder excelling in 2D vision, 3D understanding, and 3D human perception. Our model achieves performance comparable to that of its larger teachers, sometimes even outperforming them, on their respective tasks. Notably, DUNE surpasses MASt3R in Map-free Visual Relocalization with a much smaller encoder.
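A minimal sketch of the heterogeneous-teacher setup under stated assumptions: one shared student encoder with teacher-specific projection heads regresses each frozen teacher's features on the data routed to it. The toy encoder, dimensions, and MSE objective are placeholders, not DUNE's actual design.

```python
# Sketch: co-distillation from several frozen teachers into a single student
# encoder via per-teacher projection heads and teacher-specific data batches.
import torch
import torch.nn as nn

class CoDistillStudent(nn.Module):
    def __init__(self, student_dim=256, teacher_dims=(384, 768, 1024)):
        super().__init__()
        self.encoder = nn.Sequential(                # stand-in for a ViT backbone
            nn.Flatten(), nn.Linear(3 * 32 * 32, student_dim), nn.GELU())
        self.heads = nn.ModuleList(                  # one projector per teacher
            [nn.Linear(student_dim, d) for d in teacher_dims])

    def forward(self, x, teacher_idx):
        return self.heads[teacher_idx](self.encoder(x))

def distill_step(student, batches, teachers, optimizer):
    """batches[i] is the image batch routed to teacher i (data-sharing strategy)."""
    loss = 0.0
    for i, (x, teacher) in enumerate(zip(batches, teachers)):
        with torch.no_grad():
            target = teacher(x)                      # frozen teacher features
        pred = student(x, teacher_idx=i)
        loss = loss + nn.functional.mse_loss(pred, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return float(loss)

if __name__ == "__main__":
    dims = (384, 768, 1024)
    teachers = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, d)) for d in dims]
    student = CoDistillStudent(teacher_dims=dims)
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    batches = [torch.randn(4, 3, 32, 32) for _ in dims]
    print(distill_step(student, batches, teachers, opt))
```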
https://arxiv.org/abs/2503.14405
Despite the growing scale of medical Vision-Language datasets, the impact of dataset quality on model performance remains under-explored. We introduce Open-PMC, a high-quality medical dataset from PubMed Central, containing 2.2 million image-text pairs, enriched with image modality annotations, subfigures, and summarized in-text references. Notably, the in-text references provide richer medical context, extending beyond the abstract information typically found in captions. Through extensive experiments, we benchmark Open-PMC against larger datasets across retrieval and zero-shot classification tasks. Our results show that dataset quality, not just size, drives significant performance gains. We complement our benchmark with an in-depth analysis of feature representation. Our findings highlight the crucial role of data curation quality in advancing multimodal medical AI. We release Open-PMC, along with the trained models and our codebase.
https://arxiv.org/abs/2503.14377
Medical image segmentation aims to identify anatomical structures at the voxel level. Segmentation accuracy relies on distinguishing voxel differences. Compared with the advances made in studying inter-class variance, intra-class variance has received less attention. Moreover, traditional linear classifiers, limited to a single learnable weight per class, struggle to capture this finer distinction. To address these challenges, we propose a Multi-Prototype-based Embedding Refinement method for semi-supervised medical image segmentation. Specifically, we design a multi-prototype-based classification strategy, rethinking segmentation from the perspective of structural relationships between voxel embeddings. Intra-class variations are explored by clustering voxels along the distribution of multiple prototypes in each class. Next, we introduce a consistency constraint to alleviate the limitation of linear classifiers. This constraint integrates the different classification granularities of a linear classifier and the proposed prototype-based classifier. In thorough evaluations on two popular benchmarks, our method achieves superior performance compared with state-of-the-art methods. Code is available at this https URL.
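A minimal sketch of the multi-prototype idea, assuming per-class prototypes obtained by k-means over voxel embeddings and nearest-prototype classification by cosine similarity; the prototype count is an assumption, and the consistency constraint with the linear classifier is not reproduced here.

```python
# Sketch: several prototypes per class capture intra-class variance; a voxel is
# assigned to the class of its most similar prototype.
import numpy as np
from sklearn.cluster import KMeans

def build_prototypes(embeddings, labels, num_classes, protos_per_class=3):
    prototypes, proto_labels = [], []
    for c in range(num_classes):
        emb_c = embeddings[labels == c]
        km = KMeans(n_clusters=protos_per_class, n_init=10, random_state=0).fit(emb_c)
        prototypes.append(km.cluster_centers_)        # intra-class structure
        proto_labels.extend([c] * protos_per_class)
    return np.vstack(prototypes), np.array(proto_labels)

def classify(embeddings, prototypes, proto_labels):
    # Cosine similarity to every prototype; predict the class of the best match.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return proto_labels[np.argmax(e @ p.T, axis=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(500, 16)); lab = rng.integers(0, 2, size=500)
    protos, plabs = build_prototypes(emb, lab, num_classes=2)
    print(classify(emb[:5], protos, plabs))
```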
https://arxiv.org/abs/2503.14343
Chinese porcelain holds immense historical and cultural value, making its accurate classification essential for archaeological research and cultural heritage preservation. Traditional classification methods rely heavily on expert analysis, which is time-consuming, subjective, and difficult to scale. This paper explores the application of deep learning (DL) and transfer learning techniques to automate the classification of porcelain artifacts across four key attributes: dynasty, glaze, ware, and type. We evaluate four Convolutional Neural Networks (CNNs): ResNet50, MobileNetV2, VGG16, and InceptionV3, comparing their performance with and without pre-trained weights. Our results demonstrate that transfer learning significantly enhances classification accuracy, particularly for complex tasks like type classification, where models trained from scratch perform worse. MobileNetV2 and ResNet50 consistently achieve high accuracy and robustness across all tasks, while VGG16 struggles with the more diverse classifications. We further discuss the impact of dataset limitations and propose future directions, including domain-specific pre-training, integration of attention mechanisms, explainable AI methods, and generalization to other cultural artifacts.
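A minimal sketch of the with/without pre-training comparison, assuming a torchvision ResNet50 whose final layer is replaced with a head for one porcelain attribute (e.g. dynasty); the training loop and data pipeline are omitted.

```python
# Sketch: build the same backbone either with ImageNet weights (transfer
# learning) or from scratch, and attach a new classification head.
import torch
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes, pretrained=True):
    weights = models.ResNet50_Weights.IMAGENET1K_V2 if pretrained else None
    model = models.resnet50(weights=weights)
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new attribute head
    return model

if __name__ == "__main__":
    # pretrained=True downloads ImageNet weights; False keeps the check offline.
    model = build_classifier(num_classes=5, pretrained=False)
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)
```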
https://arxiv.org/abs/2503.14231
Automatic anatomical landmark localization in medical imaging requires not just accurate predictions but reliable uncertainty quantification for effective clinical decision support. Current uncertainty quantification approaches often fall short, particularly when combined with normality assumptions, systematically underestimating total predictive uncertainty. This paper introduces conformal prediction as a framework for reliable uncertainty quantification in anatomical landmark localization, addressing a critical gap in automatic landmark localization. We present two novel approaches guaranteeing finite-sample validity for multi-output prediction: Multi-output Regression-as-Classification Conformal Prediction (M-R2CCP) and its variant Multi-output Regression to Classification Conformal Prediction set to Region (M-R2C2R). Unlike conventional methods that produce axis-aligned hyperrectangular or ellipsoidal regions, our approaches generate flexible, non-convex prediction regions that better capture the underlying uncertainty structure of landmark predictions. Through extensive empirical evaluation across multiple 2D and 3D datasets, we demonstrate that our methods consistently outperform existing multi-output conformal prediction approaches in both validity and efficiency. This work represents a significant advancement in reliable uncertainty estimation for anatomical landmark localization, providing clinicians with trustworthy confidence measures for their diagnoses. While developed for medical imaging, these methods show promise for broader applications in multi-output regression problems.
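For readers unfamiliar with conformal guarantees, a minimal sketch of generic split conformal prediction for multi-output regression follows; it produces a simple ball-shaped region and is meant only to illustrate finite-sample coverage, not the non-convex regions of M-R2CCP or M-R2C2R.

```python
# Sketch: calibrate a radius on held-out data so the prediction region
# ||y - y_hat|| <= r covers the true landmark with probability >= 1 - alpha.
import numpy as np

def split_conformal_radius(cal_preds, cal_targets, alpha=0.1):
    scores = np.linalg.norm(cal_targets - cal_preds, axis=1)   # nonconformity
    n = len(scores)
    q = np.ceil((n + 1) * (1 - alpha)) / n                     # finite-sample quantile
    return np.quantile(scores, min(q, 1.0), method="higher")

def covers(test_pred, test_target, radius):
    return np.linalg.norm(test_target - test_pred) <= radius

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y_true = rng.normal(size=(1000, 3))                        # e.g. a 3D landmark
    y_pred = y_true + 0.3 * rng.normal(size=(1000, 3))         # noisy predictions
    r = split_conformal_radius(y_pred[:500], y_true[:500], alpha=0.1)
    hits = [covers(p, t, r) for p, t in zip(y_pred[500:], y_true[500:])]
    print(round(float(np.mean(hits)), 3), "empirical coverage; radius =", round(float(r), 3))
```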
https://arxiv.org/abs/2503.14106
Large language models (LLMs) have unlocked new possibilities for generating synthetic training data in both natural language and code. By producing artificial but task-relevant examples, these models can significantly augment or even replace real-world datasets, especially when labeled data is scarce or sensitive. This paper surveys recent advances in using LLMs to create synthetic text and code, emphasizing prompt-based generation, retrieval-augmented pipelines, and iterative self-refinement. We show how these methods enrich low-resource tasks such as classification and question answering, as well as code-centric applications such as instruction tuning, code translation, and bug repair, by enabling automated verification of functional correctness. Alongside potential benefits like cost-effectiveness, broad coverage, and controllable diversity, we address challenges such as factual inaccuracies in generated text, lack of stylistic realism, and the risk of bias amplification. Proposed mitigations include filtering and weighting outputs and reinforcement learning with execution feedback for code. We conclude with open research directions like automated prompt engineering, cross-modal data synthesis, and robust evaluation frameworks, highlighting the importance of LLM-generated synthetic data in advancing AI while emphasizing ethical and quality safeguards.
https://arxiv.org/abs/2503.14023
The BI_RADS score is a probabilistic reporting tool used by radiologists to express the level of uncertainty in predicting breast cancer based on morphological features in mammography images. There is significant variability in describing masses, which sometimes leads to BI_RADS misclassification. A BI_RADS prediction system is therefore needed to support the radiologist's final decision. In this study, the uncertainty information extracted by a Bayesian deep learning model is utilized to predict the BI_RADS score. The investigation results based on the pathology information demonstrate that the f1-scores of the radiologist's predictions are 42.86%, 48.33% and 48.28%, while the f1-scores of the model are 73.33%, 59.60% and 59.26% on the BI_RADS 2, 3 and 5 dataset samples, respectively. Also, the model can distinguish malignant from benign samples in the BI_RADS 0 category of the used dataset with an accuracy of 75.86% and correctly identifies all malignant samples as BI_RADS 5. The Grad-CAM visualization shows that the model pays attention to the morphological features of the lesions. Therefore, this study shows that an uncertainty-aware Bayesian deep learning model can report its uncertainty about the malignancy of a lesion based on morphological features, much like a radiologist.
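A minimal sketch of one common way to obtain predictive uncertainty from a neural classifier, Monte Carlo dropout; the paper's actual Bayesian deep learning model and its mapping from uncertainty to BI_RADS categories are not reproduced here.

```python
# Sketch: keep dropout active at inference and use the spread of repeated
# stochastic predictions as an uncertainty estimate for malignancy.
import torch
import torch.nn as nn

class MCDropoutNet(nn.Module):
    def __init__(self, in_dim=512, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Dropout(p),
            nn.Linear(128, 2))                       # benign vs. malignant logits

    def forward(self, x):
        return self.net(x)

def predict_with_uncertainty(model, x, n_samples=30):
    model.train()                                    # keep dropout active at test time
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0), probs.std(dim=0)       # predictive mean and spread

if __name__ == "__main__":
    model = MCDropoutNet()
    features = torch.randn(4, 512)                   # stand-in for mammogram features
    mean, std = predict_with_uncertainty(model, features)
    print(mean[:, 1], std[:, 1])                     # malignancy prob. and uncertainty
```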
https://arxiv.org/abs/2503.13999
Top-leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large amounts of GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DIFFVSGG, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs), which generate images by denoising a latent feature embedding, we unify the decoding of three tasks, object classification, bounding box regression, and graph generation, using one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct step-wise denoising on it within the LDM, so as to deliver a clean embedding that clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs, to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DIFFVSGG.
https://arxiv.org/abs/2503.13957
This research proposes a very lightweight model, "Fibonacci-Net", along with a novel pooling technique, for automatic brain tumor classification from imbalanced Magnetic Resonance Imaging (MRI) datasets. Automatic brain tumor detection from MRI datasets has garnered significant attention in the research community since the inception of Convolutional Neural Network (CNN) models. However, the performance of conventional CNN models is hindered by class imbalance problems. The novelties of this work are as follows: (I) A lightweight CNN model is proposed in which the number of filters in the different convolutional layers is chosen according to the Fibonacci series. (II) In the last two blocks of the proposed model, depth-wise separable convolution (DWSC) layers are employed to considerably reduce the computational complexity of the model. (III) Two parallel concatenations (or skip connections) are deployed from the 2nd to the 4th and from the 3rd to the 5th convolutional blocks of the proposed Fibonacci-Net. These skip connections incorporate a novel Average-2Max pooling layer that produces two stacks of convolved output with slightly different statistics. This parallel concatenation block therefore works as an efficient feature augmenter inside the model, automatically alleviating the class imbalance problem to a certain extent. For validation, we have implemented the proposed framework on three MRI datasets that are highly class-imbalanced. (a) The first dataset has four classes: glioma tumor, meningioma tumor, pituitary tumor, and no tumor. (b) The second and third MRI datasets have 15 and 44 classes, respectively. Experimental results reveal that the proposed Fibonacci-Net achieves 96.2% accuracy, 97.17% precision, 95.9% recall, 96.5% F1 score, and 99.9% specificity on the most challenging "44-class MRI dataset".
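A minimal sketch of two of the ideas above under stated assumptions: convolutional blocks whose filter counts follow the Fibonacci series, and an Average-2Max pooling module read here as emitting an average-pooled and a max-pooled copy of the same feature map; the skip-connection wiring of the full Fibonacci-Net is not reproduced.

```python
# Sketch: Fibonacci filter counts plus a pooling module that produces two
# stacks of the same features with slightly different statistics.
import torch
import torch.nn as nn

FIB_FILTERS = [8, 13, 21, 34, 55]                    # Fibonacci numbers as filter counts

class Average2MaxPool(nn.Module):
    """Returns two pooled copies of the input with slightly different statistics."""
    def __init__(self, kernel=2):
        super().__init__()
        self.avg = nn.AvgPool2d(kernel)
        self.max = nn.MaxPool2d(kernel)

    def forward(self, x):
        return self.avg(x), self.max(x)

class FibonacciBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(), nn.MaxPool2d(2))

    def forward(self, x):
        return self.conv(x)

if __name__ == "__main__":
    x = torch.randn(2, 1, 128, 128)                  # grayscale MRI slice batch
    blocks, in_ch = [], 1
    for out_ch in FIB_FILTERS:
        blocks.append(FibonacciBlock(in_ch, out_ch))
        in_ch = out_ch
    feats = nn.Sequential(*blocks)(x)
    avg_stack, max_stack = Average2MaxPool()(feats)
    fused = torch.cat([avg_stack, max_stack], dim=1)  # feature augmentation
    print(feats.shape, fused.shape)
```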
https://arxiv.org/abs/2503.13928
World models significantly enhance hierarchical understanding, improving data integration and learning efficiency. To explore the potential of the world model in the remote sensing (RS) field, this paper proposes a label-efficient remote sensing world model for multimodal data fusion (FusDreamer). FusDreamer uses the world model as a unified representation container to abstract common and high-level knowledge, promoting interactions across different types of data, i.e., hyperspectral (HSI), light detection and ranging (LiDAR), and text data. Initially, a new latent diffusion fusion and multimodal generation paradigm (LaMG) is utilized for its exceptional information-integration and detail-retention capabilities. Subsequently, an open-world knowledge-guided consistency projection (OK-CP) module incorporates prompt representations for visually described objects and aligns language-visual features through contrastive learning. In this way, the domain gap can be bridged by fine-tuning the pre-trained world models with limited samples. Finally, an end-to-end multitask combinatorial optimization (MuCO) strategy captures slight feature bias and constrains the diffusion process in a collaboratively learnable direction. Experiments conducted on four typical datasets indicate the effectiveness and advantages of the proposed FusDreamer. The corresponding code will be released at this https URL.
https://arxiv.org/abs/2503.13814
We introduce the 8-Calves dataset, a benchmark for evaluating object detection and identity classification in occlusion-rich, temporally consistent environments. The dataset comprises a 1-hour video (67,760 frames) of eight Holstein Friesian calves in a barn, with ground-truth bounding boxes and identities, alongside 900 static frames for detection tasks. Each calf exhibits a unique coat pattern, enabling precise identity distinction. For cow detection, we fine-tuned 28 models (25 YOLO variants, 3 transformers) on 600 frames, testing on the full video. Results reveal that smaller YOLO models (e.g., YOLOv9c) outperform larger counterparts despite potential bias from a YOLOv8m-based labeling pipeline. For identity classification, embeddings from 23 pretrained vision models (ResNet, ConvNextV2, ViTs) were evaluated via linear classifiers and KNN. Modern architectures like ConvNextV2 excelled, while larger models frequently overfit, highlighting inefficiencies in scaling. Key findings include: (1) minimal, targeted augmentations (e.g., rotation) outperform complex strategies on simpler datasets; (2) pretraining strategies (e.g., BEiT, DinoV2) significantly boost identity recognition; (3) temporal continuity and natural motion patterns offer unique challenges absent in synthetic or domain-specific benchmarks. The dataset's controlled design and extended sequences (1 hour vs. prior 10-minute benchmarks) make it a pragmatic tool for stress-testing occlusion handling, temporal consistency, and efficiency. The link to the dataset is this https URL.
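A minimal sketch of the identity-classification protocol, assuming precomputed frozen embeddings evaluated with a linear probe and KNN; the embeddings below are random stand-ins for features from models such as ConvNextV2 or DinoV2.

```python
# Sketch: evaluate frozen embeddings with a linear classifier and KNN.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def evaluate_embeddings(embeddings, identities):
    X_tr, X_te, y_tr, y_te = train_test_split(
        embeddings, identities, test_size=0.3, random_state=0, stratify=identities)
    linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
    return (accuracy_score(y_te, linear.predict(X_te)),
            accuracy_score(y_te, knn.predict(X_te)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for pretrained-model embeddings of calf crops (8 identities).
    emb = rng.normal(size=(800, 256)); ids = rng.integers(0, 8, size=800)
    lin_acc, knn_acc = evaluate_embeddings(emb, ids)
    print(f"linear probe: {lin_acc:.3f}  knn: {knn_acc:.3f}")
```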
https://arxiv.org/abs/2503.13777
Numerous maritime applications rely on the ability to recognize acoustic targets using passive sonar. While there is a growing reliance on pre-trained models for classification tasks, these models often require extensive computational resources and may not perform optimally when transferred to new domains due to dataset variations. To address these challenges, this work adapts the neural edge histogram descriptors (NEHD) method, originally developed for image classification, to classify passive sonar signals. We conduct a comprehensive evaluation of statistical and structural texture features, demonstrating that their combination achieves competitive performance with large pre-trained models. The proposed NEHD-based approach offers a lightweight and efficient solution for underwater target recognition, significantly reducing computational costs while maintaining accuracy.
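As a rough illustration of the structural texture features involved, the sketch below computes classical edge-orientation histograms over patches of a spectrogram; NEHD's learnable, neural formulation and its statistical-texture counterpart are not reproduced.

```python
# Sketch: classical edge-histogram descriptors over spectrogram patches.
import numpy as np

def edge_histogram(patch, n_bins=8):
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)          # orientation in [0, pi)
    hist, _ = np.histogram(ang, bins=n_bins, range=(0, np.pi), weights=mag)
    return hist / (hist.sum() + 1e-8)

def spectrogram_descriptor(spec, patch=16):
    h, w = spec.shape
    feats = [edge_histogram(spec[i:i + patch, j:j + patch])
             for i in range(0, h - patch + 1, patch)
             for j in range(0, w - patch + 1, patch)]
    return np.concatenate(feats)                     # structural texture features

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spec = rng.random((128, 64))                     # stand-in passive-sonar spectrogram
    print(spectrogram_descriptor(spec).shape)
```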
https://arxiv.org/abs/2503.13763
In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, limiting the scalability of current methods. Current state-of-the-art models ignore the shifts in event distributions over time, reducing their ability to adjust to changing video dynamics. Additionally, previous methods rely on late fusion to combine audio and visual information. While straightforward, this approach results in a significant loss of multimodal interactions. To address these challenges, we propose Audio-Visual Adaptive Video Analysis ($\text{AV}^2\text{A}$), a model-agnostic approach that requires no further training and integrates a score-level fusion technique to retain richer multimodal interactions. $\text{AV}^2\text{A}$ also includes a within-video label shift algorithm, leveraging input video data and predictions from prior frames to dynamically adjust event distributions for subsequent frames. Moreover, we present the first training-free, open-vocabulary baseline for audio-visual event perception, demonstrating that $\text{AV}^2\text{A}$ achieves substantial improvements over naive training-free baselines. We demonstrate the effectiveness of $\text{AV}^2\text{A}$ on both zero-shot and weakly-supervised state-of-the-art methods, achieving notable improvements in performance metrics over existing approaches.
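A minimal sketch of the two mechanisms named above, under illustrative assumptions: score-level fusion of per-frame audio and visual class probabilities, and a simple within-video label-shift update that reweights the current frame by a running distribution of earlier predictions. This is an interpretation for illustration, not the $\text{AV}^2\text{A}$ algorithm.

```python
# Sketch: score-level audio-visual fusion plus a within-video prior update.
import numpy as np

def fuse_and_adapt(audio_scores, visual_scores, momentum=0.9, w_audio=0.5):
    """audio_scores, visual_scores: (T, C) per-frame class probabilities."""
    T, C = audio_scores.shape
    prior = np.full(C, 1.0 / C)                      # running event distribution
    preds = []
    for t in range(T):
        fused = w_audio * audio_scores[t] + (1 - w_audio) * visual_scores[t]
        adjusted = fused * prior                     # reweight by past predictions
        adjusted /= adjusted.sum()
        preds.append(int(np.argmax(adjusted)))
        prior = momentum * prior + (1 - momentum) * adjusted
    return preds

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.dirichlet(np.ones(4), size=20)           # toy audio probabilities
    v = rng.dirichlet(np.ones(4), size=20)           # toy visual probabilities
    print(fuse_and_adapt(a, v))
```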
https://arxiv.org/abs/2503.13693
We investigate the impact of robot appearance on users' spoken behavior during real-world interactions by comparing a human-like android, ERICA, with a less anthropomorphic humanoid, TELECO. Analyzing data from 42 participants at SIGDIAL 2024, we extracted linguistic features such as disfluencies and syntactic complexity from conversation transcripts. The results showed moderate effect sizes, suggesting that participants produced fewer disfluencies and employed more complex syntax when interacting with ERICA. Further analysis, involving the training of classification models such as Naïve Bayes (which achieved an F1-score of 71.60%) and a feature importance analysis, highlighted the significant role of disfluencies and syntactic complexity in interactions with robots of varying human-like appearance. Discussing these findings within the frameworks of cognitive load and Communication Accommodation Theory, we conclude that designing robots to elicit more structured and fluent user speech can enhance their communicative alignment with humans.
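A minimal sketch of the analysis pipeline, with crude stand-ins for the linguistic features: a filler-word rate as a disfluency proxy and a subordinator rate as a syntactic-complexity proxy, fed to a Naïve Bayes classifier; the actual feature set, transcripts, and labels are the authors'.

```python
# Sketch: transcript-level features -> Naive Bayes classifier for which robot
# the speaker addressed, evaluated with cross-validated F1.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

FILLERS = {"uh", "um", "er", "hmm"}

def transcript_features(text):
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    disfluencies = sum(t.strip(".,") in FILLERS for t in tokens) / n
    mean_word_len = np.mean([len(t) for t in tokens]) if tokens else 0.0
    subordinators = sum(t in {"because", "although", "which", "that"} for t in tokens) / n
    return [disfluencies, mean_word_len, subordinators]  # crude complexity proxies

if __name__ == "__main__":
    transcripts = ["um I think that the robot uh looks very realistic",
                   "it is nice", "I liked it because it spoke clearly although slowly",
                   "uh hmm okay"]
    labels = [1, 0, 1, 0]                                # 1 = ERICA, 0 = TELECO (toy labels)
    X = np.array([transcript_features(t) for t in transcripts])
    scores = cross_val_score(GaussianNB(), X, labels, cv=2, scoring="f1")
    print(scores.mean())
```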
https://arxiv.org/abs/2503.13625
With the rise of neural networks, especially in high-stakes applications, these networks need two properties to ensure their safety: (i) robustness and (ii) interpretability. Recent advances in classifiers with 3D volumetric object representations have demonstrated greatly enhanced robustness on out-of-distribution data. However, these 3D-aware classifiers have not been studied from the perspective of interpretability. We introduce CAVE, Concept Aware Volumes for Explanations, a new direction that unifies interpretability and robustness in image classification. We design an inherently interpretable and robust classifier by extending existing 3D-aware classifiers with concepts extracted from their volumetric representations for classification. Across an array of quantitative interpretability metrics, we compare against different concept-based approaches from the explainable-AI literature and show that CAVE discovers well-grounded concepts that are used consistently across images, while achieving superior robustness.
https://arxiv.org/abs/2503.13429
Patient matching is the process of linking patients to appropriate clinical trials by accurately identifying and matching their medical records with trial eligibility criteria. We propose LLM-Match, a novel framework for patient matching leveraging fine-tuned open-source large language models. Our approach consists of four key components. First, a retrieval-augmented generation (RAG) module extracts relevant patient context from a vast pool of electronic health records (EHRs). Second, a prompt generation module constructs input prompts by integrating trial eligibility criteria (both inclusion and exclusion criteria), patient context, and system instructions. Third, a fine-tuning module with a classification head optimizes the model parameters using structured prompts and ground-truth labels. Fourth, an evaluation module assesses the fine-tuned model's performance on the testing datasets. We evaluated LLM-Match on four open datasets, n2c2, SIGIR, TREC 2021, and TREC 2022, using open-source models, comparing it against TrialGPT, Zero-Shot, and GPT-4-based closed models. LLM-Match outperformed all baselines.
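A minimal sketch of the first two components, retrieval and prompt construction, under stated assumptions; the note embeddings are random stand-ins, the prompt wording is illustrative, and the fine-tuning and evaluation modules are not shown.

```python
# Sketch: retrieve the most relevant EHR passages for a trial, then assemble
# a classification prompt from criteria, patient context, and instructions.
import numpy as np

def retrieve(query_vec, note_vecs, notes, top_k=3):
    sims = note_vecs @ query_vec / (
        np.linalg.norm(note_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return [notes[i] for i in np.argsort(-sims)[:top_k]]

def build_prompt(inclusion, exclusion, patient_context):
    return (
        "SYSTEM: Decide whether the patient is eligible for the trial. "
        "Answer ELIGIBLE or NOT ELIGIBLE.\n"
        "INCLUSION CRITERIA:\n- " + "\n- ".join(inclusion) + "\n"
        "EXCLUSION CRITERIA:\n- " + "\n- ".join(exclusion) + "\n"
        "PATIENT CONTEXT:\n" + "\n".join(patient_context) + "\nANSWER:")

if __name__ == "__main__":
    notes = ["Type 2 diabetes diagnosed 2018, on metformin.",
             "No history of cardiac disease.",
             "HbA1c 8.2% at last visit."]
    rng = np.random.default_rng(0)
    note_vecs = rng.normal(size=(3, 32)); query = rng.normal(size=32)  # stand-in embeddings
    context = retrieve(query, note_vecs, notes)
    prompt = build_prompt(
        inclusion=["Adults with type 2 diabetes", "HbA1c > 7.5%"],
        exclusion=["Prior insulin therapy"],
        patient_context=context)
    print(prompt)
```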
https://arxiv.org/abs/2503.13281
Background: The COVID-19 pandemic has overwhelmed healthcare systems, emphasizing the need for AI-driven tools to assist in rapid and accurate patient prognosis. Chest X-ray imaging is a widely available diagnostic tool, but existing methods for prognosis classification lack scalability and efficiency. Objective: This study presents a high-accuracy deep learning model for classifying COVID-19 severity (Mild, Moderate, and Severe) using Chest X-ray images, developed on Microsoft Azure Custom Vision. Methods: Using a dataset of 1,103 confirmed COVID-19 X-ray images from AIforCOVID, we trained and validated a deep learning model leveraging Convolutional Neural Networks (CNNs). The model was evaluated on an unseen dataset to measure accuracy, precision, and recall. Results: Our model achieved an average accuracy of 97%, with specificity of 99%, sensitivity of 87%, and an F1-score of 93.11%. When classifying COVID-19 severity, the model achieved accuracies of 89.03% (Mild), 95.77% (Moderate), and 81.16% (Severe). These results demonstrate the model's potential for real-world clinical applications, aiding in faster decision-making and improved resource allocation. Conclusion: AI-driven prognosis classification using deep learning can significantly enhance COVID-19 patient management, enabling early intervention and efficient triaging. Our study provides a scalable, high-accuracy AI framework for integrating deep learning into routine clinical workflows. Future work should focus on expanding datasets, external validation, and regulatory compliance to facilitate clinical adoption.
https://arxiv.org/abs/2503.13277
Lameness and gait irregularities are significant concerns in equine health management, affecting performance, welfare, and economic value. Traditional observational methods rely on subjective expert assessments, which can lead to inconsistencies in detecting subtle or early-stage lameness. While AI-based approaches have emerged, many require multiple sensors, force plates, or video systems, making them costly and impractical for field deployment. In this applied research study, we present a stride-level classification system that utilizes a single inertial measurement unit (IMU) and a one-dimensional convolutional neural network (1D CNN) to objectively differentiate between sound and lame horses, with a primary focus on the trot gait. The proposed system was tested under real-world conditions, achieving a 90% session-level accuracy with no false positives, demonstrating its robustness for practical applications. By employing a single, non-intrusive, and readily available sensor, our approach significantly reduces the complexity and cost of hardware requirements while maintaining high classification performance. These results highlight the potential of our CNN-based method as a field-tested, scalable solution for automated lameness detection. By enabling early diagnosis, this system offers a valuable tool for preventing minor gait irregularities from developing into severe conditions, ultimately contributing to improved equine welfare and performance in veterinary and equestrian practice.
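A minimal sketch of a stride-level 1D CNN over IMU windows with a majority-vote session decision; the channel count, window length, and architecture are assumptions, not the authors' exact configuration.

```python
# Sketch: each stride is a fixed-length window of IMU channels, classified as
# sound vs. lame; per-stride predictions are aggregated per trot session.
import torch
import torch.nn as nn

class Stride1DCNN(nn.Module):
    def __init__(self, in_channels=6, n_classes=2):   # 3-axis accel + 3-axis gyro
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, padding=3), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                              # x: (batch, channels, samples)
        return self.classifier(self.features(x).squeeze(-1))

def session_prediction(model, stride_windows):
    """Majority vote over per-stride predictions for a whole trot session."""
    with torch.no_grad():
        preds = model(stride_windows).argmax(dim=1)
    return int(preds.float().mean().round())

if __name__ == "__main__":
    model = Stride1DCNN()
    strides = torch.randn(40, 6, 200)                  # 40 strides, 200 samples each
    print(session_prediction(model, strides))
```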
https://arxiv.org/abs/2503.13578