Robotics, autonomous driving, augmented reality, and many embodied computer vision applications must quickly react to user-defined events unfolding in real time. We address this setting by proposing a novel task for multimodal video understanding: Streaming Detection of Queried Event Start (SDQES). The goal of SDQES is to identify the beginning of a complex event as described by a natural language query, with high accuracy and low latency. We introduce a new benchmark based on the Ego4D dataset, as well as new task-specific metrics to study streaming multimodal detection of diverse events in an egocentric video setting. Inspired by parameter-efficient fine-tuning methods from NLP and video tasks, we propose adapter-based baselines that enable image-to-video transfer learning, allowing for efficient online video modeling. We evaluate three vision-language backbones and three adapter architectures in both short-clip and untrimmed video settings.
https://arxiv.org/abs/2412.03567
Hepatocellular carcinoma (HCC) is a common type of liver cancer whose early-stage diagnosis remains challenging, mainly because the manual assessment of hematoxylin and eosin-stained whole slide images is time-consuming and may lead to variability in decision-making. For accurate detection of HCC, we propose a hybrid deep learning-based architecture that uses transfer learning to extract features from pre-trained convolutional neural network (CNN) models and a classifier made up of a sequence of fully connected layers. This study uses the publicly available The Cancer Genome Atlas Hepatocellular Carcinoma (TCGA-LIHC) database (n=491) for model development and the database of Kasturba Gandhi Medical College (KMC), India for validation. The pre-processing step involves patch extraction, colour normalization, and augmentation, resulting in 3,920 patches for the TCGA dataset. The developed hybrid deep neural network, consisting of a CNN-based pre-trained feature extractor and a customized artificial neural network-based classifier, is trained using five-fold cross-validation. For this study, eight different state-of-the-art models are trained and tested as feature extractors for the proposed hybrid model. The proposed hybrid model with a ResNet50-based feature extractor provided sensitivity, specificity, F1-score, accuracy, and AUC of 100.00%, 100.00%, 100.00%, 100.00%, and 1.00, respectively, on the TCGA database. On the KMC database, EfficientNetB3 was the optimal choice of feature extractor, giving sensitivity, specificity, F1-score, accuracy, and AUC of 96.97%, 98.85%, 96.71%, 96.71%, and 0.99, respectively. The proposed hybrid models showed improvements in accuracy of 2% and 4% over the pre-trained models on the TCGA-LIHC and KMC databases, respectively.
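A minimal sketch of the hybrid design described above, assuming a frozen, ImageNet-pretrained ResNet50 feature extractor feeding a small fully connected classifier; the hidden width, dropout, and binary output are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class HybridHCCClassifier(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
        backbone.fc = nn.Identity()          # keep the 2048-d pooled features
        for p in backbone.parameters():      # transfer learning: freeze the extractor
            p.requires_grad = False
        self.backbone = backbone
        self.classifier = nn.Sequential(     # customized ANN head (assumed sizes)
            nn.Linear(2048, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.backbone(x))

model = HybridHCCClassifier()
logits = model(torch.randn(4, 3, 224, 224))  # a batch of H&E-stained patches
```

In this setup only the classifier head is optimized during cross-validation; swapping the backbone (e.g. for EfficientNetB3) only changes the feature dimension fed to the head.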
https://arxiv.org/abs/2412.03084
Artificial intelligence generated content (AIGC), a rapidly advancing technology, is transforming content creation across domains such as text, images, audio, and video. Its growing potential has attracted more and more researchers and investors to explore and expand its possibilities. This review traces AIGC's evolution through four developmental milestones, ranging from early rule-based systems to modern transfer learning models, within a unified framework that highlights how each milestone contributes uniquely to content generation. In particular, the paper employs a common example across all milestones to illustrate the capabilities and limitations of methods within each phase, providing a consistent evaluation of AIGC methodologies and their development. Furthermore, this paper addresses critical challenges associated with AIGC and proposes actionable strategies to mitigate them. This study aims to guide researchers and practitioners in selecting and optimizing AIGC models to enhance the quality and efficiency of content creation across diverse domains.
https://arxiv.org/abs/2412.01948
Can computer vision help us explore the ocean? The ultimate challenge for computer vision is to recognize any visual phenomena, not only the objects and animals humans encounter in their terrestrial lives. Previous datasets have explored everyday objects and fine-grained categories humans see frequently. We present the FathomVerse v0 detection dataset to push the limits of our field by exploring animals that rarely come into contact with people in the deep sea. These animals present a novel vision challenge. The FathomVerse v0 dataset consists of 3,843 images with 8,092 bounding boxes from 12 distinct morphological groups, recorded at two locations on the deep seafloor that are new to computer vision. It features visually perplexing scenarios such as an octopus intertwined with a sea star, and confounding categories like vampire squids and sea spiders. This dataset can push forward research on topics like fine-grained transfer learning, novel category discovery, species distribution modeling, and carbon cycle analysis, all of which are important to the care and husbandry of our planet.
https://arxiv.org/abs/2412.01701
To protect the large-scale computing environments necessary to meet increasing computing demand, cloud providers have implemented security measures to monitor Operations and Maintenance (O&M) activities and thereby prevent data loss and service interruption. Command interception systems are used to intercept, assess, and block dangerous Command-line Interface (CLI) commands before they can cause damage. Traditional solutions for command risk assessment include rule-based systems, which require expert knowledge and constant human revision to account for unseen commands. To overcome these limitations, several end-to-end learning systems have been proposed to classify CLI commands. These systems, however, have several other limitations, including the adoption of general-purpose text classifiers, which may not adapt to the language characteristics of scripting languages such as Bash or PowerShell and may not recognize dangerous commands in the presence of an unbalanced class distribution. In this paper, we propose a transformer-based command risk classification system, which leverages the generalization power of Large Language Models (LLMs) to provide accurate classification and to identify rare dangerous commands effectively by exploiting the power of transfer learning. We verify the effectiveness of our approach on a realistic dataset of production commands and show how to apply our model to other security-related tasks, such as dangerous command interception and auditing of existing rule-based systems.
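As a sketch of the kind of transformer-based classifier described above, the snippet below fine-tunes a code-oriented pretrained checkpoint for command risk classification with a weighted loss for the rare dangerous class; the checkpoint name, label set, and class weights are assumptions, not details from the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

labels = ["safe", "risky", "dangerous"]                 # hypothetical risk taxonomy
tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=len(labels)
)

cmd = "rm -rf /var/log/*"
inputs = tok(cmd, return_tensors="pt", truncation=True)
logits = model(**inputs).logits
target = torch.tensor([2])                              # ground-truth: "dangerous"
class_weights = torch.tensor([1.0, 2.0, 5.0])           # up-weight the rare class
loss = torch.nn.functional.cross_entropy(logits, target, weight=class_weights)
loss.backward()                                         # one fine-tuning step (optimizer omitted)
```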
https://arxiv.org/abs/2412.01655
Text Simplification is a task that has been minimally explored for low-resource languages. Consequently, only a few manually curated datasets exist. In this paper, we present a human-curated sentence-level text simplification dataset for the Sinhala language. Our evaluation dataset contains 1,000 complex sentences and the corresponding 3,000 simplified sentences produced by three different human annotators. We model the text simplification task as a zero-shot and zero-resource sequence-to-sequence (seq-seq) task on the multilingual language models mT5 and mBART. We exploit auxiliary data from related seq-seq tasks and explore the possibility of using intermediate task transfer learning (ITTL). Our analysis shows that ITTL outperforms the previously proposed zero-resource methods for text simplification. Our findings also highlight the challenges in evaluating text simplification systems and support the calls for improved metrics for measuring the quality of automated text simplification systems that would suit low-resource languages as well. Our code and data are publicly available: this https URL
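A minimal sketch of the zero-resource seq-seq setup with intermediate task transfer learning: mT5 is first tuned on a related auxiliary seq-seq task and then used for simplification without Sinhala simplification training pairs. The task prefix and placeholder sentences are assumptions.

```python
from transformers import MT5ForConditionalGeneration, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")

# (1) Intermediate task step: train on pairs from a related seq-seq task.
src = tok("simplify: <auxiliary-task source sentence>", return_tensors="pt")
tgt = tok("<auxiliary-task target sentence>", return_tensors="pt").input_ids
loss = model(**src, labels=tgt).loss       # optimize this loss as usual

# (2) Zero-shot inference on a Sinhala complex sentence.
out = model.generate(
    **tok("simplify: <Sinhala complex sentence>", return_tensors="pt"),
    max_new_tokens=64,
)
print(tok.decode(out[0], skip_special_tokens=True))
```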
https://arxiv.org/abs/2412.01293
This study takes a preliminary step toward teaching computers to recognize human emotions through Facial Emotion Recognition (FER). Transfer learning is applied using ResNeXt, EfficientNet models, and an ArcFace model originally trained on the facial verification task, leveraging the AffectNet database, a collection of human face images annotated with corresponding emotions. The findings highlight the value of congruent domain transfer learning, the challenges posed by imbalanced datasets in learning facial emotion patterns, and the effectiveness of pairwise learning in addressing class imbalances to enhance model performance on the FER task.
https://arxiv.org/abs/2412.01860
Around seven million individuals in India, Bangladesh, Bhutan, and Nepal speak Santali, making it roughly the third most commonly used Austroasiatic language. Despite its prominence within the Austroasiatic language family's Munda subfamily, Santali lacks global recognition. Currently, no translation models exist for the Santali language. Our paper aims to bring Santali into the NLP spectrum. We examine the feasibility of building Santali translation models based on available Santali corpora. The paper successfully addresses the low-resource problem and, with promising results, examines the possibility of creating a functional Santali machine translation model in a low-resource setup. Our study shows that the Santali-English parallel corpus performs better with pretrained transformers like mT5 than with untrained transformers, indicating that transfer learning is a viable technique for the Santali language. With the mT5 transformer, the Santali-English parallel corpus also performs better than the Santali-Bangla parallel corpus, as mT5 has been trained on far more English data than Bangla data. Lastly, our study shows that our model performs better with data augmentation.
https://arxiv.org/abs/2411.19726
Music foundation models are increasingly being released, promising a general, mostly task-independent encoding of musical information. Common ways of adapting music foundation models to downstream tasks are probing and fine-tuning. These common transfer learning approaches, however, face challenges: probing might lead to suboptimal performance because the pre-trained weights are frozen, while fine-tuning is computationally expensive and prone to overfitting. Our work investigates the use of parameter-efficient transfer learning (PETL) for music foundation models, which integrates the advantages of probing and fine-tuning. We introduce three types of PETL methods: adapter-based methods, prompt-based methods, and reparameterization-based methods. These methods train only a small number of parameters and therefore do not require significant computational resources. Results show that PETL methods outperform both probing and fine-tuning on music auto-tagging. On key detection and tempo estimation, they achieve results similar to fine-tuning at significantly lower training cost. However, the usefulness of the current generation of foundation models on key and tempo tasks is called into question by the similar results achieved by training a small model from scratch. Code available at this https URL
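For concreteness, below is a minimal bottleneck adapter of the kind used by adapter-based PETL: a small residual MLP inserted into an otherwise frozen foundation model, so only a few parameters are trained. The embedding and bottleneck sizes are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)       # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                    # x: (batch, frames/tokens, dim)
        return x + self.up(torch.relu(self.down(x)))

adapter = Adapter(dim=768)                   # only these parameters are trained
features = torch.randn(2, 250, 768)          # embeddings from a frozen music encoder
adapted = adapter(features)
```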
https://arxiv.org/abs/2411.19371
Effectively utilizing extensive unlabeled high-density EEG data to improve performance in scenarios with limited labeled low-density EEG data presents a significant challenge. In this paper, we address this by framing it as a graph transfer learning and knowledge distillation problem. We propose a Unified Pre-trained Graph Contrastive Masked Autoencoder Distiller, named EEG-DisGCMAE, to bridge the gap between unlabeled/labeled and high/low-density EEG data. To fully leverage the abundant unlabeled EEG data, we introduce a novel unified graph self-supervised pre-training paradigm, which seamlessly integrates Graph Contrastive Pre-training and Graph Masked Autoencoder Pre-training. This approach synergistically combines contrastive and generative pre-training techniques by reconstructing contrastive samples and contrasting the reconstructions. For knowledge distillation from high-density to low-density EEG data, we propose a Graph Topology Distillation loss function, allowing a lightweight student model trained on low-density data to learn from a teacher model trained on high-density data, effectively handling missing electrodes through contrastive distillation. To integrate transfer learning and distillation, we jointly pre-train the teacher and student models by contrasting their queries and keys during pre-training, enabling robust distillers for downstream tasks. We demonstrate the effectiveness of our method on four classification tasks across two clinical EEG datasets with abundant unlabeled data and limited labeled data. The experimental results show that our approach significantly outperforms contemporary methods in both efficiency and accuracy.
https://arxiv.org/abs/2411.19230
Going beyond few-shot action recognition (FSAR), cross-domain FSAR (CDFSAR) has attracted recent research interest by addressing the domain gap in source-to-target transfer learning. Existing CDFSAR methods mainly focus on joint training of source and target data to mitigate the side effects of the domain gap. However, such methods suffer from two limitations. First, pair-wise joint training requires retraining deep models for each combination of one source dataset and multiple target ones, which incurs heavy computation cost, especially for large source and small target data. Second, pre-trained models after joint training are applied to the target domain in a straightforward manner, hardly exploiting the full potential of pre-trained models and thus limiting recognition performance. To overcome the above limitations, this paper proposes a simple yet effective baseline, namely Temporal-Aware Model Tuning (TAMT), for CDFSAR. Specifically, TAMT adopts a decoupled paradigm of pre-training on source data and fine-tuning on target data, which avoids retraining for multiple target datasets with a single source. To effectively and efficiently explore the potential of pre-trained models in transferring to the target domain, TAMT proposes a Hierarchical Temporal Tuning Network (HTTN), whose core consists of local temporal-aware adapters (TAA) and global temporal-aware moment tuning (GTMT). In particular, TAA learns a few parameters to recalibrate the intermediate features of frozen pre-trained models, enabling efficient adaptation to target domains. Furthermore, GTMT helps to generate powerful video representations, improving matching performance on the target domain. Experiments on several widely used video benchmarks show that TAMT outperforms recently proposed counterparts by 13%$\sim$31%, achieving new state-of-the-art CDFSAR results.
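A schematic of what a local temporal-aware adapter could look like: a lightweight residual module that recalibrates frozen per-frame features with a depthwise temporal convolution. This is an interpretation of the idea rather than the paper's exact TAA design; all sizes are assumptions.

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64, kernel: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=kernel // 2, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                       # x: (batch, frames, dim)
        h = self.down(x).transpose(1, 2)        # (batch, bottleneck, frames)
        h = torch.relu(self.temporal(h)).transpose(1, 2)
        return x + self.up(h)                   # residual recalibration of frozen features

x = torch.randn(2, 16, 768)                     # 16 frames of backbone features
print(TemporalAdapter(768)(x).shape)            # torch.Size([2, 16, 768])
```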
https://arxiv.org/abs/2411.19041
The scarcity of data in medical domains hinders the performance of deep learning models. Data augmentation techniques can alleviate that problem, but they usually rely on functional transformations of the data that are not guaranteed to preserve the original task. Approximating the distribution of the data with generative models is a way of reducing that problem and also of obtaining new samples that resemble the original data. Denoising diffusion models are a promising deep learning technique that can learn good approximations of different kinds of data, such as images, time series, or tabular data. Automatic colonoscopy analysis, and specifically polyp localization in colonoscopy videos, is a task that can assist clinical diagnosis and treatment. The annotation of video frames for training a deep learning model is time-consuming, and usually only small datasets can be obtained. Fine-tuning application models using a large dataset of generated data could be an alternative to improve their performance. We conduct a set of experiments training different diffusion models that can jointly generate colonoscopy images with localization annotations using a combination of existing open datasets. The generated data is used in various transfer learning experiments on the task of polyp localization with a model based on YOLOv9 in the low-data regime.
https://arxiv.org/abs/2411.18926
Deep learning models need a sufficient amount of data in order to find the hidden patterns in it. The purpose of generative modeling is to learn the data distribution, thus allowing us to sample more data and augment the original dataset. In the context of physiological data, and more specifically electrocardiogram (ECG) data, given its sensitive nature and expensive data collection, we can exploit the benefits of generative models to enlarge existing datasets and improve downstream tasks, in our case classification of heart rhythm. In this work, we explore the usefulness of synthetic data generated with different deep learning generative models, namely Diffweave, Time-Diffusion, and Time-VQVAE, to obtain better classification results for two open-source multivariate ECG datasets. Moreover, we also investigate the effects of transfer learning by fine-tuning a synthetically pre-trained model and then progressively adding increasing proportions of real data. We conclude that although the synthetic samples resemble the real ones, the classification improvement when simply augmenting the real dataset is barely noticeable on individual datasets; but when both datasets are merged, the results show an increase across all metrics for the classifiers when using synthetic samples as augmented data. From the fine-tuning results, the Time-VQVAE generative model has shown to be superior to the others, but not powerful enough to achieve results close to a classifier trained with real data only. In addition, methods and metrics for measuring closeness between synthetic data and real data have been explored as a byproduct of the main research questions of this study.
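The transfer learning protocol amounts to pretraining on synthetic ECG and then fine-tuning on growing fractions of real data; a compact sketch is below, where make_classifier, synthetic_ds, and real_ds (with its subset/test_split helpers) are hypothetical placeholders rather than objects from the paper's code.

```python
import copy

def run_protocol(make_classifier, synthetic_ds, real_ds,
                 fractions=(0.1, 0.25, 0.5, 1.0)):
    base = make_classifier()
    base.fit(synthetic_ds)                    # pretrain on generated ECG samples
    scores = {}
    for frac in fractions:
        model = copy.deepcopy(base)           # start from the synthetic pretraining
        model.fit(real_ds.subset(frac))       # fine-tune on a fraction of real data
        scores[frac] = model.evaluate(real_ds.test_split())
    return scores                             # classification metrics per real-data fraction
```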
https://arxiv.org/abs/2411.18456
Over the past few decades, Artificial Intelligence (AI) has progressed from the initial machine learning stage to the deep learning stage, and now to the stage of foundational models. Foundational models are characterized by pre-training, transfer learning, and self-supervised learning, and pre-trained models can be fine-tuned and applied to various downstream tasks. Under the framework of foundational models, models such as Bidirectional Encoder Representations from Transformers (BERT) and the Generative Pre-trained Transformer (GPT) have greatly advanced the development of natural language processing (NLP), with many models building on BERT in particular. BERT broke through the limitation of one-way language modeling in pre-training by using a masked language model: it captures bidirectional context information to predict the masked words in a sequence, which improves the feature extraction ability of the model and makes it very useful for downstream tasks, especially for specialized applications. A model using a bidirectional encoder can better understand domain knowledge and be better applied to these downstream tasks. We therefore hope to help readers understand how this technology has evolved and improved model performance in various natural language processing tasks against the background of foundational models, and to reveal its importance in capturing context information and improving performance on downstream tasks. This article analyzes unidirectional and bidirectional models based on GPT and BERT and compares their differences based on the purpose of each model. It also briefly analyzes BERT and the improvements of some models based on BERT. The models' performance on the Stanford Question Answering Dataset (SQuAD) and the General Language Understanding Evaluation (GLUE) benchmark is compared.
https://arxiv.org/abs/2411.18021
EEG signals have emerged as a powerful tool in affective brain-computer interfaces, playing a crucial role in emotion recognition. However, current deep transfer learning-based methods for EEG recognition face challenges due to their reliance on both source and target data during model learning, which significantly affects model performance and generalization. To overcome this limitation, we propose a novel framework (PL-DCP) and introduce the concepts of feature disentanglement and prototype inference. The dual prototyping mechanism incorporates both domain and class prototypes: domain prototypes capture individual variations across subjects, while class prototypes represent the ideal class distributions within their respective domains. Importantly, the proposed PL-DCP framework operates exclusively with source data during training, meaning that target data remains completely unseen throughout the entire process. To address label noise, we employ a pairwise learning strategy that encodes proximity relationships between sample pairs, effectively reducing the influence of mislabeled data. Experimental validation on the SEED and SEED-IV datasets demonstrates that PL-DCP, despite not utilizing target data during training, achieves performance comparable to deep transfer learning methods that require both source and target data. This highlights the potential of PL-DCP as an effective and robust approach for EEG-based emotion recognition.
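To make prototype inference concrete, the toy sketch below builds class prototypes as mean feature vectors and assigns test samples to the nearest prototype; PL-DCP additionally maintains domain prototypes, feature disentanglement, and the pairwise-learning loss, which are omitted here.

```python
import torch

def class_prototypes(features: torch.Tensor, labels: torch.Tensor, num_classes: int):
    # features: (n_samples, dim), labels: (n_samples,) -> (num_classes, dim)
    return torch.stack([features[labels == c].mean(dim=0) for c in range(num_classes)])

def predict(features: torch.Tensor, prototypes: torch.Tensor):
    dists = torch.cdist(features, prototypes)      # (n_samples, num_classes)
    return dists.argmin(dim=1)                     # nearest-prototype label

feats = torch.randn(120, 64)                       # source-domain EEG features
labs = torch.randint(0, 3, (120,))                 # three emotion classes
protos = class_prototypes(feats, labs, num_classes=3)
print(predict(torch.randn(5, 64), protos))         # labels for unseen samples
```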
https://arxiv.org/abs/2412.00082
In deep learning theory, a critical question is to understand how neural networks learn hierarchical features. In this work, we study the learning of hierarchical polynomials of \textit{multiple nonlinear features} using three-layer neural networks. We examine a broad class of functions of the form $f^{\star}=g^{\star}\circ \mathbf{p}$, where $\mathbf{p}:\mathbb{R}^{d} \rightarrow \mathbb{R}^{r}$ represents multiple quadratic features with $r \ll d$ and $g^{\star}:\mathbb{R}^{r}\rightarrow \mathbb{R}$ is a polynomial of degree $p$. This can be viewed as a nonlinear generalization of the multi-index model \citep{damian2022neural}, and also as an expansion upon previous work that focused only on a single nonlinear feature, i.e. $r = 1$ \citep{nichani2023provable,wang2023learning}. Our primary contribution shows that a three-layer neural network trained via layerwise gradient descent suffices for (i) complete recovery of the space spanned by the nonlinear features and (ii) efficient learning of the target function $f^{\star}=g^{\star}\circ \mathbf{p}$, or transfer learning of $f=g\circ \mathbf{p}$ with a different link function, within $\widetilde{\mathcal{O}}(d^4)$ samples and polynomial time. For such hierarchical targets, our result substantially improves upon the sample complexity $\Theta(d^{2p})$ of kernel methods, demonstrating the power of efficient feature learning. It is important to highlight that our results leverage novel techniques and thus manage to go beyond all prior settings such as single-index and multi-index models as well as models depending on just one nonlinear feature, contributing to a more comprehensive understanding of feature learning in deep learning.
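As a concrete toy instance of this function class (not taken from the paper), take $d$ large, $r = 2$ quadratic features, and a degree-$2$ link:
$$
\mathbf{p}(x) = \bigl(x^\top A_1 x,\; x^\top A_2 x\bigr) \in \mathbb{R}^{2},
\qquad
f^{\star}(x) = g^{\star}\bigl(\mathbf{p}(x)\bigr), \quad g^{\star}(z_1, z_2) = z_1^{2} + z_1 z_2,
$$
with $A_1, A_2 \in \mathbb{R}^{d\times d}$. Learning $f^{\star}$ requires recovering the span of the two quadratic features, and transfer learning corresponds to reusing $\mathbf{p}$ with a different link function $g$.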
https://arxiv.org/abs/2411.17201
Crack detection plays a pivotal role in the maintenance and safety of infrastructure, including roads, bridges, and buildings, as timely identification of structural damage can prevent accidents and reduce costly repairs. Traditionally, manual inspection has been the norm, but it is labor-intensive, subjective, and hazardous. This paper introduces an advanced approach for crack detection in infrastructure using deep learning, leveraging transfer learning, spatial attention mechanisms, and genetic algorithm (GA) optimization. To address the challenge of the inaccessibility of large amounts of data, we employ ResNet50 as a pre-trained model, utilizing its strong feature extraction capabilities while reducing the need for extensive training datasets. We enhance the model with a spatial attention layer as well as a customized neural network whose architecture was fine-tuned using the GA. A comprehensive case study demonstrates the effectiveness of the proposed Attention-ResNet50-GA model, achieving a precision of 0.9967 and an F1 score of 0.9983, outperforming conventional methods. The results highlight the model's ability to accurately detect cracks in various conditions, making it highly suitable for real-world applications where large annotated datasets are scarce.
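Below is a sketch of a standard (CBAM-style) spatial attention layer of the kind added on top of ResNet50 features; the paper's exact attention design and the GA-tuned classifier head may differ, so the kernel size and feature shape are assumptions.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                                   # x: (batch, C, H, W)
        avg_pool = x.mean(dim=1, keepdim=True)              # (batch, 1, H, W)
        max_pool = x.amax(dim=1, keepdim=True)              # (batch, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_pool, max_pool], dim=1)))
        return x * attn                                     # spatially reweighted features

feats = torch.randn(2, 2048, 7, 7)                          # ResNet50 final-stage features
print(SpatialAttention()(feats).shape)                      # torch.Size([2, 2048, 7, 7])
```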
https://arxiv.org/abs/2411.17140
Segmenting intraglomerular tissue and lesions traditionally depends on detailed morphological evaluation by expert nephropathologists, a labor-intensive process susceptible to interobserver variability. Our group previously developed the Glo-In-One toolkit for integrated detection and segmentation of glomeruli. In this study, we extend the Glo-In-One toolkit to version 2 with fine-grained segmentation capabilities, curating 14 distinct labels for tissue regions, cells, and lesions across a dataset of 23,529 annotated glomeruli from human and mouse histopathology data. To our knowledge, this dataset is among the largest of its kind to date. In this study, we present a single dynamic-head deep learning architecture designed to segment 14 classes within partially labeled images of human and mouse pathology data. Our model was trained using a training set derived from 368 annotated kidney whole-slide images (WSIs) to identify 5 key intraglomerular tissues: Bowman's capsule, glomerular tuft, mesangium, mesangial cells, and podocytes. Additionally, the network segments 9 glomerular lesion classes, including adhesion, capsular drop, global sclerosis, hyalinosis, mesangial lysis, microaneurysm, nodular sclerosis, mesangial expansion, and segmental sclerosis. The glomerulus segmentation model achieved decent performance compared with baselines, reaching a 76.5% average Dice Similarity Coefficient (DSC). Additionally, transfer learning from rodent to human for the glomerular lesion segmentation model enhanced the average segmentation accuracy across different types of lesions by more than 3%, as measured by Dice scores. The Glo-In-One-v2 model and trained weights have been made publicly available at https://github.com/hrlblab/Glo-In-One_v2.
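For reference, the Dice Similarity Coefficient reported above compares a predicted mask A with a ground-truth mask B as 2|A∩B| / (|A| + |B|); a small helper is sketched below (the smoothing constant is an implementation convenience, not a value from the paper).

```python
import numpy as np

def dice(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    # pred and target are binary masks of the same shape
    pred, target = pred.astype(bool), target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)
```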
https://arxiv.org/abs/2411.16961
Sign language processing has traditionally relied on task-specific models, limiting the potential for transfer learning across tasks. We introduce SHuBERT (Sign Hidden-Unit BERT), a self-supervised transformer encoder that learns strong representations from approximately 1,000 hours of American Sign Language (ASL) video content. Inspired by the success of the HuBERT speech representation model, SHuBERT adapts masked prediction to multi-stream visual sign language input, learning to predict multiple targets corresponding to clustered hand, face, and body pose streams. SHuBERT achieves state-of-the-art performance across multiple benchmarks. On sign language translation, it outperforms prior methods trained on publicly available data on the How2Sign (+0.7 BLEU), OpenASL (+10.0 BLEU), and FLEURS-ASL (+0.3 BLEU) benchmarks. Similarly, for isolated sign language recognition, SHuBERT's accuracy surpasses that of specialized models on ASL-Citizen (+5%) and SEM-LEX (+20.6%), while coming close to them on WLASL2000 (-3%). Ablation studies confirm the contribution of each component of the approach.
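A simplified view of the multi-stream masked prediction objective: masked time steps in each stream (hand, face, body pose) are predicted as discrete cluster indices with a separate head per stream. The encoder, cluster counts, masking ratio, and stream names here are illustrative assumptions, not SHuBERT's actual configuration.

```python
import torch
import torch.nn as nn

num_clusters = {"hand": 512, "face": 512, "body": 256}     # assumed codebook sizes
dim, frames, batch = 768, 100, 2
heads = nn.ModuleDict({k: nn.Linear(dim, v) for k, v in num_clusters.items()})

hidden = torch.randn(batch, frames, dim)                   # transformer output over masked input
mask = torch.rand(batch, frames) < 0.4                     # which time steps were masked
loss = 0.0
for name, head in heads.items():
    targets = torch.randint(0, num_clusters[name], (batch, frames))  # per-stream cluster ids
    logits = head(hidden)[mask]                             # predict only at masked positions
    loss = loss + nn.functional.cross_entropy(logits, targets[mask])
```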
https://arxiv.org/abs/2411.16765
This study explores the effectiveness of multi-temporal satellite imagery for improved functional field boundary delineation using a deep learning semantic segmentation architecture on two distinct geographical and multi-scale farming systems in the Netherlands and Pakistan. Multi-date images from April, August, and October 2022 were acquired from PlanetScope and Sentinel-2 for subregions of the Netherlands, and from November 2022, February 2023, and March 2023 for the selected area of Dunyapur in Pakistan. For the Netherlands, the Basic registration crop parcels (BRP) vector layer was used as labeled training data, while self-crafted field boundary vector data were used for Pakistan. Four deep learning models with a UNET architecture were evaluated using different combinations of multi-date images and NDVI stacks in the Netherlands subregions. A comparative analysis of IoU scores assessed the effectiveness of the proposed multi-date NDVI stack approach. These findings were then applied for transfer learning, using models pre-trained on the Netherlands for the selected area in Pakistan. Additionally, separate models were trained using the self-crafted field boundary data for Pakistan, and combined models were developed using data from both the Netherlands and Pakistan. Results indicate that multi-date NDVI stacks provide additional temporal context, reflecting crop growth over different times of the season. The study underscores the critical role of multi-scale ground information from diverse geographical areas in developing robust and universally applicable models for field boundary delineation. The results also highlight the importance of fine spatial resolution for extraction of field boundaries in regions with small-scale farming. The findings can be extended to multi-scale implementations for improved automatic field boundary delineation in heterogeneous agricultural environments.
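For clarity, the multi-date NDVI stack is simply the normalized difference of the near-infrared and red bands computed per acquisition date and stacked along a new channel axis; the array shapes below are placeholders.

```python
import numpy as np

def ndvi(red: np.ndarray, nir: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    return (nir - red) / (nir + red + eps)

# Three acquisition dates (e.g. April, August, October), each giving (H, W) bands:
dates = [(np.random.rand(256, 256), np.random.rand(256, 256)) for _ in range(3)]
ndvi_stack = np.stack([ndvi(red, nir) for red, nir in dates], axis=0)  # (3, H, W)
# The stack (alone or alongside the raw multi-date bands) is then fed to the UNET model.
```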
https://arxiv.org/abs/2411.15923