Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged between 20 and 49. Although the generated speech is intelligible, it is still of lower quality than that of models trained on studio-grade recordings, owing to insufficient preprocessing of the Common Voice recordings. Furthermore, speech convergence is harder to achieve because of varying intonations and background noise. In this paper, we show that the quality of Luganda TTS trained on Common Voice can be improved by training on multiple speakers of close intonation and by further preprocessing the training data. Specifically, we selected six female speakers with close intonation, determined by subjectively listening to and comparing their voice recordings. In addition to trimming silent portions from the beginning and end of the recordings, we applied a pretrained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pretrained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to keep only recordings with an estimated MOS above 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
https://arxiv.org/abs/2405.10211
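The preprocessing pipeline described above (silence trimming plus quality filtering on an estimated MOS above 3.5) can be sketched as follows. This is a minimal illustration, not the authors' tooling: the energy-threshold trimmer is a simplified stand-in for their silence trimming, and `estimate_mos` is a placeholder for the pretrained non-intrusive MOS predictor.

```python
import numpy as np

def trim_silence(audio: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """Trim leading/trailing samples whose absolute amplitude is below threshold."""
    voiced = np.flatnonzero(np.abs(audio) >= threshold)
    if voiced.size == 0:
        return audio[:0]  # entirely silent clip
    return audio[voiced[0]:voiced[-1] + 1]

def filter_by_mos(clips, estimate_mos, mos_threshold=3.5):
    """Keep only trimmed clips whose estimated MOS exceeds the threshold (3.5 in the paper)."""
    kept = []
    for clip in clips:
        trimmed = trim_silence(clip)
        if trimmed.size and estimate_mos(trimmed) > mos_threshold:
            kept.append(trimmed)
    return kept

# Toy usage with a stand-in MOS estimator (higher RMS -> higher "quality").
fake_mos = lambda a: 1.0 + 4.0 * min(1.0, float(np.sqrt(np.mean(a ** 2))) * 5)
clips = [np.concatenate([np.zeros(100), 0.5 * np.ones(200), np.zeros(50)]),
         0.001 * np.ones(300)]  # second clip: near-silent, filtered out
selected = filter_by_mos(clips, fake_mos)
```

In practice the speech enhancement model would run between trimming and MOS estimation; it is omitted here to keep the sketch self-contained.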
Large-scale "foundation models" have gained traction as a way to leverage the vast amounts of unlabeled remote sensing data collected every day. However, due to the multiplicity of Earth Observation satellites, these models should learn "sensor-agnostic" representations that generalize across sensor characteristics with minimal fine-tuning. This is complicated by data availability: low-resolution imagery, such as Sentinel-2 and Landsat-8 data, is available in large amounts, while very high-resolution aerial or satellite data is less common. To tackle these challenges, we introduce cross-sensor self-supervised training and alignment for remote sensing (X-STARS). We design a self-supervised training loss, the Multi-Sensor Alignment Dense loss (MSAD), to align representations across sensors, even with vastly different resolutions. X-STARS can be applied to train models from scratch, or to adapt large models pretrained on, e.g., low-resolution EO data to new high-resolution sensors in a continual pretraining framework. We collect and release MSC-France, a new multi-sensor dataset, on which we train our X-STARS models and then evaluate them on seven downstream classification and segmentation tasks. We demonstrate that X-STARS outperforms the state of the art by a significant margin with less data, across various conditions of data availability and resolution.
https://arxiv.org/abs/2405.09922
Correspondence-based statistical shape modeling (SSM) stands as a powerful technology for morphometric analysis in clinical research. SSM facilitates population-level characterization and quantification of anatomical shapes such as bones and organs, aiding in pathology and disease diagnostics and treatment planning. Despite its potential, SSM remains under-utilized in medical research due to the significant overhead associated with automatic construction methods, which demand complete, aligned shape surface representations. Additionally, optimization-based techniques rely on bias-inducing assumptions or templates and have prolonged inference times, as the entire cohort is optimized simultaneously. To overcome these challenges, we introduce Point2SSM++, a principled, self-supervised deep learning approach that directly learns correspondence points from point cloud representations of anatomical shapes. Point2SSM++ is robust to misaligned and inconsistent input, providing SSM that accurately samples individual shape surfaces while effectively capturing population-level statistics. Additionally, we present principled extensions of Point2SSM++ tailored for dynamic spatiotemporal and multi-anatomy use cases, showcasing the broad versatility of the framework. Through extensive validation across diverse anatomies, evaluation metrics, and clinically relevant downstream tasks, we demonstrate Point2SSM++'s superiority over existing state-of-the-art deep learning models and traditional approaches. Point2SSM++ substantially enhances the feasibility of SSM generation and significantly broadens its array of potential clinical applications.
https://arxiv.org/abs/2405.09707
Synthetic aperture radar (SAR) is essential for actively acquiring information for Earth observation. SAR Automatic Target Recognition (ATR) focuses on detecting and classifying various target categories under different imaging conditions. Current deep learning-based SAR ATR methods are typically designed for specific datasets and applications; the varied target characteristics, scene background information, and sensor parameters across ATR datasets challenge their generalization. This paper aims to achieve general SAR ATR based on a foundation model with Self-Supervised Learning (SSL). Our motivation is to break through the limitations of specific datasets and conditions and obtain universal perceptual capabilities across targets, scenes, and sensors. A foundation model named SARATR-X is proposed along four aspects: pre-training dataset, model backbone, SSL, and evaluation tasks. First, we integrated 14 datasets with various target categories and imaging conditions into a pre-training dataset. Second, different model backbones were compared to find the most suitable approaches for remote-sensing images. Third, we applied two-stage training and SAR gradient features to ensure the diversity and scalability of SARATR-X. Finally, SARATR-X achieved competitive and superior performance on 5 datasets with 8 task settings, which shows that a foundation model can achieve universal SAR ATR. We believe it is time to embrace foundation models for SAR image interpretation in the era of increasing big data.
https://arxiv.org/abs/2405.09365
Recent years have witnessed the rapid development of short videos, which usually contain both visual and audio modalities. Background music is important to short videos and can significantly influence the emotions of viewers. However, at present, the background music of a short video is generally chosen by the video producer, and there is a lack of automatic music recommendation methods for short videos. This paper introduces MVBind, an innovative Music-Video embedding space Binding model for cross-modal retrieval. MVBind operates as a self-supervised approach, acquiring inherent knowledge of intermodal relationships directly from data, without the need for manual annotations. Additionally, to compensate for the lack of a corresponding music-video pair dataset for short videos, we construct SVM-10K (Short Video with Music-10K), a dataset consisting mainly of meticulously selected short videos. On this dataset, MVBind shows significantly improved performance compared to other baseline methods. The constructed dataset and code will be released to facilitate future research.
https://arxiv.org/abs/2405.09286
Model Predictive Control (MPC)-based trajectory planning has been widely used in robotics, and incorporating Control Barrier Function (CBF) constraints into MPC can greatly improve its obstacle avoidance efficiency. Unfortunately, traditional optimizers are resource-consuming and slow to solve such non-convex constrained optimization problems (COPs), while learning-based methods struggle to satisfy the non-convex constraints. In this paper, we propose SOMTP, a self-supervised learning-based optimizer for CBF-MPC trajectory planning. Specifically, SOMTP first employs problem transcription to satisfy most of the constraints. A differentiable SLPG correction is then proposed to move the solution closer to the safe set, and it is subsequently converted into the guide policy for the following training process. After that, inspired by the Augmented Lagrangian Method (ALM), we propose a training algorithm integrated with guide-policy constraints to enable the optimizer network to converge to a feasible solution. Finally, experiments show that the proposed algorithm has better feasibility than other learning-based methods and provides solutions much faster than traditional optimizers with similar optimality.
https://arxiv.org/abs/2405.09212
Dynamic manipulation of free-end cables has applications for cable management in homes, warehouses, and manufacturing plants. We present a supervised learning approach for dynamic manipulation of free-end cables, focusing on the problem of getting the cable endpoint to a designated target position, which may lie outside the reachable workspace of the robot end effector. We present a simulator, tune it to closely match experiments with physical cables, and then collect training data for learning dynamic cable manipulation. We evaluate with three cables and a physical UR5 robot. Results over 32x5 trials suggest that the robot can attain a median error distance ranging from 22% to 35% of the cable length across cables, outperforming an analytic baseline by 21% and a Gaussian Process baseline by 7%, with a lower interquartile range (IQR).
https://arxiv.org/abs/2405.09581
This paper proposes leveraging vision-language pretraining on bone X-rays paired with French reports to address downstream tasks of interest on bone radiography. A practical processing pipeline is introduced to anonymize and process French medical reports. Pretraining then consists of the self-supervised alignment of visual and textual embedding spaces derived from deep model encoders. The resulting image encoder is then used to handle various downstream tasks, including quantification of osteoarthritis, estimation of bone age on pediatric wrists, and bone fracture and anomaly detection. Our approach demonstrates competitive performance on downstream tasks compared to alternatives requiring a significantly larger amount of human expert annotations. Our work stands as the first study to integrate French reports to shape the embedding space devoted to bone X-ray representations, capitalizing on the large quantity of paired image and report data available in a hospital. By relying on generic vision-language deep models in a language-specific scenario, it contributes to the deployment of vision models for wider healthcare applications.
https://arxiv.org/abs/2405.08932
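The pretraining step, self-supervised alignment of visual and textual embedding spaces, is commonly implemented with a symmetric InfoNCE (CLIP-style) objective. The abstract does not spell out the exact loss, so the sketch below is an assumed, standard formulation over a batch of paired (image, report) embeddings:

```python
import numpy as np

def symmetric_infonce(img_emb: np.ndarray, txt_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Rows of img_emb and txt_emb are L2-normalized; matching pairs share an index.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # pairwise cosine similarities, scaled
    labels = np.arange(len(img))

    def xent(l):
        # cross-entropy with the diagonal (matching pair) as the positive class
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # image-to-text and text-to-image directions, averaged
    return 0.5 * (xent(logits) + xent(logits.T))

# Perfectly aligned pairs should give a much lower loss than mismatched ones.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
aligned = symmetric_infonce(emb, emb)
shuffled = symmetric_infonce(emb, np.roll(emb, 1, axis=0))
```

The temperature of 0.07 follows common CLIP practice; the paper's actual value, encoders, and batch construction are not given here.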
CLIP models perform remarkably well on zero-shot classification and retrieval tasks. But recent studies have shown that the learnt representations in CLIP are not well suited for dense prediction tasks like object detection, semantic segmentation, or depth estimation. More recently, multi-stage training methods for CLIP models were introduced to mitigate the weak performance of CLIP on downstream tasks. In this work, we find that simply improving the quality of captions in image-text datasets improves the quality of CLIP's visual representations, resulting in significant improvement on downstream dense prediction vision tasks. In fact, we find that CLIP pretraining with good quality captions can surpass recent supervised, self-supervised, and weakly supervised pretraining methods. We show that when a CLIP model with ViT-B/16 as the image encoder is trained on well-aligned image-text pairs, it obtains 12.1% higher mIoU and 11.5% lower RMSE on semantic segmentation and depth estimation tasks over recent state-of-the-art Masked Image Modeling (MIM) pretraining methods like Masked Autoencoder (MAE). We find that mobile architectures also benefit significantly from CLIP pretraining. A recent mobile vision architecture, MCi2, with CLIP pretraining obtains performance similar to Swin-L pretrained on ImageNet-22k on the semantic segmentation task, while being 6.1$\times$ smaller. Moreover, we show that improving caption quality results in $10\times$ data efficiency when fine-tuning for dense prediction tasks.
https://arxiv.org/abs/2405.08911
Current video summarization methods primarily depend on supervised computer vision techniques, which demand time-consuming manual annotations. Further, the annotations are always subjective, which makes this task more challenging. To address these issues, we analyzed the feasibility of transforming video summarization into a text summarization task and leveraging Large Language Models (LLMs) to boost video summarization. This paper proposes a novel self-supervised framework for video summarization guided by LLMs. Our method begins by generating captions for video frames, which are then synthesized into a text summary by an LLM. Subsequently, we measure the semantic distance between the frame captions and the text summary. Notably, we propose a novel loss function that optimizes our model according to the diversity of the video. Finally, the summarized video is generated by selecting the frames whose captions are most similar to the text summary. Our model achieves competitive results against other state-of-the-art methods and paves a novel pathway in video summarization.
https://arxiv.org/abs/2405.08890
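The final selection step, picking frames whose captions are semantically close to the LLM-generated summary, can be sketched with cosine similarity over caption and summary embeddings. The embedding model and the choice of `k` are assumptions; the paper's actual semantic-distance measure and diversity loss are not reproduced here.

```python
import numpy as np

def select_summary_frames(caption_embs: np.ndarray, summary_emb: np.ndarray, k: int = 3):
    """Return indices of the k frames whose caption embeddings are most similar
    (by cosine similarity) to the text-summary embedding."""
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    summ = summary_emb / np.linalg.norm(summary_emb)
    sims = caps @ summ  # cosine similarity of each caption to the summary
    return np.argsort(-sims)[:k]

# Toy example: captions 0 and 4 point in the summary's direction, so their
# frames are selected for the summarized video.
summary = np.array([1.0, 0.0, 0.0])
captions = np.array([[0.9, 0.1, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0],
                     [-1.0, 0.0, 0.0],
                     [0.8, 0.0, 0.2]])
picked = select_summary_frames(captions, summary, k=2)
```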
In the realm of autonomous driving, robust perception under out-of-distribution conditions is paramount for the safe deployment of vehicles. Challenges such as adverse weather, sensor malfunctions, and environmental unpredictability can severely impact the performance of autonomous systems. The 2024 RoboDrive Challenge was crafted to propel the development of driving perception technologies that can withstand and adapt to these real-world variabilities. Focusing on four pivotal tasks -- BEV detection, map segmentation, semantic occupancy prediction, and multi-view depth estimation -- the competition challenged participants to innovate and to harden system resilience against typical and atypical disturbances. This year's challenge consisted of five distinct tracks and attracted 140 registered teams from 93 institutes across 11 countries, resulting in nearly one thousand submissions evaluated through our servers. The competition culminated in 15 top-performing solutions, which introduced a range of innovative approaches including advanced data augmentation, multi-sensor fusion, self-supervised learning for error correction, and new algorithmic strategies to enhance sensor robustness. These contributions significantly advanced the state of the art, particularly in handling sensor inconsistencies and environmental variability. Participants, through collaborative efforts, pushed the boundaries of current technologies, showcasing their potential in real-world scenarios. Extensive evaluations and analyses provided insights into the effectiveness of these solutions, highlighting key trends and successful strategies for improving the resilience of driving perception systems. This challenge has set a new benchmark in the field, providing a rich repository of techniques expected to guide future research.
https://arxiv.org/abs/2405.08816
The superior performance of modern visual backbones usually comes with a costly training procedure. We address this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize certain 'easier-to-learn' discriminative patterns in the data. Viewed in the frequency and spatial domains, these patterns comprise lower-frequency components and natural image content without distortion or data augmentation. Motivated by these findings, we propose a curriculum in which the model always leverages all the training data at every learning stage, yet exposure to the 'easier-to-learn' patterns of each example comes first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. We then show that exposing the content of natural images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these aspects and design curriculum schedules with tailored search algorithms. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. It reduces the training time of a wide variety of popular models by 1.5-3.0x on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).
https://arxiv.org/abs/2405.08768
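The Fourier-spectrum cropping step can be illustrated on a single channel: shift the 2-D spectrum, keep a centered band of low frequencies, and invert it, producing a smaller image that retains only the lower-frequency components. This is a minimal sketch, not the paper's implementation; the mean-preserving rescaling at the end is an assumed detail.

```python
import numpy as np

def low_freq_crop(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Keep only a centered bandwidth x bandwidth window of the shifted 2-D
    spectrum, then invert it to obtain a smaller, low-pass-filtered image."""
    h, w = image.shape
    spec = np.fft.fftshift(np.fft.fft2(image))  # zero frequency at the center
    cy, cx = h // 2, w // 2
    b = bandwidth // 2
    cropped = spec[cy - b:cy + b, cx - b:cx + b]  # discard high frequencies
    out = np.fft.ifft2(np.fft.ifftshift(cropped)).real
    # rescale so mean intensity matches the original (the FFT size changed)
    return out * (bandwidth * bandwidth) / (h * w)

img = np.add.outer(np.linspace(0, 1, 64), np.linspace(0, 1, 64))  # smooth ramp
small = low_freq_crop(img, 32)
```

Training on `small` instead of `img` early on is what makes the curriculum computationally cheaper: the model sees only low-frequency content on smaller inputs.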
This paper addresses the problem of self-supervised general-purpose audio representation learning. We explore the use of Joint-Embedding Predictive Architectures (JEPA) for this task, which consists of splitting an input mel-spectrogram into two parts (context and target), computing neural representations for each, and training the neural network to predict the target representations from the context representations. We investigate several design choices within this framework and study their influence through extensive experiments by evaluating our models on various audio classification benchmarks, including environmental sounds, speech and music downstream tasks. We focus notably on which part of the input data is used as context or target and show experimentally that it significantly impacts the model's quality. In particular, we notice that some effective design choices in the image domain lead to poor performance on audio, thus highlighting major differences between these two modalities.
https://arxiv.org/abs/2405.08679
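The context/target split at the core of JEPA can be sketched as follows. The paper studies many splitting strategies (which parts serve as context or target); this contiguous time-axis split is just one illustrative, assumed choice, and `target_frac` is a hypothetical parameter.

```python
import numpy as np

def split_context_target(mel: np.ndarray, target_frac: float = 0.25, rng=None):
    """Split a (mel_bins, time) spectrogram along the time axis: a contiguous
    block of frames becomes the prediction target, the rest is context.
    Returns (context, target, target_slice)."""
    rng = rng or np.random.default_rng()
    n_frames = mel.shape[1]
    t_len = max(1, int(n_frames * target_frac))
    start = int(rng.integers(0, n_frames - t_len + 1))
    target = mel[:, start:start + t_len]
    context = np.delete(mel, np.s_[start:start + t_len], axis=1)  # remaining frames
    return context, target, slice(start, start + t_len)

# Fake 80-bin, 100-frame mel-spectrogram; the network would encode `ctx` and
# be trained to predict the representation of `tgt`.
mel = np.arange(80 * 100, dtype=float).reshape(80, 100)
ctx, tgt, where = split_context_target(mel, target_frac=0.25, rng=np.random.default_rng(0))
```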
Depth estimation plays a crucial role in various tasks within endoscopic surgery, including navigation, surface reconstruction, and augmented reality visualization. Despite the significant achievements of foundation models in vision tasks, including depth estimation, their direct application to the medical domain often results in suboptimal performance. This highlights the need for efficient methods to adapt these models to endoscopic depth estimation. We propose Endoscopic Depth Any Camera (EndoDAC), an efficient self-supervised depth estimation framework that adapts foundation models to endoscopic scenes. Specifically, we develop Dynamic Vector-Based Low-Rank Adaptation (DV-LoRA) and employ Convolutional Neck blocks to tailor the foundation model to the surgical domain, using remarkably few trainable parameters. Given that camera information is not always accessible, we also introduce a self-supervised adaptation strategy that estimates camera intrinsics using the pose encoder. Our framework can be trained solely on monocular surgical videos from any camera, ensuring minimal training costs. Experiments demonstrate that our approach obtains superior performance even with fewer training epochs and without knowledge of the ground-truth camera intrinsics. Code is available at this https URL.
https://arxiv.org/abs/2405.08672
Self-supervised learning (SSL) is an approach to extract useful feature representations from unlabeled data and enable fine-tuning on downstream tasks with limited labeled examples. Self-pretraining is an SSL approach that uses the curated task dataset both for pretraining the networks and for fine-tuning them. The availability of large, diverse, and uncurated public medical image sets provides the opportunity to apply SSL in the "wild" and potentially extract features robust to imaging variations. However, the benefit of wild- versus self-pretraining has not been studied for medical image analysis. In this paper, we compare the robustness of wild- versus self-pretrained transformer (vision transformer [ViT] and hierarchical shifted window [Swin]) models to computed tomography (CT) imaging differences for non-small cell lung cancer (NSCLC) segmentation. Wild-pretrained Swin models outperformed self-pretrained Swin across the various imaging acquisitions. ViT resulted in similar accuracy for both wild- and self-pretrained models. The masked image prediction pretext task, which forces networks to learn local structure, resulted in higher accuracy than the contrastive task that models global image information. Wild-pretrained models showed higher feature reuse at the lower-level layers and feature differentiation close to the output layer after fine-tuning. Hence, we conclude: wild-pretrained networks were more robust to the analyzed CT imaging differences for lung tumor segmentation than self-pretrained methods, and the Swin architecture benefited from such pretraining more than ViT.
https://arxiv.org/abs/2405.08657
Self-supervised learning has shown great success in Speech Recognition. However, it has been observed that fine-tuning all layers of the learned model leads to lower performance compared to resetting the top layers. This phenomenon is attributed to the ''autoencoder'' behavior: top layers contain information closer to the input and are less suitable for tasks that require linguistic information, such as Speech Recognition. To better our understanding of this behavior, we propose to study the evolution of high-level information within the model during pretraining. We focus on the HuBERT model, which exhibits a less pronounced ''autoencoder'' behavior. By experimentally exploring various factors that may have an impact, we aim to improve the training procedure and enhance the top layers of HuBERT for high-level tasks. Furthermore, our experiments demonstrate that these improvements in the training procedure result in faster convergence and competitive performance on downstream tasks.
https://arxiv.org/abs/2405.08402
From a feature-matching perspective, optical flow estimation for event cameras involves identifying event correspondences by comparing feature similarity across accompanying event frames. In this work, we introduce an effective and robust high-dimensional (HD) feature descriptor for event frames, utilizing Vector Symbolic Architectures (VSA). The topological similarity among neighboring variables within VSA contributes to the enhanced representation similarity of feature descriptors for flow-matching points, while its structured symbolic representation capacity facilitates feature fusion from both event polarities and multiple spatial scales. Based on this HD feature descriptor, we propose a novel feature matching framework for event-based optical flow, encompassing both model-based (VSA-Flow) and self-supervised learning (VSA-SM) methods. In VSA-Flow, accurate optical flow estimation validates the effectiveness of HD feature descriptors. In VSA-SM, a novel similarity maximization method based on the HD feature descriptor is proposed to learn optical flow in a self-supervised way from events alone, eliminating the need for auxiliary grayscale images. Evaluation results demonstrate that our VSA-based method achieves superior accuracy compared to both model-based and self-supervised learning methods on the DSEC benchmark, while remaining competitive with both on the MVSEC benchmark. This contribution marks a significant advancement in event-based optical flow within the feature-matching methodology.
https://arxiv.org/abs/2405.08300
The ability of deep networks to learn superior representations hinges on leveraging the proper inductive biases, considering the inherent properties of datasets. In tabular domains, it is critical to effectively handle heterogeneous features (both categorical and numerical) in a unified manner and to grasp irregular functions such as piecewise constant functions. To address these challenges in the self-supervised learning framework, we propose a novel pretext task based on the classical binning method. The idea is straightforward: reconstructing the bin indices (either orders or classes) rather than the original values. This pretext task provides the encoder with an inductive bias to capture the irregular dependencies, mapping from continuous inputs to discretized bins, and mitigates feature heterogeneity by setting all features to have category-type targets. Our empirical investigations ascertain several advantages of binning: capturing the irregular function, compatibility with encoder architectures and additional modifications, standardizing all features into equal sets, grouping similar values within a feature, and providing ordering information. Comprehensive evaluations across diverse tabular datasets corroborate that our method consistently improves tabular representation learning performance for a wide range of downstream tasks. The code is available at this https URL.
https://arxiv.org/abs/2405.07414
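The bin-index targets at the heart of this pretext task can be computed with standard quantile binning. The sketch below covers only the target construction; the encoder, the reconstruction head, and the actual number of bins used in the paper are omitted, and `n_bins` is an assumed hyperparameter.

```python
import numpy as np

def quantile_bin_targets(column: np.ndarray, n_bins: int = 4) -> np.ndarray:
    """Map a numerical feature to bin indices (0..n_bins-1) by empirical
    quantiles. The encoder is then trained to reconstruct these indices
    instead of the raw values, giving every feature, categorical or
    numerical, a category-type target."""
    # interior quantile cut points, e.g. the 25th/50th/75th percentiles for 4 bins
    edges = np.quantile(column, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.digitize(column, edges)

# A heavy-tailed feature: equal-width bins would lump most values together,
# but quantile bins spread them evenly while preserving their ordering.
x = np.array([0.1, 0.2, 0.35, 0.5, 5.0, 9.0, 9.5, 100.0])
targets = quantile_bin_targets(x, n_bins=4)
```

Because `np.digitize` is monotone in its input, the bin indices retain the ordering information the paper highlights, while grouping similar values within each feature.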
Machine unlearning is a complex process that requires the model to diminish the influence of the training data while keeping the loss of accuracy to a minimum. Despite the numerous studies on machine unlearning in recent years, the majority have focused primarily on supervised learning models, leaving research on contrastive learning models relatively underexplored. Convinced that self-supervised learning harbors promising potential, surpassing or rivaling that of supervised learning, we set out to investigate methods for machine unlearning centered around contrastive learning models. In this study, we introduce a novel gradient constraint-based approach for training the model to effectively achieve machine unlearning. Our method requires only a minimal number of training epochs and the identification of the data slated for unlearning. Remarkably, our approach demonstrates proficient performance not only on contrastive learning models but also on supervised learning models, showcasing its versatility and adaptability across learning paradigms.
https://arxiv.org/abs/2405.07317
This study introduces SI-CLEER, a novel Supervised Info-enhanced Contrastive Learning framework for EEG-based Emotion Recognition. SI-CLEER employs multi-granularity contrastive learning to create robust EEG contextual representations, potentially improving emotion recognition effectiveness. Unlike existing methods guided solely by classification loss, we propose a joint learning model combining a self-supervised contrastive learning loss and a supervised classification loss. This model optimizes both loss functions, capturing subtle EEG signal differences specific to emotion detection. Extensive experiments demonstrate SI-CLEER's robustness and superior accuracy on the SEED dataset compared to state-of-the-art methods. Furthermore, we analyze electrode performance, highlighting the significance of central frontal and temporal brain-region EEGs in emotion detection. This study offers a universally applicable approach with potential benefits for diverse EEG classification tasks.
https://arxiv.org/abs/2405.07260