Addressing the challenges of rare diseases is difficult, especially with the limited number of reference images and a small patient population. This is even more evident in rare skin diseases, where long-tailed data distributions make it difficult to develop unbiased and broadly effective models. The diverse ways in which image datasets are gathered, and their distinct purposes, add to these challenges. Our study conducts a detailed examination of the benefits and drawbacks of episodic and conventional training methodologies, adopting a few-shot learning approach alongside transfer learning. We evaluated our models on the ISIC2018, Derm7pt, and SD-198 datasets. With minimal labeled examples, our models showed substantial information gains and better performance compared to previously trained models. Our research emphasizes the improved feature-representation ability of DenseNet121 and MobileNetV2 models, achieved by using models pre-trained on ImageNet to increase within-class similarity. Moreover, our experiments, ranging from 2-way to 5-way classification with up to 10 examples, showed a growing success rate for traditional transfer learning methods as the number of examples increased. Adding data augmentation techniques significantly improved the performance of our transfer-learning-based models, leading to higher performance than existing methods, especially on the SD-198 and ISIC2018 datasets. All source code related to this work will be made publicly available soon at the provided URL.
https://arxiv.org/abs/2404.16814
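To make the few-shot transfer setup above concrete, here is a minimal sketch of one episode: an ImageNet-pretrained DenseNet121 is used as a frozen feature extractor and query images are classified by cosine similarity to class prototypes. The prototype classifier and the dummy tensors standing in for real skin-lesion images are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: N-way K-shot classification with an ImageNet-pretrained DenseNet121
# feature extractor and nearest-prototype classification (illustrative, not the paper's code).
import torch
import torch.nn.functional as F
from torchvision import models

def extract_features(backbone, images):
    """Return L2-normalized embeddings for a batch of images."""
    with torch.no_grad():
        feats = backbone(images)                      # (B, 1024) for DenseNet121
    return F.normalize(feats, dim=-1)

def episode_accuracy(backbone, support_x, support_y, query_x, query_y, n_way):
    """Classify query images by cosine similarity to per-class prototypes."""
    s_feats = extract_features(backbone, support_x)
    q_feats = extract_features(backbone, query_x)
    prototypes = F.normalize(torch.stack(
        [s_feats[support_y == c].mean(dim=0) for c in range(n_way)]), dim=-1)
    preds = (q_feats @ prototypes.t()).argmax(dim=-1)
    return (preds == query_y).float().mean().item()

if __name__ == "__main__":
    backbone = models.densenet121(weights=models.DenseNet121_Weights.IMAGENET1K_V1)
    backbone.classifier = torch.nn.Identity()          # keep the 1024-d pooled features
    backbone.eval()
    n_way, k_shot, n_query = 5, 10, 15                 # a 5-way 10-shot episode
    support_x = torch.randn(n_way * k_shot, 3, 224, 224)   # stand-in for skin images
    support_y = torch.arange(n_way).repeat_interleave(k_shot)
    query_x = torch.randn(n_query, 3, 224, 224)
    query_y = torch.randint(0, n_way, (n_query,))
    print("episode accuracy:", episode_accuracy(
        backbone, support_x, support_y, query_x, query_y, n_way))
```

A conventional transfer learning baseline would instead fine-tune a linear head (optionally with data augmentation) on the same support images.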
Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism called "Adding Attributes to Prompt Learning", AAPL, we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
https://arxiv.org/abs/2404.16804
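The core of CoOp-style prompt learning referenced above can be sketched in a few lines: a set of learnable context vectors is prepended to frozen class-name embeddings and optimized so the resulting text features match frozen image features. The toy text encoder and dimensions below are placeholders, and AAPL's adversarial token embedding and attribute mechanism are not reproduced.

```python
# Minimal sketch of CoOp-style prompt learning: only the shared context vectors are trained.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, n_classes, n_ctx=16, dim=512):
        super().__init__()
        # Shared learnable context (replaces a hand-written prompt such as "a photo of a ...").
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Frozen class-name embeddings (would come from CLIP's token embedding layer).
        self.register_buffer("cls_emb", torch.randn(n_classes, 1, dim))

    def forward(self):
        n_classes = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_classes, -1, -1)
        return torch.cat([ctx, self.cls_emb], dim=1)      # (C, n_ctx + 1, dim)

def text_features(prompts, text_encoder):
    # Toy "text encoder": mean-pool the prompt tokens; CLIP would use a frozen transformer.
    return F.normalize(text_encoder(prompts).mean(dim=1), dim=-1)

if __name__ == "__main__":
    n_classes, dim = 10, 512
    learner = PromptLearner(n_classes, dim=dim)
    text_encoder = nn.Linear(dim, dim)                    # stand-in for the frozen encoder
    image_feats = F.normalize(torch.randn(4, dim), dim=-1)   # frozen CLIP image features
    labels = torch.randint(0, n_classes, (4,))
    optim = torch.optim.Adam(learner.parameters(), lr=2e-3)  # only the context updates
    for _ in range(5):
        logits = 100.0 * image_feats @ text_features(learner(), text_encoder).t()
        loss = F.cross_entropy(logits, labels)
        optim.zero_grad(); loss.backward(); optim.step()
    print("context shape:", learner.ctx.shape, "final loss:", loss.item())
```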
Multi-modal foundation models such as CLIP have showcased impressive zero-shot capabilities. However, their applicability in resource-constrained environments is limited due to their large number of parameters and high inference time. While existing approaches have scaled down the entire CLIP architecture, we focus on training smaller variants of the image encoder, which suffices for efficient zero-shot classification. The use of synthetic data has shown promise in distilling representations from larger teachers, resulting in strong few-shot and linear probe performance. However, we find that this approach surprisingly fails in true zero-shot settings when using contrastive losses. We identify the exploitation of spurious features as being responsible for poor generalization between synthetic and real data. However, by using the image feature-based L2 distillation loss, we mitigate these problems and train students that achieve zero-shot performance which on four domain-specific datasets is on-par with a ViT-B/32 teacher model trained on DataCompXL, while featuring up to 92% fewer parameters.
https://arxiv.org/abs/2404.16637
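The distillation objective described above is simple to state in code: the student regresses the teacher's image embeddings with an L2 loss instead of a contrastive one. The tiny CNN student and the random tensors standing in for synthetic images and CLIP ViT-B/32 teacher features are assumptions for illustration.

```python
# Minimal sketch of image-feature L2 distillation into a small student image encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyStudent(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def l2_distill_loss(student_feats, teacher_feats):
    # Plain feature regression; no negatives, so spurious-feature shortcuts that
    # hurt contrastive distillation on synthetic data matter less.
    return F.mse_loss(student_feats, F.normalize(teacher_feats, dim=-1))

if __name__ == "__main__":
    student = TinyStudent()
    optim = torch.optim.AdamW(student.parameters(), lr=1e-3)
    for step in range(3):
        images = torch.randn(8, 3, 224, 224)           # stand-in for a synthetic batch
        with torch.no_grad():
            teacher_feats = torch.randn(8, 512)         # stand-in for frozen CLIP features
        loss = l2_distill_loss(student(images), teacher_feats)
        optim.zero_grad(); loss.backward(); optim.step()
        print(f"step {step}: loss={loss.item():.4f}")
```

Zero-shot classification with such a student then reuses the teacher's text embeddings unchanged.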
This paper presents the multi-speaker multi-lingual few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix it with the pre-training data, and adopt a speaker-balanced sampling strategy to guarantee effective fine-tuning for target speakers. The official evaluations in track 1 show that our system achieves the best speaker similarity MOS of 4.25 and obtains a considerable naturalness MOS of 3.97.
https://arxiv.org/abs/2404.16619
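A speaker-balanced sampling strategy like the one mentioned above can be implemented with inverse-frequency weights, so that the handful of few-shot target-speaker utterances are not drowned out by the pre-training corpus. The speaker lists below are invented for illustration.

```python
# Minimal sketch of speaker-balanced sampling when mixing few-shot and pre-training data.
from collections import Counter
from torch.utils.data import WeightedRandomSampler

def speaker_balanced_sampler(speaker_ids, num_samples):
    counts = Counter(speaker_ids)
    weights = [1.0 / counts[s] for s in speaker_ids]   # rare speakers get larger weight
    return WeightedRandomSampler(weights, num_samples=num_samples, replacement=True)

if __name__ == "__main__":
    pretrain_speakers = ["spk_a"] * 5000 + ["spk_b"] * 4000
    fewshot_speakers = ["target_1"] * 50 + ["target_2"] * 50   # denoised few-shot data
    speakers = pretrain_speakers + fewshot_speakers
    sampler = speaker_balanced_sampler(speakers, num_samples=1000)
    drawn = Counter(speakers[i] for i in sampler)
    print(drawn)   # roughly equal counts per speaker despite the imbalance
```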
Pre-trained large text-to-image (T2I) models with an appropriate text prompt have attracted growing interest in the customized image generation field. However, the catastrophic forgetting issue makes it hard to continually synthesize new user-provided styles while retaining satisfying results on previously learned styles. In this paper, we propose MuseumMaker, a method that enables the synthesis of images following a set of customized styles in a never-ending manner, gradually accumulating these creative artistic works as a Museum. When facing a new customization style, we develop a style distillation loss module to transfer the style of the whole dataset into the generation of images. It minimizes the learning bias caused by the content of images and addresses the catastrophic overfitting issue induced by few-shot images. To deal with catastrophic forgetting among past learned styles, we devise a dual regularization for the shared-LoRA module to optimize the direction of model updates, which regularizes the diffusion model from both the weight and feature aspects, respectively. Meanwhile, a unique token embedding corresponding to the new style is learned by a task-wise token learning module, which preserves historical knowledge from past styles within the limited LoRA parameter budget. As new user-provided styles arrive, our MuseumMaker can capture the nuances of the new styles while maintaining the details of learned styles. Experimental results on diverse style datasets validate the effectiveness of our proposed MuseumMaker method, showcasing its robustness and versatility across various scenarios.
https://arxiv.org/abs/2404.16612
Few-shot image synthesis entails generating diverse and realistic images of novel categories using only a few example images. While multiple recent efforts in this direction have achieved impressive results, the existing approaches are dependent only upon the few novel samples available at test time in order to generate new images, which restricts the diversity of the generated images. To overcome this limitation, we propose Conditional Distribution Modelling (CDM) -- a framework which effectively utilizes Diffusion models for few-shot image generation. By modelling the distribution of the latent space used to condition a Diffusion process, CDM leverages the learnt statistics of the training data to get a better approximation of the unseen class distribution, thereby removing the bias arising due to limited number of few shot samples. Simultaneously, we devise a novel inversion based optimization strategy that further improves the approximated unseen class distribution, and ensures the fidelity of the generated samples to the unseen class. The experimental results on four benchmark datasets demonstrate the effectiveness of our proposed CDM for few-shot generation.
https://arxiv.org/abs/2404.16556
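One way to picture conditional distribution modelling is as fitting a Gaussian to the few support latents of a novel class and calibrating it with statistics learned on the training data before sampling new conditioning latents. The shrinkage rule, dimensions, and statistics below are assumptions, not the paper's exact formulation.

```python
# Rough sketch: estimate a novel-class latent distribution from few shots, calibrated
# with base-class statistics, then sample conditioning latents for a diffusion model.
import torch

def estimate_class_distribution(support_latents, base_mean, base_cov, alpha=0.5):
    mu = support_latents.mean(dim=0)
    centered = support_latents - mu
    cov = centered.t() @ centered / max(support_latents.shape[0] - 1, 1)
    # Calibrate the scarce few-shot statistics with base-class statistics.
    mu_hat = alpha * mu + (1 - alpha) * base_mean
    cov_hat = alpha * cov + (1 - alpha) * base_cov + 1e-4 * torch.eye(cov.shape[0])
    return mu_hat, cov_hat

def sample_conditioning_latents(mu, cov, n):
    dist = torch.distributions.MultivariateNormal(mu, covariance_matrix=cov)
    return dist.sample((n,))          # latents that would condition the diffusion sampler

if __name__ == "__main__":
    d = 16
    support = torch.randn(5, d)                          # 5-shot latents of a novel class
    base_mean, base_cov = torch.zeros(d), torch.eye(d)   # training-set statistics
    mu, cov = estimate_class_distribution(support, base_mean, base_cov)
    print(sample_conditioning_latents(mu, cov, 4).shape)  # -> torch.Size([4, 16])
```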
Recent advances in Vision and Language Models (VLMs) have improved open-world 3D representation, facilitating 3D zero-shot capability in unseen categories. Existing open-world methods pre-train an extra 3D encoder to align features from 3D data (e.g., depth maps or point clouds) with CAD-rendered images and corresponding texts. However, the limited color and texture variations in CAD images can compromise the alignment robustness. Furthermore, the volume discrepancy between the pre-training datasets of the 3D encoder and the VLM leads to sub-optimal 2D to 3D knowledge transfer. To overcome these issues, we propose OpenDlign, a novel framework for learning open-world 3D representations that leverages depth-aligned images generated from point cloud-projected depth maps. Unlike CAD-rendered images, our generated images provide rich, realistic color and texture diversity while preserving geometric and semantic consistency with the depth maps. OpenDlign also optimizes depth map projection and integrates depth-specific text prompts, improving 2D VLM knowledge adaptation for 3D learning with efficient fine-tuning. Experimental results show that OpenDlign significantly outperforms existing benchmarks in zero-shot and few-shot 3D tasks, exceeding prior scores by 8.0% on ModelNet40 and 16.4% on OmniObject3D with just 6 million tuned parameters. Moreover, integrating generated depth-aligned images into existing 3D learning pipelines consistently improves their performance.
https://arxiv.org/abs/2404.16538
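The starting point of the pipeline above, projecting a point cloud into a depth map from a chosen viewpoint, can be sketched as a simple pinhole projection with a z-buffer. The camera intrinsics and toy point cloud are illustrative; the subsequent depth-aligned image generation and VLM alignment are not shown.

```python
# Minimal sketch: project a point cloud to a depth map with a pinhole camera model.
import numpy as np

def project_to_depth_map(points, fx=200.0, fy=200.0, cx=112.0, cy=112.0, hw=(224, 224)):
    """points: (N, 3) in camera coordinates with z > 0; returns an (H, W) depth map."""
    H, W = hw
    depth = np.full((H, W), np.inf)
    z = points[:, 2]
    u = np.round(fx * points[:, 0] / z + cx).astype(int)
    v = np.round(fy * points[:, 1] / z + cy).astype(int)
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    for ui, vi, zi in zip(u[valid], v[valid], z[valid]):
        depth[vi, ui] = min(depth[vi, ui], zi)     # keep the nearest surface (z-buffer)
    depth[np.isinf(depth)] = 0.0                   # background / no point hit
    return depth

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy point cloud: a unit sphere pushed 3 units in front of the camera.
    pts = rng.normal(size=(5000, 3))
    pts /= np.linalg.norm(pts, axis=1, keepdims=True)
    pts[:, 2] += 3.0
    dm = project_to_depth_map(pts)
    print("depth range:", dm[dm > 0].min(), dm[dm > 0].max())
```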
In the context of imitation learning applied to dexterous robotic hands, the high complexity of the systems makes learning complex manipulation tasks challenging. However, the numerous datasets depicting human hands in various tasks could provide better knowledge of human hand motion. We propose a method that leverages multiple large-scale, task-agnostic datasets to obtain latent representations that effectively encode motion subtrajectories, which we incorporate into a transformer-based behavior cloning method. Our results demonstrate that employing latent representations yields enhanced performance compared to conventional behavior cloning methods, particularly regarding resilience to errors and noise in perception and proprioception. Furthermore, the proposed approach relies solely on human demonstrations, eliminating the need for teleoperation and therefore accelerating the data acquisition process. Accurate inverse kinematics for fingertip retargeting ensures precise transfer from human hand data to the robot, facilitating effective learning and deployment of manipulation policies. Finally, the trained policies have been successfully transferred to a real-world 23-DoF robotic system.
https://arxiv.org/abs/2404.16483
Machine learning has the potential to revolutionize passive acoustic monitoring (PAM) for ecological assessments. However, high annotation and compute costs limit the field's efficacy. Generalizable pretrained networks can overcome these costs, but high-quality pretraining requires vast annotated libraries, limiting its current applicability primarily to bird taxa. Here, we identify the optimum pretraining strategy for a data-deficient domain using coral reef bioacoustics. We assemble ReefSet, a large annotated library of reef sounds, though modest compared to bird libraries at 2% of the sample count. Through testing few-shot transfer learning performance, we observe that pretraining on bird audio provides notably superior generalizability compared to pretraining on ReefSet or unrelated audio alone. However, our key findings show that cross-domain mixing which leverages bird, reef and unrelated audio during pretraining maximizes reef generalizability. SurfPerch, our pretrained network, provides a strong foundation for automated analysis of marine PAM data with minimal annotation and compute costs.
https://arxiv.org/abs/2404.16436
Prompt learning has become the most effective paradigm for adapting large pre-trained vision-language models (VLMs) to downstream tasks. Recently, unsupervised prompt tuning methods, such as UPL and POUF, directly leverage pseudo-labels as supervisory information to fine-tune additional adaptation modules on unlabeled data. However, inaccurate pseudo-labels easily misguide the tuning process and result in poor representation capabilities. In light of this, we propose Training-Free Unsupervised Prompts (TFUP), which maximally preserves the inherent representation capabilities and enhances them with a residual connection to similarity-based prediction probabilities in a training-free and labeling-free manner. Specifically, we integrate both instance confidence and prototype scores to select representative samples, which are used to customize a reliable Feature Cache Model (FCM) for training-free inference. Then, we design a Multi-level Similarity Measure (MSM) that considers both feature-level and semantic-level similarities to calculate the distance between each test image and the cached samples as the weight of the corresponding cached labels, generating similarity-based prediction probabilities. In this way, TFUP achieves surprising performance, even surpassing training-based methods on multiple classification datasets. Building on TFUP, we propose a training-based approach (TFUP-T) to further boost the adaptation performance. In addition to the standard cross-entropy loss, TFUP-T adopts an additional marginal distribution entropy loss to constrain the model from a global perspective. Our TFUP-T achieves new state-of-the-art classification performance compared to unsupervised and few-shot adaptation approaches on multiple benchmarks. In particular, TFUP-T improves the classification accuracy of POUF by 3.3% on the most challenging Domain-Net dataset.
https://arxiv.org/abs/2404.16339
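A training-free feature cache in the spirit of the FCM above can be sketched as follows: confident unlabeled samples are stored as (feature, pseudo-label) pairs, and test-time predictions add a similarity-weighted vote from the cache to the zero-shot logits via a residual connection. The selection rule and dimensions are simplified assumptions; TFUP's prototype scores and multi-level similarity measure are not reproduced.

```python
# Minimal sketch of a training-free feature cache with a residual connection to zero-shot logits.
import torch
import torch.nn.functional as F

def build_cache(feats, zero_shot_logits, top_k=4):
    """Keep the top-k most confident samples per pseudo-class as the cache."""
    probs = zero_shot_logits.softmax(dim=-1)
    conf, pseudo = probs.max(dim=-1)
    keys, values = [], []
    for c in range(zero_shot_logits.shape[1]):
        idx = (pseudo == c).nonzero(as_tuple=True)[0]
        if len(idx) == 0:
            continue
        idx = idx[conf[idx].argsort(descending=True)[:top_k]]
        keys.append(feats[idx])
        values.append(F.one_hot(pseudo[idx], zero_shot_logits.shape[1]).float())
    return torch.cat(keys), torch.cat(values)

def cache_predict(test_feats, keys, values, zero_shot_logits, alpha=1.0, beta=5.0):
    sim = test_feats @ keys.t()                         # cosine similarity (features normalized)
    cache_logits = (beta * (sim - 1)).exp() @ values    # similarity-weighted cached labels
    return zero_shot_logits + alpha * cache_logits      # residual connection

if __name__ == "__main__":
    n_cls, d = 10, 512
    unl_feats = F.normalize(torch.randn(200, d), dim=-1)
    unl_zs = torch.randn(200, n_cls)                    # stand-in for CLIP zero-shot logits
    keys, values = build_cache(unl_feats, unl_zs)
    test_feats = F.normalize(torch.randn(8, d), dim=-1)
    test_zs = torch.randn(8, n_cls)
    print(cache_predict(test_feats, keys, values, test_zs).shape)   # -> (8, 10)
```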
Electronic health records (EHRs), even though a boon for healthcare practitioners, are growing more convoluted and longer every day. Sifting through these lengthy EHRs is taxing and becomes a cumbersome part of physician-patient interaction. Several approaches have been proposed to help alleviate this prevalent issue, either via summarization or sectioning; however, only a few approaches have truly been helpful in the past. With the rise of automated methods, machine learning (ML) has shown promise in solving the task of identifying relevant sections in EHRs. However, most ML methods rely on labeled data, which is difficult to obtain in healthcare. Large language models (LLMs), on the other hand, have performed impressive feats in natural language processing (NLP), and in a zero-shot manner, i.e., without any labeled data. To that end, we propose using LLMs to identify relevant section headers. We find that GPT-4 can effectively solve the task in both zero- and few-shot settings and segments dramatically better than state-of-the-art methods. Additionally, we annotate a much harder real-world dataset and find that GPT-4 struggles to perform well, alluding to further research and harder benchmarks.
https://arxiv.org/abs/2404.16294
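A few-shot prompt for this kind of section-header identification might be assembled as below; the example notes, header names, and wording are invented for demonstration, and dropping the demonstrations gives the zero-shot variant.

```python
# Illustrative few-shot prompt construction for identifying section headers in a clinical note.
FEW_SHOT_EXAMPLES = [
    ("Chief Complaint: chest pain\nHPI: 54M with 2 days of exertional chest pain...",
     ["Chief Complaint", "HPI"]),
    ("Medications: lisinopril 10mg\nAssessment and Plan: likely GERD, start PPI...",
     ["Medications", "Assessment and Plan"]),
]

def build_messages(note_text):
    messages = [{"role": "system",
                 "content": "You identify the section headers present in a clinical "
                            "note. Return a JSON list of header strings."}]
    for example_note, headers in FEW_SHOT_EXAMPLES:       # few-shot demonstrations
        messages.append({"role": "user", "content": example_note})
        messages.append({"role": "assistant", "content": str(headers)})
    messages.append({"role": "user", "content": note_text})
    return messages

if __name__ == "__main__":
    note = "Allergies: penicillin\nPhysical Exam: afebrile, lungs clear..."
    for m in build_messages(note):
        print(m["role"].upper(), ":", m["content"][:70])
    # These messages would then be sent to a chat-completion endpoint (e.g. GPT-4);
    # with an empty FEW_SHOT_EXAMPLES list the same template gives the zero-shot setup.
```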
While the field of medical image analysis has undergone a transformative shift with the integration of machine learning techniques, the main challenge of these techniques is often the scarcity of large, diverse, and well-annotated datasets. Medical images vary in format, size, and other parameters, and therefore require extensive preprocessing and standardization for use in machine learning. Addressing these challenges, we introduce the Medical Imaging Meta-Dataset (MedIMeta), a novel multi-domain, multi-task meta-dataset. MedIMeta contains 19 medical imaging datasets spanning 10 different domains and encompassing 54 distinct medical tasks, all of which are standardized to the same format and readily usable in PyTorch or other ML frameworks. We perform a technical validation of MedIMeta, demonstrating its utility through fully supervised and cross-domain few-shot learning baselines.
https://arxiv.org/abs/2404.16000
Recent advances in generative AI have led to the development of techniques to generate visually realistic synthetic video. While a number of techniques have been developed to detect AI-generated synthetic images, in this paper we show that synthetic image detectors are unable to detect synthetic videos. We demonstrate that this is because synthetic video generators introduce substantially different traces than those left by image generators. Despite this, we show that synthetic video traces can be learned, and used to perform reliable synthetic video detection or generator source attribution even after H.264 re-compression. Furthermore, we demonstrate that while detecting videos from new generators through zero-shot transferability is challenging, accurate detection of videos from a new generator can be achieved through few-shot learning.
https://arxiv.org/abs/2404.15955
Although the fusion of information from multiple views of mammograms plays an important role in increasing the accuracy of breast cancer detection, developing multi-view mammogram-based computer-aided diagnosis (CAD) schemes still faces challenges, and no such CAD schemes have been used in clinical practice. To overcome the challenges, we investigate a new approach based on Contrastive Language-Image Pre-training (CLIP), which has sparked interest across various medical imaging tasks. By solving the challenges of (1) effectively adapting the single-view CLIP for multi-view feature fusion and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources, we introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms and corresponding simple texts. Mammo-CLIP uses an early feature fusion strategy to learn multi-view relationships in four mammograms acquired from the CC and MLO views of the left and right breasts. To enhance learning efficiency, plug-and-play adapters are added into the CLIP image and text encoders for fine-tuning, limiting updates to about 1% of the parameters. For framework evaluation, we retrospectively assembled two datasets. The first dataset, comprising 470 malignant and 479 benign cases, was used for few-shot fine-tuning and internal evaluation of the proposed Mammo-CLIP via 5-fold cross-validation. The second dataset, including 60 malignant and 294 benign cases, was used to test the generalizability of Mammo-CLIP. Study results show that Mammo-CLIP outperforms the state-of-the-art cross-view transformer in AUC (0.841 vs. 0.817, 0.837 vs. 0.807) on both datasets. It also surpasses the two previous CLIP-based methods by 20.3% and 14.3%. This study highlights the potential of applying fine-tuned vision-language models to developing next-generation, image-text-based CAD schemes for breast cancer.
https://arxiv.org/abs/2404.15946
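A plug-and-play adapter of the kind described above is essentially a small bottleneck MLP with a residual connection inserted after frozen encoder blocks, so that only a small fraction of parameters is trained. The stand-in encoder below is not CLIP, and the four-view early fusion is not shown.

```python
# Minimal sketch of a residual bottleneck adapter keeping roughly 1% of parameters trainable.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight); nn.init.zeros_(self.up.bias)   # start as identity

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual keeps pretrained behaviour

if __name__ == "__main__":
    dim = 768
    encoder = nn.Sequential(*[nn.Linear(dim, dim) for _ in range(12)])  # frozen stand-in
    for p in encoder.parameters():
        p.requires_grad_(False)
    adapter = Adapter(dim)
    feats = adapter(encoder(torch.randn(4, dim)))
    total = sum(p.numel() for p in encoder.parameters()) + sum(p.numel() for p in adapter.parameters())
    trainable = sum(p.numel() for p in adapter.parameters())
    print(feats.shape, f"trainable fraction: {100 * trainable / total:.2f}%")   # ~1-2%
```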
Individual feedback can help students improve their essay writing skills. However, the manual effort required to provide such feedback limits individualization in practice. Automatically-generated essay feedback may serve as an alternative to guide students at their own pace, convenience, and desired frequency. Large language models (LLMs) have demonstrated strong performance in generating coherent and contextually relevant text. Yet, their ability to provide helpful essay feedback is unclear. This work explores several prompting strategies for LLM-based zero-shot and few-shot generation of essay feedback. Inspired by Chain-of-Thought prompting, we study how and to what extent automated essay scoring (AES) can benefit the quality of generated feedback. We evaluate both the AES performance that LLMs can achieve with prompting only and the helpfulness of the generated essay feedback. Our results suggest that tackling AES and feedback generation jointly improves AES performance. However, while our manual evaluation emphasizes the quality of the generated essay feedback, the impact of essay scoring on the generated feedback remains low ultimately.
https://arxiv.org/abs/2404.15845
The emergence of diffusion models has greatly propelled the progress in image and video generation. Recently, some efforts have been made in controllable video generation, including text-to-video generation and video motion control, among which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module and necessitate substantial computation resources due to the large number of parameters in video generation models. Moreover, existing methods pre-define camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model, which disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method to extract camera motion from a single source video, which separates the moving objects from the background and estimates the camera motion in the moving-object region from the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method to extract the common camera motion from multiple videos with similar camera motions, which employs a window-based clustering technique to extract the common features in the temporal attention maps of multiple videos. Finally, we propose a motion combination method to combine different types of camera motions, enabling more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera and object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.
https://arxiv.org/abs/2404.15789
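The idea of recovering camera motion inside the moving-object region from the surrounding background motion by solving a Poisson equation can be illustrated with a toy Laplace inpainting of a flow field via Jacobi iterations. The flow field and mask below are placeholders; COMD actually operates on temporal attention maps of a video diffusion model.

```python
# Toy sketch: propagate background motion into the masked moving-object region
# by iterating the discrete Laplacian fixed point (homogeneous Poisson inpainting).
import numpy as np

def inpaint_camera_motion(flow, mask, n_iter=500):
    """flow: (H, W, 2) motion field, mask: (H, W) True where objects move."""
    filled = flow.copy()
    filled[mask] = 0.0
    for _ in range(n_iter):
        up    = np.roll(filled,  1, axis=0)
        down  = np.roll(filled, -1, axis=0)
        left  = np.roll(filled,  1, axis=1)
        right = np.roll(filled, -1, axis=1)
        avg = (up + down + left + right) / 4.0   # average of the four neighbours
        filled[mask] = avg[mask]                 # only update inside the object region
    return filled

if __name__ == "__main__":
    H, W = 32, 32
    flow = np.tile(np.array([1.0, 0.5]), (H, W, 1))    # a global panning camera motion
    mask = np.zeros((H, W), dtype=bool)
    mask[10:20, 12:22] = True                          # a moving object occludes it
    flow[mask] = np.array([5.0, -3.0])                 # object motion to be replaced
    recovered = inpaint_camera_motion(flow, mask)
    print(np.allclose(recovered[mask], [1.0, 0.5], atol=1e-2))   # camera motion recovered
```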
Humans effortlessly interpret images by parsing them into part-whole hierarchies; deep learning excels at learning multi-level feature spaces but often lacks explicit coding of part-whole relations, a prominent property of medical imaging. To overcome this limitation, we introduce Adam-v2, a new self-supervised learning framework extending Adam [79] by explicitly incorporating part-whole hierarchies into its learning objectives through three key branches: (1) Localizability, acquiring discriminative representations to distinguish different anatomical patterns; (2) Composability, learning each anatomical structure in a parts-to-whole manner; and (3) Decomposability, comprehending each anatomical structure in a whole-to-parts manner. Experimental results across 10 tasks, compared to 11 baselines in zero-shot, few-shot transfer, and full fine-tuning settings, showcase Adam-v2's superior performance over large-scale medical models and existing SSL methods across diverse downstream tasks. The higher generality and robustness of Adam-v2's representations originate from its explicit construction of hierarchies for distinct anatomical structures from unlabeled medical images. Adam-v2 preserves a semantic balance of anatomical diversity and harmony in its embedding, yielding representations that are both generic and semantically meaningful, a property overlooked in existing SSL methods. All code and pretrained models are available at this https URL.
https://arxiv.org/abs/2404.15672
Systematic review (SR) is a popular research method in software engineering (SE). However, conducting an SR takes an average of 67 weeks. Thus, automating any step of the SR process could reduce the effort associated with SRs. Our objective is to investigate if Large Language Models (LLMs) can accelerate title-abstract screening by simplifying abstracts for human screeners, and automating title-abstract screening. We performed an experiment where humans screened titles and abstracts for 20 papers with both original and simplified abstracts from a prior SR. The experiment with human screeners was reproduced with GPT-3.5 and GPT-4 LLMs to perform the same screening tasks. We also studied if different prompting techniques (Zero-shot (ZS), One-shot (OS), Few-shot (FS), and Few-shot with Chain-of-Thought (FS-CoT)) improve the screening performance of LLMs. Lastly, we studied if redesigning the prompt used in the LLM reproduction of screening leads to improved performance. Text simplification did not increase the screeners' screening performance, but reduced the time used in screening. Screeners' scientific literacy skills and researcher status predict screening performance. Some LLM and prompt combinations perform as well as human screeners in the screening tasks. Our results indicate that the GPT-4 LLM is better than its predecessor, GPT-3.5. Additionally, Few-shot and One-shot prompting outperforms Zero-shot prompting. Using LLMs for text simplification in the screening process does not significantly improve human performance. Using LLMs to automate title-abstract screening seems promising, but current LLMs are not significantly more accurate than human screeners. To recommend the use of LLMs in the screening process of SRs, more research is needed. We recommend future SR studies publish replication packages with screening data to enable more conclusive experimenting with LLM screening.
https://arxiv.org/abs/2404.15667
Recent advancements in instruction-following models have made user interactions with models more user-friendly and efficient, broadening their applicability. In graphic design, non-professional users often struggle to create visually appealing layouts due to limited skills and resources. In this work, we introduce a novel multimodal instruction-following framework for layout planning, allowing users to easily arrange visual elements into tailored layouts by specifying canvas size and design purpose, such as for book covers, posters, brochures, or menus. We developed three layout reasoning tasks to train the model in understanding and executing layout instructions. Experiments on two benchmarks show that our method not only simplifies the design process for non-professionals but also surpasses the performance of few-shot GPT-4V models, with mIoU higher by 12% on Crello. This progress highlights the potential of multimodal instruction-following models to automate and simplify the design process, providing an approachable solution for a wide range of design tasks on visually-rich documents.
https://arxiv.org/abs/2404.15271
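The mIoU figure quoted above is a mean intersection-over-union between predicted and ground-truth element boxes on the canvas; a minimal version, assuming elements are matched by index, looks like this.

```python
# Minimal sketch of mean IoU between predicted and ground-truth layout boxes.
def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2) on the canvas."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def mean_iou(pred_boxes, gt_boxes):
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)

if __name__ == "__main__":
    gt   = [(0, 0, 400, 120), (50, 150, 350, 500)]     # e.g. a title block and a body image
    pred = [(0, 10, 390, 130), (60, 160, 360, 510)]
    print(f"mIoU = {mean_iou(pred, gt):.3f}")
```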
Natural language generation tools are powerful and effective for generating content. However, language models are known to display bias and fairness issues, making them impractical to deploy for many use cases. We here focus on how fairness issues impact automatically generated test content, which can have stringent requirements to ensure the test measures only what it was intended to measure. Specifically, we identify test content that is focused on particular domains and experiences that only reflect a certain demographic or that are potentially emotionally upsetting; both of which could inadvertently impact a test-taker's score. This kind of content doesn't reflect typical biases out of context, making it challenging even for modern models that contain safeguards. We build a dataset of 621 generated texts annotated for fairness and explore a variety of methods for classification: fine-tuning, topic-based classification, and prompting, including few-shot and self-correcting prompts. We find that combining prompt self-correction and few-shot learning performs best, yielding an F1 score of .791 on our held-out test set, while much smaller BERT- and topic-based models have competitive performance on out-of-domain data.
https://arxiv.org/abs/2404.15104
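The best-performing combination above, few-shot examples plus a self-correcting second pass, can be sketched as a two-step prompting loop. The labels, examples, and the stubbed call_llm function are invented for illustration; in practice this would call a chat LLM API.

```python
# Rough sketch: few-shot classification prompt followed by a self-correcting re-check.
FEW_SHOT = [
    ("The passage describes a luxury ski vacation in detail.", "unfair"),
    ("The passage explains how plants convert sunlight into energy.", "fair"),
]

def build_prompt(item):
    shots = "\n".join(f"Text: {t}\nLabel: {l}" for t, l in FEW_SHOT)
    return (f"Label generated test content as 'fair' or 'unfair' "
            f"(demographically narrow or potentially upsetting).\n{shots}\nText: {item}\nLabel:")

def classify_with_self_correction(item, call_llm):
    first = call_llm(build_prompt(item))
    check = call_llm(f"You labeled the text below as '{first}'. Re-examine it and reply "
                     f"with the corrected label only ('fair' or 'unfair').\nText: {item}")
    return check.strip()

if __name__ == "__main__":
    def call_llm(prompt):          # stub standing in for an actual LLM call
        return "unfair" if "yacht" in prompt or "ski" in prompt else "fair"
    print(classify_with_self_correction("A question about chartering a private yacht.", call_llm))
```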