AI-generated video has revolutionized short video production, filmmaking, and personalized media, making local video editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To address this urgent issue, V2A-Mark is proposed to overcome the limitations of current video tampering forensics, such as poor generalizability, limited functionality, and a single-modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance the localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, which is crucial for the sustainable development of video editing in the AIGC video era.
https://arxiv.org/abs/2404.16824
Developing generalist foundation models has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which underscores the need for open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models to extend the original dataset (over 25,692 non-contrast 3D chest CT volumes and reports from 20,000 patients) along the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide visual clues for intermediate reasoning during interpretation; (ii) 665 K multi-granularity grounded reports, where each sentence of the report is linked to the corresponding anatomical region of the CT volume in the form of a segmentation mask; (iii) 1.3 M grounded VQA pairs, where questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have gone through manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models by training them to generate texts based on given segmentation regions, which is unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.
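As a rough illustration of the grounded-report idea described above, the sketch below assembles a record that links each report sentence to the segmentation mask of the anatomical region it describes. The field names and mask-identifier format are assumptions for illustration, not the released schema.

```python
# Hypothetical sketch of a grounded report entry: each sentence is linked
# to an anatomical region via a segmentation-mask identifier. Field names
# and the mask_id format are illustrative assumptions.

def build_grounded_report(volume_id, sentence_region_pairs):
    """Assemble a grounded report: one record per sentence, each linked
    to the mask of the anatomical region it describes."""
    return {
        "volume_id": volume_id,
        "sentences": [
            {"text": text, "region": region, "mask_id": f"{volume_id}/{region}"}
            for text, region in sentence_region_pairs
        ],
    }

report = build_grounded_report(
    "ct_000123",
    [("No pleural effusion is seen.", "pleura"),
     ("The heart size is normal.", "heart")],
)
```

A grounded VQA pair could reference the same `mask_id` fields, so that answers remain tied to visual evidence.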
https://arxiv.org/abs/2404.16754
Modeling non-stationary data is a challenging problem in the field of continual learning, and data distribution shifts may degrade the performance of a machine learning model. Classic learning tools are often vulnerable to perturbations of the input covariates, are sensitive to outliers and noise, and in some cases rest on rigid algebraic assumptions. Distribution shifts frequently occur due to changes in raw materials for production, seasonality, a different user base, or even adversarial attacks. Therefore, there is a need for more effective distribution shift detection techniques. In this work, we propose a continual learning framework for monitoring and detecting distribution changes. We explore the problem in a latent space generated by bio-inspired self-organizing clustering, together with statistical properties of that latent space. In particular, we investigate the projections made by two topology-preserving maps: the Self-Organizing Map and the Scale Invariant Map. Our method can be applied in both a supervised and an unsupervised context. We frame the assessment of changes in the data distribution as a comparison of Gaussian signals, making the proposed method fast and robust. We compare it to other unsupervised techniques, specifically Principal Component Analysis (PCA) and Kernel-PCA. Our comparison involves experiments on sequences of images (based on MNIST, with shifts injected via adversarial samples), chemical sensor measurements, and an environmental variable related to ozone levels. The empirical study reveals the potential of the proposed approach.
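To make the "comparison of Gaussian signals" concrete, here is a minimal sketch, not the paper's exact procedure: treat a reference window of 1-D latent projections as a Gaussian, and flag a shift when a new window's mean deviates by more than `z_thresh` standard errors. The threshold and windowing are assumptions.

```python
import numpy as np

# Minimal drift-detection sketch: compare a reference window and a new
# window of latent projections as Gaussian signals. The z-score threshold
# and window handling are illustrative assumptions.

def gaussian_shift_detected(reference, window, z_thresh=3.0):
    ref = np.asarray(reference, dtype=float)
    win = np.asarray(window, dtype=float)
    mu, sigma = ref.mean(), ref.std(ddof=1)
    # standard error of the window mean under the reference Gaussian
    se = sigma / np.sqrt(len(win))
    z = abs(win.mean() - mu) / se
    return z > z_thresh

rng = np.random.default_rng(0)
ref = rng.normal(0.0, 1.0, 500)
same = rng.normal(0.0, 1.0, 100)       # drawn from the same distribution
shifted = rng.normal(1.5, 1.0, 100)    # mean-shifted distribution
```

Because only first- and second-order statistics of the windows are compared, the check is cheap enough to run continuously, which matches the speed argument made in the abstract.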
https://arxiv.org/abs/2404.16656
Beyond improving trust and validating model fairness, xAI practices also have the potential to recover valuable scientific insights in application domains where little to no prior human intuition exists. To that end, we propose a method to extract global concept explanations from the predictions of graph neural networks, to develop a deeper understanding of the structure-property relationships underlying the tasks. We identify concept explanations as dense clusters in the subgraph latent space of the self-explaining Megan model. For each concept, we optimize a representative prototype graph and optionally use GPT-4 to provide hypotheses about why each structure has a certain effect on the prediction. We conduct computational experiments on synthetic and real-world graph property prediction tasks. For the synthetic tasks, we find that our method correctly reproduces the structural rules by which they were created. For real-world molecular property regression and classification tasks, we find that our method rediscovers established rules of thumb. More specifically, our results for molecular mutagenicity prediction indicate more fine-grained resolution of structural details than existing explainability methods, consistent with previous results from the chemistry literature. Overall, our results show promising capability to extract the underlying structure-property relationships for complex graph property prediction tasks.
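The clustering step above can be illustrated with a toy stand-in. The paper identifies concepts as dense clusters in the subgraph latent space; the sketch below uses a tiny k-means (an assumption, not necessarily the paper's clustering algorithm) on 2-D points standing in for latent vectors, with each centroid playing the role of a concept.

```python
import numpy as np

# Illustrative only: a minimal k-means stands in for the clustering step
# that groups subgraph latent vectors into concepts. The real latent
# space is higher-dimensional and the paper's clustering may differ.

def kmeans(points, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # assign each latent vector to its nearest centroid
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its members (keep it if empty)
        centers = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

rng = np.random.default_rng(1)
blob_a = rng.normal([0.0, 0.0], 0.1, size=(40, 2))   # one dense cluster
blob_b = rng.normal([5.0, 5.0], 0.1, size=(40, 2))   # another dense cluster
labels, centers = kmeans(np.vstack([blob_a, blob_b]), k=2)
```

Each centroid would then seed the optimization of a representative prototype graph for its concept.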
https://arxiv.org/abs/2404.16532
Computational historical linguistics seeks to systematically understand processes of sound change, including during periods in which little to no formal record of the language is attested. At the same time, few computational resources exist that deeply explore phonological and morphological connections between proto-languages and their descendants. This is particularly true for the family of Italic languages. To assist historical linguists in the study of Italic sound change, we introduce the Proto-Italic to Latin (PILA) dataset, which consists of roughly 3,000 pairs of forms from Proto-Italic and Latin. We provide a detailed description of how our dataset was created and organized. Then, we exhibit PILA's value in two ways. First, we present baseline results for PILA on a pair of traditional computational historical linguistics tasks. Second, we demonstrate PILA's capability for enhancing other historical-linguistic datasets through a dataset compatibility study.
https://arxiv.org/abs/2404.16341
Infrared and visible image fusion (IVIF) aims to preserve thermal radiation information from infrared images while integrating texture details from visible images, enabling the capture of important features and hidden details of subjects in complex scenes and disturbed environments. Consequently, IVIF offers distinct advantages in practical applications such as video surveillance, night navigation, and target recognition. However, prevailing methods often face challenges in simultaneously capturing thermal region features and detailed information due to the disparate characteristics of infrared and visible images. Consequently, fusion outcomes frequently entail a compromise between thermal target area information and texture details. In this study, we introduce a novel heterogeneous dual-discriminator generative adversarial network (HDDGAN) to address this issue. Specifically, the generator is structured as a multi-scale skip-connected structure, facilitating the extraction of essential features from different source images. To enhance the information representation ability of the fusion result, an attention mechanism is employed to construct the information fusion layer within the generator, leveraging the disparities between the source images. Moreover, recognizing the distinct learning requirements of information in infrared and visible images, we design two discriminators with differing structures. This approach aims to guide the model to learn salient information from infrared images while simultaneously capturing detailed information from visible images. Extensive experiments conducted on various public datasets demonstrate the superiority of our proposed HDDGAN over other state-of-the-art (SOTA) algorithms, highlighting its enhanced potential for practical applications.
https://arxiv.org/abs/2404.15992
Geometry- and appearance-controlled full-body human image generation is an interesting but challenging task. Existing solutions are either unconditional or dependent on coarse conditions (e.g., pose, text), thus lacking explicit geometry and appearance control of body and garment. Sketching offers such editing ability and has been adopted in various sketch-based face generation and editing solutions. However, directly adapting sketch-based face generation to full-body generation often fails to produce high-fidelity and diverse results due to the high complexity and diversity in the pose, body shape, and garment shape and texture. Recent geometrically controllable diffusion-based methods mainly rely on prompts to generate appearance, and it is hard to balance the realism of their results against faithfulness to the sketch when the input is coarse. This work presents Sketch2Human, the first system for controllable full-body human image generation guided by a semantic sketch (for geometry control) and a reference image (for appearance control). Our solution is based on the latent space of StyleGAN-Human, with inverted geometry and appearance latent codes as input. Specifically, we present a sketch encoder trained with a large synthetic dataset sampled from StyleGAN-Human's latent space and directly supervised by sketches rather than real images. Considering the entangled information of partial geometry and texture in StyleGAN-Human and the absence of disentangled datasets, we design a novel training scheme that creates geometry-preserved and appearance-transferred training data to tune a generator toward disentangled geometry and appearance control. Although our method is trained with synthetic data, it can handle hand-drawn sketches as well. Qualitative and quantitative evaluations demonstrate the superior performance of our method over state-of-the-art methods.
https://arxiv.org/abs/2404.15889
The Sentinel-2 (S2) mission from the European Space Agency's Copernicus program provides essential data for Earth surface analysis. Its Level-2A products deliver high-to-medium resolution (10-60 m) surface reflectance (SR) data through the MultiSpectral Instrument (MSI). To enhance the accuracy and comparability of SR data, adjustments simulating a nadir viewing perspective are essential. These corrections address the anisotropic nature of SR and the variability in sun and observation angles, ensuring consistent image comparisons over time and under different conditions. The $c$-factor method, a simple yet effective algorithm, adjusts observed S2 SR by using the MODIS BRDF model to achieve Nadir BRDF Adjusted Reflectance (NBAR). Despite the straightforward application of the $c$-factor to individual images, a cohesive Python framework for its application across multiple S2 images and Earth System Data Cubes (ESDCs) from cloud-stored data has been lacking. Here we introduce sen2nbar, a Python package crafted to convert S2 SR data to NBAR, supporting both individual images and ESDCs derived from cloud-stored data. This package simplifies the conversion of S2 SR data to NBAR via a single function, organized into modules for efficient process management. By facilitating NBAR conversion for both SAFE files and ESDCs from SpatioTemporal Asset Catalogs (STAC), sen2nbar is developed as a flexible tool that can handle diverse data format requirements. We anticipate that sen2nbar will considerably contribute to the standardization and harmonization of S2 data, offering a robust solution for a diverse range of users across various applications. sen2nbar is an open-source tool available at this https URL.
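The $c$-factor idea above can be sketched in a few lines under stated assumptions: given the MODIS BRDF model parameters ($f_{iso}$, $f_{vol}$, $f_{geo}$) and the volumetric/geometric kernel values evaluated at the sensed geometry and at nadir, the correction factor is the ratio of modelled nadir reflectance to modelled sensed reflectance, and NBAR is the observed reflectance scaled by that factor. The numeric kernel values below are placeholders, not real MODIS retrievals, and this is not the `sen2nbar` API.

```python
# Hedged sketch of the c-factor method: NBAR = c * SR, where c is the
# ratio of the kernel-driven BRDF model evaluated at nadir vs. at the
# sensed sun/view geometry. All numbers below are illustrative.

def c_factor(f_iso, f_vol, f_geo,
             k_vol_nadir, k_geo_nadir,
             k_vol_sensed, k_geo_sensed):
    nadir = f_iso + f_vol * k_vol_nadir + f_geo * k_geo_nadir
    sensed = f_iso + f_vol * k_vol_sensed + f_geo * k_geo_sensed
    return nadir / sensed

def to_nbar(sr, c):
    # apply the per-band correction factor to observed surface reflectance
    return c * sr

# placeholder BRDF parameters and kernel values for one band
c = c_factor(0.1, 0.05, 0.02,
             k_vol_nadir=-0.02, k_geo_nadir=-1.1,
             k_vol_sensed=0.10, k_geo_sensed=-1.3)
nbar = to_nbar(0.2, c)
```

In practice the kernel values depend on solar zenith, view zenith, and relative azimuth per pixel; a package like sen2nbar handles that bookkeeping across whole images and data cubes.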
https://arxiv.org/abs/2404.15812
Autonomous vehicles (AVs) heavily rely on LiDAR perception for environment understanding and navigation. LiDAR intensity provides valuable information about the reflected laser signals and plays a crucial role in enhancing the perception capabilities of AVs. However, accurately simulating LiDAR intensity remains a challenge due to the unavailability of material properties of the objects in the environment and the complex interactions between the laser beam and the environment. The proposed method aims to improve the accuracy of intensity simulation by incorporating physics-based modalities within the deep learning framework. One of the key quantities that captures the interaction between the laser beam and the objects is the angle of incidence. In this work, we demonstrate that adding the LiDAR incidence angle as a separate input to the deep neural networks significantly enhances the results. We present a comparative study between two prominent deep learning architectures: U-NET, a Convolutional Neural Network (CNN), and Pix2Pix, a Generative Adversarial Network (GAN). We implemented these two architectures for the intensity prediction task and used the SemanticKITTI and VoxelScape datasets for experiments. The comparative analysis reveals that both architectures benefit from the incidence angle as an additional input. Moreover, the Pix2Pix architecture outperforms U-NET, especially when the incidence angle is incorporated.
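The geometric quantity fed to the networks is straightforward to compute: the angle of incidence between the beam direction and the surface normal satisfies cos(θ) = |n · d| / (|n| |d|). A minimal sketch:

```python
import numpy as np

# Compute the LiDAR angle of incidence (in degrees) between a surface
# normal and a beam direction. The absolute value makes the result
# independent of the normal's orientation convention.

def incidence_angle(normal, beam_dir):
    n = np.asarray(normal, dtype=float)
    d = np.asarray(beam_dir, dtype=float)
    cos_t = abs(n @ d) / (np.linalg.norm(n) * np.linalg.norm(d))
    return np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0)))

# beam hitting a horizontal surface straight down: 0 degrees
a0 = incidence_angle([0, 0, 1], [0, 0, -1])
# beam tilted 45 degrees away from the surface normal
a45 = incidence_angle([0, 0, 1], [1, 0, -1])
```

Computed per point from estimated surface normals, this angle can be stacked as an extra input channel alongside range data.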
https://arxiv.org/abs/2404.15774
This paper handles the problem of converting real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though this problem could be realized by a wide range of image-to-image translation models, a notable issue with all these methods is that the original image content details could be easily erased or corrupted due to the transfer of ink-wash style elements. To mitigate this issue, we propose to incorporate saliency detection into the unpaired image-to-image translation framework to regularize content information of the generated paintings. The saliency map is utilized for content regularization from two aspects, both explicitly and implicitly: (i) we propose a saliency IOU (SIOU) loss to explicitly regularize saliency consistency before and after stylization; (ii) we propose saliency adaptive normalization (SANorm), which implicitly enhances content integrity of the generated paintings by injecting saliency information into the generator network to guide painting generation. Besides, we also propose a saliency-attended discriminator network that harnesses the saliency mask to focus generative adversarial attention onto salient image regions, which contributes to producing finer ink-wash stylization effects for the salient objects of images. Qualitative and quantitative experiments consistently demonstrate the superiority of our model over related advanced methods for Chinese ink-wash painting style transfer.
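A soft IoU over saliency maps, in the spirit of the SIOU term above, can be written with elementwise min/max; the loss is 1 minus the IoU. The exact formulation in the paper may differ — this is a minimal sketch.

```python
import numpy as np

# Soft saliency-IoU loss sketch: for saliency maps s1, s2 with values in
# [0, 1], intersection uses elementwise min and union uses elementwise
# max; identical maps give ~0 loss, disjoint maps give ~1.

def siou_loss(s1, s2, eps=1e-8):
    s1 = np.asarray(s1, dtype=float)
    s2 = np.asarray(s2, dtype=float)
    inter = np.minimum(s1, s2).sum()
    union = np.maximum(s1, s2).sum()
    return 1.0 - inter / (union + eps)

m = np.array([[0.0, 1.0], [1.0, 0.0]])
loss_same = siou_loss(m, m)          # identical saliency -> near 0
loss_diff = siou_loss(m, 1.0 - m)    # disjoint saliency -> near 1
```

During training, the two arguments would be the saliency maps of the input photo and of the stylized output, so the generator is penalized when stylization moves or erases salient content.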
https://arxiv.org/abs/2404.15743
Deepfakes, synthetic images generated by deep learning algorithms, represent one of the biggest challenges in the field of Digital Forensics. The scientific community is working to develop approaches that can discriminate the origin of digital images (real or AI-generated). However, these methodologies face the challenge of generalization, that is, the ability to discern the nature of an image even if it is generated by an architecture not seen during training. This usually leads to a drop in performance. In this context, we propose a novel approach based on three blocks called Base Models, each of which is responsible for extracting the discriminative features of a specific image class (Diffusion Model-generated, GAN-generated, or real), as it is trained by exploiting deliberately unbalanced datasets. The features extracted from each block are then concatenated and processed to discriminate the origin of the input image. Experimental results show that this approach not only demonstrates good robustness to JPEG compression but also outperforms state-of-the-art methods in several generalization tests. Code, models and dataset are available at this https URL.
https://arxiv.org/abs/2404.15697
Recent advances in text-to-image models have opened new frontiers in human-centric generation. However, these models cannot be directly employed to generate images with consistent newly coined identities. In this work, we propose CharacterFactory, a framework that allows sampling new characters with consistent identities in the latent space of GANs for diffusion models. More specifically, we consider the word embeddings of celeb names as ground truths for the identity-consistent generation task and train a GAN model to learn the mapping from a latent space to the celeb embedding space. In addition, we design a context-consistent loss to ensure that the generated identity embeddings can produce identity-consistent images in various contexts. Remarkably, the whole model only takes 10 minutes for training, and can sample infinite characters end-to-end during inference. Extensive experiments demonstrate excellent performance of the proposed CharacterFactory on character creation in terms of identity consistency and editability. Furthermore, the generated characters can be seamlessly combined with the off-the-shelf image/video/3D diffusion models. We believe that the proposed CharacterFactory is an important step for identity-consistent character generation. Project page is available at: this https URL.
https://arxiv.org/abs/2404.15677
In mesh simplification, common requirements like accuracy, triangle quality, and feature alignment are often considered as a trade-off. Existing algorithms concentrate on just one or a few specific aspects of these requirements. For example, the well-known Quadric Error Metrics (QEM) approach prioritizes accuracy and can preserve strong feature lines/points as well but falls short in ensuring high triangle quality and may degrade weak features that are not as distinctive as strong ones. In this paper, we propose a smooth functional that simultaneously considers all of these requirements. The functional comprises a normal anisotropy term and a Centroidal Voronoi Tessellation (CVT) energy term, with the variables being a set of movable points lying on the surface. The former inherits the spirit of QEM but operates in a continuous setting, while the latter encourages even point distribution, allowing various surface metrics. We further introduce a decaying weight to automatically balance the two terms. We selected 100 CAD models from the ABC dataset, along with 21 organic models, to compare the existing mesh simplification algorithms with ours. Experimental results reveal an important observation: the introduction of a decaying weight effectively reduces the conflict between the two terms and enables the alignment of weak features. This distinctive feature sets our approach apart from most existing mesh simplification methods and demonstrates significant potential in shape understanding.
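The combined objective with its decaying weight can be sketched abstractly: the total energy mixes the normal-anisotropy term and the CVT term, with the weight on one term decaying over iterations so the balance shifts automatically. The exponential schedule below is an assumption for illustration; the paper's actual decay law may differ.

```python
import numpy as np

# Sketch of a decaying-weight combination of two energy terms, as in
# E(t) = w(t) * E_aniso + E_cvt. The exponential schedule (w0, tau) is
# an illustrative assumption, not the paper's exact decay law.

def total_energy(e_aniso, e_cvt, it, w0=1.0, tau=50.0):
    w = w0 * np.exp(-it / tau)   # weight decays as iterations proceed
    return w * e_aniso + e_cvt

early = total_energy(2.0, 1.0, it=0)     # anisotropy term dominates early
late = total_energy(2.0, 1.0, it=500)    # CVT term dominates late
```

Early iterations therefore pull movable points toward feature-aligned configurations, while late iterations let the CVT term even out the point distribution, which is exactly the conflict-reduction effect the experiments report.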
https://arxiv.org/abs/2404.15661
The Vision Transformer (ViT) has demonstrated remarkable performance in Self-Supervised Learning (SSL) for 3D medical image analysis. Mask AutoEncoder (MAE) feature pre-training can further unleash the potential of ViT on various medical vision tasks. However, due to the large spatial size and higher dimensionality of 3D medical images, the lack of hierarchical design in MAE may hinder the performance of downstream tasks. In this paper, we propose a novel Mask in Mask (MiM) pre-training framework for 3D medical images, which aims to advance MAE by learning discriminative representations from hierarchical visual tokens across varying scales. We introduce multiple levels of granularity for masked inputs from the volume, which are then reconstructed simultaneously at both fine and coarse levels. Additionally, a cross-level alignment mechanism is applied to adjacent-level volumes to enforce anatomical similarity hierarchically. Furthermore, we adopt a hybrid backbone to enhance hierarchical representation learning efficiently during pre-training. MiM was pre-trained on a large collection of available 3D volumetric images, i.e., Computed Tomography (CT) images covering various body parts. Extensive experiments on thirteen public datasets demonstrate the superiority of MiM over other SSL methods in organ/lesion/tumor segmentation and disease classification. We further scale up MiM to large pre-training datasets with more than 10k volumes, showing that large-scale pre-training can further enhance the performance of downstream tasks. This improvement also suggests that the research community should pay more attention to the scale of the pre-training dataset when building healthcare foundation models for 3D medical images.
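The multi-granularity masking idea can be illustrated with nested masks: draw a random mask over coarse blocks, then expand it to the fine token grid so the two levels stay consistent. Grid sizes, the block factor, and the masking ratio below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

# Sketch of nested multi-granularity masking for a 3D token grid: a
# random coarse mask is expanded blockwise to the fine level with
# np.kron, so every fine token inherits its coarse block's mask state.

def nested_masks(fine_shape=(8, 8, 8), block=2, ratio=0.5, seed=0):
    rng = np.random.default_rng(seed)
    coarse_shape = tuple(s // block for s in fine_shape)
    coarse = rng.random(coarse_shape) < ratio          # True = masked
    fine = np.kron(coarse.astype(np.uint8),
                   np.ones((block,) * 3, dtype=np.uint8)).astype(bool)
    return coarse, fine

coarse, fine = nested_masks()
```

Reconstruction targets can then be defined at both levels from the same underlying mask, and a cross-level alignment term can compare features of corresponding coarse blocks and fine tokens.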
https://arxiv.org/abs/2404.15580
General-purpose Large Language Models (LLMs) such as the Generative Pretrained Transformer (GPT) and Large Language Model Meta AI (LLaMA) have attracted much attention in recent years. There is strong evidence that these models can perform remarkably well in various natural language processing tasks. However, how to leverage them to approach domain-specific use cases and drive value remains an open question. In this work, we focus on a specific use case, pharmaceutical manufacturing investigations, and propose that leveraging an organization's historical records of manufacturing incidents and deviations can be beneficial for addressing and closing new cases, or de-risking new manufacturing campaigns. Using a small but diverse dataset of real manufacturing deviations selected from different product lines, we evaluate and quantify the capabilities of three general-purpose LLMs (GPT-3.5, GPT-4, and Claude-2) in performing tasks related to the above goal. In particular, we examine (1) the ability of LLMs to automate the extraction of specific information, such as the root cause of a case, from unstructured data, and (2) the possibility of identifying similar or related deviations by performing semantic search on the database of historical records. While our results point to the high accuracy of GPT-4 and Claude-2 in the information extraction task, we discuss cases of complex interplay between the apparent reasoning and hallucination behavior of LLMs as a risk factor. Furthermore, we show that semantic search on vector embeddings of deviation descriptions can be used to identify similar records, such as those with a similar type of defect, with a high level of accuracy. We discuss further improvements to enhance the accuracy of similar record identification.
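The semantic-search step above reduces to nearest-neighbor retrieval over embedding vectors. In the sketch below, small placeholder vectors stand in for real text embeddings of deviation descriptions; cosine similarity ranks historical records against a query.

```python
import numpy as np

# Semantic-search sketch: rank historical deviation records against a
# query by cosine similarity over their embedding vectors. The 3-D
# vectors here are placeholders for real text embeddings.

def top_k_similar(query, records, k=2):
    q = query / np.linalg.norm(query)
    m = records / np.linalg.norm(records, axis=1, keepdims=True)
    sims = m @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

records = np.array([[1.0, 0.0, 0.0],    # e.g. a particulate-defect record
                    [0.9, 0.1, 0.0],    # a similar defect type
                    [0.0, 0.0, 1.0]])   # an unrelated deviation
idx, sims = top_k_similar(np.array([1.0, 0.05, 0.0]), records)
```

In a real deployment the embeddings would come from an embedding model, and the retrieved records would be surfaced to an investigator rather than acted on automatically.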
https://arxiv.org/abs/2404.15578
Phishing, a prevalent cybercrime tactic for decades, remains a significant threat in today's digital world. By leveraging clever social engineering elements and modern technology, cybercrime targets many individuals, businesses, and organizations to exploit trust and security. These cyber-attackers are often disguised in many trustworthy forms to appear as legitimate sources. By cleverly using psychological elements like urgency, fear, social proof, and other manipulative strategies, phishers can lure individuals into revealing sensitive and personalized information. Building on this pervasive issue within modern technology, this paper aims to analyze the effectiveness of 15 Large Language Models (LLMs) in detecting phishing attempts, specifically focusing on a randomized set of "419 Scam" emails. The objective is to determine which LLMs can accurately detect phishing emails by analyzing a text file containing email metadata based on predefined criteria. The experiment concluded that the following models, ChatGPT 3.5, GPT-3.5-Turbo-Instruct, and ChatGPT, were the most effective in detecting phishing emails.
https://arxiv.org/abs/2404.15485
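The detection task above amounts to scoring an email against predefined phishing criteria. As a hedged stand-in for the LLM judgment (the paper's actual prompts and criteria are not given here), the sketch below scores an email against a few cues typical of 419 scams, such as urgency, money transfers, and secrecy:

```python
# Hypothetical cue list -- a heuristic stand-in for the LLM's criteria,
# not the criteria used in the paper.
PHISHING_CUES = ("urgent", "wire transfer", "confidential", "inheritance", "million")

def phishing_score(email_text: str) -> float:
    # Fraction of cues present in the (lower-cased) email body.
    text = email_text.lower()
    hits = sum(cue in text for cue in PHISHING_CUES)
    return hits / len(PHISHING_CUES)

def is_phishing(email_text: str, threshold: float = 0.4) -> bool:
    return phishing_score(email_text) >= threshold

scam = "URGENT: a confidential inheritance of ten million dollars awaits, arrange a wire transfer"
benign = "The quarterly meeting has been moved to Thursday at 3pm"
```

An LLM-based detector would replace the keyword check with a prompt asking the model to classify the email metadata, but the surrounding evaluation loop looks the same.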
Feature pyramids have been widely adopted in convolutional neural networks (CNNs) and transformers for tasks like medical image segmentation and object detection. However, existing models generally focus on the encoder-side Transformer for feature extraction, leaving room for further gains from a well-designed decoder. We propose CFPFormer, a novel decoder block that integrates feature pyramids and transformers. Specifically, by leveraging patch embedding, cross-layer feature concatenation, and Gaussian attention mechanisms, CFPFormer enhances feature extraction while promoting generalization across diverse tasks. Benefiting from the Transformer structure and U-shaped connections, our model captures long-range dependencies and effectively up-samples feature maps. It achieves superior performance in detecting small objects compared to existing methods. We evaluate CFPFormer on medical image segmentation datasets and object detection benchmarks (VOC 2007, VOC 2012, MS-COCO), demonstrating its effectiveness and versatility. On the ACDC post-2017 MICCAI Challenge online test set, our model reaches impressive accuracy, and it also performs well against the original decoder setting on the Synapse multi-organ segmentation dataset.
https://arxiv.org/abs/2404.15451
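One illustrative reading of the "Gaussian attention" mentioned above is scaled dot-product attention with a Gaussian positional bias that favors nearby positions; the paper's exact formulation may differ. A minimal NumPy sketch:

```python
import numpy as np

def gaussian_attention(q, k, v, sigma=1.0):
    # Scaled dot-product attention plus a Gaussian distance bias:
    # tokens attend more strongly to positions close to their own.
    # Illustrative only -- not CFPFormer's exact mechanism.
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    pos = np.arange(n)
    bias = -((pos[:, None] - pos[None, :]) ** 2) / (2 * sigma ** 2)
    scores = scores + bias
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

out = gaussian_attention(np.eye(4), np.eye(4), np.eye(4))
```

With identity inputs, each row's attention mass concentrates on its own position, since the Gaussian bias penalizes distant tokens quadratically.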
Medical Vision-Language Pretraining (Med-VLP) establishes a connection between visual content from medical images and the relevant textual descriptions. Existing Med-VLP methods primarily focus on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend the scope of Med-VLP to encompass 3D images, specifically targeting full-body scenarios, by using a multimodal dataset of CT images and reports. Compared with its 2D counterpart, 3D VLP must effectively capture essential semantics from the significantly sparser representation in 3D imaging. We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we developed an abnormality dictionary to augment contrastive learning with diverse negative samples. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, demonstrates that it can identify organs and abnormalities in a zero-shot manner using natural language. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show our model's superior performance over the standard CLIP framework across zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.
https://arxiv.org/abs/2404.15272
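The multimodal contrastive learning described above follows the CLIP recipe: matched organ-level image-text pairs are positives (the diagonal of a similarity matrix), everything else is a negative. A minimal NumPy sketch of the symmetric contrastive (InfoNCE-style) loss, with toy embeddings rather than real CT or text encoders:

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    # Symmetric CLIP-style contrastive loss. Matched (image, text) pairs
    # sit on the diagonal of the logit matrix; all off-diagonal entries
    # act as negatives. Toy sketch -- not CT-GLIP's exact training code.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    n = logits.shape[0]
    log_sm_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_sm_t2i = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))
    return -(np.trace(log_sm_i2t) + np.trace(log_sm_t2i)) / (2 * n)

e = np.eye(3)
loss_aligned = info_nce(e, e)            # embeddings perfectly matched
loss_shuffled = info_nce(e, np.roll(e, 1, axis=0))  # pairs misaligned
```

The abnormality dictionary in the paper would supply additional hard negative texts, i.e., extra off-diagonal entries in the logit matrix.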
The traditional Transformer model encounters challenges with variable-length input sequences, particularly in Hyperspectral Image Classification (HSIC), leading to efficiency and scalability concerns. To overcome this, we propose a pyramid-based hierarchical transformer (PyFormer). This innovative approach organizes input data hierarchically into segments, each representing distinct abstraction levels, thereby enhancing processing efficiency for lengthy sequences. At each level, a dedicated transformer module is applied, effectively capturing both local and global context. Spatial and spectral information flow within the hierarchy facilitates communication and abstraction propagation. Integration of outputs from different levels culminates in the final input representation. Experimental results underscore the superiority of the proposed method over traditional approaches. Additionally, the incorporation of disjoint samples augments robustness and reliability, thereby highlighting the potential of our approach in advancing HSIC. The source code is available at this https URL.
https://arxiv.org/abs/2404.14945
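The hierarchical organization described above can be sketched as splitting the input sequence into progressively finer segments and summarizing each one, yielding one coarse-to-fine representation per level. This is a generic sketch of hierarchical pooling under simple assumptions (mean pooling, equal-sized segments), not PyFormer's actual modules:

```python
import numpy as np

def pyramid_levels(x, level_sizes=(1, 2, 4)):
    # x: (n_tokens, dim). For each level, partition the sequence into
    # `segments` contiguous chunks and mean-pool each chunk, so coarser
    # levels summarize broader context and finer levels keep local detail.
    n, d = x.shape
    levels = []
    for segments in level_sizes:
        bounds = np.linspace(0, n, segments + 1).astype(int)
        pooled = np.stack([x[a:b].mean(axis=0)
                           for a, b in zip(bounds[:-1], bounds[1:])])
        levels.append(pooled)
    return levels

x = np.arange(8.0).reshape(8, 1)   # a toy 8-token, 1-dim "spectral" sequence
levels = pyramid_levels(x)
```

In the full model, a transformer module would process each level's segments, and the per-level outputs would be fused into the final representation.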
Anomaly detection in real-world scenarios poses challenges due to dynamic and often unknown anomaly distributions, requiring robust methods that operate under an open-world assumption. This challenge is exacerbated in practical settings, where models are employed by private organizations, precluding data sharing due to privacy and competitive concerns. Despite potential benefits, the sharing of anomaly information across organizations is restricted. This paper addresses the question of enhancing outlier detection within individual organizations without compromising data confidentiality. We propose a novel method leveraging representation learning and federated learning techniques to improve the detection of unknown anomalies. Specifically, our approach utilizes latent representations obtained from client-owned autoencoders to refine the decision boundary of inliers. Notably, only model parameters are shared between organizations, preserving data privacy. The efficacy of our proposed method is evaluated on two standard financial tabular datasets and an image dataset for anomaly detection in a distributed setting. The results demonstrate a strong improvement in the classification of unknown outliers during the inference phase for each organization's model.
https://arxiv.org/abs/2404.14933
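Two ingredients of the approach above can be sketched concretely: federated averaging of client model parameters (only parameters cross organization boundaries, never raw data), and anomaly scoring by autoencoder reconstruction error. The sketch uses a toy linear "autoencoder" with hypothetical weights, not the paper's architecture:

```python
import numpy as np

def fed_avg(client_params):
    # Average corresponding parameter arrays across clients -- the core
    # of the federated setup: raw (private) data never leaves a client.
    return [np.mean([p[i] for p in client_params], axis=0)
            for i in range(len(client_params[0]))]

def reconstruction_score(x, w):
    # Anomaly score = reconstruction error of a linear encoder/decoder
    # pair (w, w.T) -- a toy stand-in for each client's autoencoder.
    recon = x @ w @ w.T
    return float(np.linalg.norm(x - recon))

# Parameter averaging across two hypothetical clients.
avg = fed_avg([[np.ones((2, 2))], [np.zeros((2, 2))]])

# Scoring: the bottleneck w only preserves the first axis, so samples
# off that axis reconstruct poorly and score as anomalies.
w = np.array([[1.0], [0.0]])
score_inlier = reconstruction_score(np.array([[2.0, 0.0]]), w)
score_outlier = reconstruction_score(np.array([[0.0, 2.0]]), w)
```

The paper additionally shares latent representations' statistics to refine the inlier decision boundary; the parameter-only exchange shown here is the privacy-preserving baseline.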