Diffusion-based audio and music generation models commonly generate music by constructing an image representation of audio (e.g., a mel-spectrogram) and then converting it to audio using a phase reconstruction model or vocoder. Typical vocoders, however, produce monophonic audio at lower resolutions (e.g., 16-24 kHz), which limits their effectiveness. We propose MusicHiFi -- an efficient high-fidelity stereophonic vocoder. Our method employs a cascade of three generative adversarial networks (GANs) that converts low-resolution mel-spectrograms to audio, upsamples the result to high-resolution audio via bandwidth expansion, and upmixes it to stereophonic audio. Compared to previous work, we propose 1) a unified GAN-based generator and discriminator architecture and training procedure for each stage of our cascade, 2) a new fast, near downsampling-compatible bandwidth extension module, and 3) a new fast downmix-compatible mono-to-stereo upmixer that ensures the preservation of monophonic content in the output. We evaluate our approach using both objective and subjective listening tests and find that it yields comparable or better audio quality, better spatialization control, and significantly faster inference speed compared to past work. Sound examples are at this https URL.
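The "downmix-compatible" property of the upmixer can be made concrete with a mid/side decomposition: if the network predicts only a side channel and the stereo pair is formed as mono plus/minus side, the mono downmix is recovered exactly. A minimal sketch, where `predict_side` is a hypothetical stand-in for the learned side-channel predictor (not MusicHiFi's actual network):

```python
# Hedged sketch of downmix-compatible mono-to-stereo upmixing.
# `predict_side` is a placeholder for a learned side-channel predictor.

def predict_side(mono):
    # Illustrative stand-in: scale the mono signal to fake a side channel.
    return [0.5 * x for x in mono]

def upmix(mono):
    side = predict_side(mono)
    left = [m + s for m, s in zip(mono, side)]
    right = [m - s for m, s in zip(mono, side)]
    return left, right

def downmix(left, right):
    return [(l + r) / 2 for l, r in zip(left, right)]

mono = [0.0, 0.5, -0.25, 1.0]
left, right = upmix(mono)
# The mid/side structure guarantees the mono downmix is preserved exactly:
assert downmix(left, right) == mono
```

Whatever side channel the predictor produces, (L + R) / 2 cancels it, which is why mono content survives any downstream downmix.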
https://arxiv.org/abs/2403.10493
As AI systems become more advanced, companies and regulators will make difficult decisions about whether it is safe to train and deploy them. To prepare for these decisions, we investigate how developers could make a 'safety case,' which is a structured rationale that AI systems are unlikely to cause a catastrophe. We propose a framework for organizing a safety case and discuss four categories of arguments to justify safety: total inability to cause a catastrophe, sufficiently strong control measures, trustworthiness despite capability to cause harm, and deference to credible AI advisors. We evaluate concrete examples of arguments in each category and outline how arguments could be combined to justify that AI systems are safe to deploy.
https://arxiv.org/abs/2403.10462
Diffusion-based generative models have emerged as powerful tools in the realm of generative modeling. Despite extensive research on denoising across various timesteps and noise levels, a conflict persists regarding the relative difficulties of the denoising tasks. While various studies argue that lower timesteps present more challenging tasks, others contend that higher timesteps are more difficult. To address this conflict, our study undertakes a comprehensive examination of task difficulties, focusing on convergence behavior and changes in relative entropy between consecutive probability distributions across timesteps. Our observational study reveals that denoising at earlier timesteps poses challenges characterized by slower convergence and higher relative entropy, indicating increased task difficulty at these lower timesteps. Building on these observations, we introduce an easy-to-hard learning scheme, drawing from curriculum learning, to enhance the training process of diffusion models. By organizing timesteps or noise levels into clusters and training models on these clusters from easier to harder tasks, we facilitate an order-aware training regime, progressing from easier to harder denoising tasks, thereby deviating from the conventional approach of training diffusion models simultaneously across all timesteps. Our approach leads to improved performance and faster convergence by leveraging the benefits of curriculum learning, while maintaining orthogonality with existing improvements in diffusion training techniques. We validate these advantages through comprehensive experiments in image generation tasks, including unconditional, class-conditional, and text-to-image generation.
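The clustering-and-ordering idea above can be sketched in a few lines. This is a minimal illustration, assuming the finding that low timesteps are harder; the cluster count and schedule are invented for the example, not the paper's settings:

```python
# Hedged sketch of an easy-to-hard timestep curriculum for diffusion training.
# Low timesteps were observed to be harder, so training starts at the
# high-timestep (easy) clusters and moves toward the low-timestep (hard) ones.

def make_clusters(num_timesteps, num_clusters):
    size = num_timesteps // num_clusters
    return [list(range(i * size, (i + 1) * size)) for i in range(num_clusters)]

def curriculum_order(clusters):
    # Clusters of higher timesteps come first (easier denoising tasks).
    return list(reversed(clusters))

clusters = make_clusters(num_timesteps=1000, num_clusters=4)
order = curriculum_order(clusters)
assert order[0][0] == 750   # first phase: high timesteps, easier task
assert order[-1][0] == 0    # final phase: low timesteps, harder task
```

A real training loop would sample timesteps only from the currently active cluster(s), widening or advancing the set as each phase converges.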
https://arxiv.org/abs/2403.10348
The deepfake threats to society and cybersecurity have provoked significant public apprehension, driving intensified efforts within the realm of deepfake video detection. Current video-level methods are mostly based on 3D CNNs, which achieve good performance but impose high computational demands. This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies. This transformation process involves sequentially masking frames at the same positions within each frame. These frames are then resized into sub-frames and reorganized into the predetermined layout, forming thumbnails. TALL is model-agnostic and remarkably simple, necessitating only minimal code modifications. Furthermore, we introduce a graph reasoning block (GRB) and a semantic consistency (SC) loss to strengthen TALL, culminating in TALL++. The GRB enhances interactions between different semantic regions to capture semantic-level inconsistency clues. The semantic consistency loss imposes consistency constraints on semantic features to improve the model's generalization ability. Extensive experiments on intra-dataset, cross-dataset, diffusion-generated image detection, and deepfake generation method recognition show that TALL++ achieves results surpassing or comparable to state-of-the-art methods, demonstrating the effectiveness of our approaches for various deepfake detection problems. The code is available at this https URL.
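The core layout transform can be sketched directly: consecutive frames are resized into sub-frames and tiled into one image of the original size, so a plain 2D backbone sees temporal neighbours side by side. This is a minimal sketch with nearest-neighbour resizing and a 2x2 grid; TALL's exact masking and resizing details are in the paper:

```python
import numpy as np

# Hedged sketch of the thumbnail layout: tile four frames into one image.

def resize_nearest(frame, out_h, out_w):
    # Simple nearest-neighbour downsizing of a single-channel frame.
    h, w = frame.shape
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return frame[rows][:, cols]

def thumbnail_layout(frames, grid=(2, 2)):
    gh, gw = grid
    h, w = frames[0].shape
    sub_h, sub_w = h // gh, w // gw
    subs = [resize_nearest(f, sub_h, sub_w) for f in frames[: gh * gw]]
    rows = [np.hstack(subs[r * gw : (r + 1) * gw]) for r in range(gh)]
    return np.vstack(rows)

# Toy clip: frame t is filled with the constant value t.
clip = [np.full((8, 8), t, dtype=np.float32) for t in range(4)]
thumb = thumbnail_layout(clip)
assert thumb.shape == (8, 8)                  # same size as one input frame
assert thumb[0, 0] == 0 and thumb[7, 7] == 3  # frames laid out in order
```

The output has the same spatial footprint as a single frame, which is why the strategy needs only minimal code changes in an existing image model.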
https://arxiv.org/abs/2403.10261
The standard approach to tackling computer vision problems is to train deep convolutional neural network (CNN) models using large-scale image datasets that are representative of the target task. However, in many scenarios, it is often challenging to obtain sufficient image data for the target task. Data augmentation is a way to mitigate this challenge. A common practice is to explicitly transform existing images in desired ways so as to create the volume and variability of training data necessary to achieve good generalization performance. In situations where data for the target domain is not accessible, a viable workaround is to synthesize training data from scratch--i.e., synthetic data augmentation. This paper presents an extensive review of synthetic data augmentation techniques. It covers data synthesis approaches based on realistic 3D graphics modeling, neural style transfer (NST), differential neural rendering, and generative artificial intelligence (AI) techniques such as generative adversarial networks (GANs) and variational autoencoders (VAEs). For each of these classes of methods, we focus on the important data generation and augmentation techniques, the general scope of application and specific use-cases, as well as existing limitations and possible workarounds. Additionally, we provide a summary of common synthetic datasets for training computer vision models, highlighting the main features, application domains, and supported tasks. Finally, we discuss the effectiveness of synthetic data augmentation methods. Since this is the first paper to explore synthetic data augmentation methods in great detail, we hope to equip readers with the necessary background information and in-depth knowledge of existing methods and their attendant issues.
https://arxiv.org/abs/2403.10075
Relying on paired synthetic data, existing learning-based Computational Aberration Correction (CAC) methods are confronted with an intricate and multifaceted synthetic-to-real domain gap, which leads to suboptimal performance in real-world applications. In this paper, rather than improving the simulation pipeline, we deliver a novel insight into real-world CAC from the perspective of Unsupervised Domain Adaptation (UDA). By incorporating readily accessible unpaired real-world data into training, we formalize the Domain Adaptive CAC (DACAC) task, and then introduce a comprehensive Real-world aberrated images (Realab) dataset to benchmark it. The task presents a formidable challenge due to the intricacy of understanding the target aberration domain. To this end, we propose a novel Quantized Domain-Mixing Representation (QDMR) framework as a potent solution to the issue. QDMR adapts the CAC model to the target domain in three key aspects: (1) reconstructing aberrated images of both domains with a VQGAN to learn a Domain-Mixing Codebook (DMC) that characterizes degradation-aware priors; (2) modulating the deep features in the CAC model with the DMC to transfer target-domain knowledge; and (3) leveraging the trained VQGAN to generate pseudo target aberrated images from source ones for convincing target-domain supervision. Extensive experiments on both synthetic and real-world benchmarks reveal that models with QDMR consistently surpass competitive methods in mitigating the synthetic-to-real gap, producing visually pleasant real-world CAC results with fewer artifacts. Codes and datasets will be made publicly available.
https://arxiv.org/abs/2403.10012
Human-robot collaborative applications require scene representations that are kept up to date and facilitate safe motions in dynamic scenes. In this letter, we present an interactive distance field mapping and planning (IDMP) framework that handles dynamic objects and collision avoidance through an efficient representation. We define interactive mapping and planning as the process of creating and updating the representation of the scene online while simultaneously planning and adapting the robot's actions based on that representation. Given depth sensor data, our framework builds a continuous field that allows querying the distance and gradient to the closest obstacle at any required position in 3D space. The key aspect of this work is an efficient Gaussian Process field that performs incremental updates and implicitly handles dynamic objects with a simple and elegant formulation based on a temporary latent model. In terms of mapping, IDMP is able to fuse point cloud data from single and multiple sensors, query the free space at any spatial resolution, and deal with moving objects without semantics. In terms of planning, IDMP allows seamless integration with gradient-based motion planners, facilitating fast re-planning for collision-free navigation. The framework is evaluated on both real and synthetic datasets. A comparison with similar state-of-the-art frameworks shows superior performance when handling dynamic objects and comparable or better accuracy of the computed distance and gradient field. Finally, we show how the framework can be used for fast motion planning in the presence of moving objects. An accompanying video, code, and datasets are made publicly available at this https URL.
https://arxiv.org/abs/2403.09988
We present the evaluation methodology, datasets and results of the BOP Challenge 2023, the fifth in a series of public competitions organized to capture the state of the art in model-based 6D object pose estimation from an RGB/RGB-D image and related tasks. Besides the three tasks from 2022 (model-based 2D detection, 2D segmentation, and 6D localization of objects seen during training), the 2023 challenge introduced new variants of these tasks focused on objects unseen during training. In the new tasks, methods were required to learn new objects during a short onboarding stage (max 5 minutes, 1 GPU) from provided 3D object models. The best 2023 method for 6D localization of unseen objects (GenFlow) notably reached the accuracy of the best 2020 method for seen objects (CosyPose), although being noticeably slower. The best 2023 method for seen objects (GPose) achieved a moderate accuracy improvement but a significant 43% run-time improvement compared to the best 2022 counterpart (GDRNPP). Since 2017, the accuracy of 6D localization of seen objects has improved by more than 50% (from 56.9 to 85.6 AR_C). The online evaluation system stays open and is available at: this http URL.
https://arxiv.org/abs/2403.09799
Semantic image synthesis (SIS) shows good promise for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a method notable for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the cause of these problems as a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has yielded exceptional results, achieving an FID of 10.53 on Cityscapes and 12.66 on ADE20K. The code and models can be accessed via the project page.
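The categorical prior can be illustrated as follows: instead of drawing inference-time noise from a standard normal, each pixel draws from a Gaussian whose statistics depend on its semantic class. This is a minimal sketch under that reading; the per-class statistics here are hard-coded, whereas SCP-Diff estimates them from noised training data:

```python
import numpy as np

# Hedged sketch of a categorical noise prior for semantic image synthesis:
# per-class mean/std replace the standard-normal prior at inference start.

def categorical_prior(mask, class_mean, class_std, rng):
    # mask: (H, W) integer semantic labels; stats indexed by class id.
    eps = rng.standard_normal(mask.shape)
    return class_mean[mask] + class_std[mask] * eps

rng = np.random.default_rng(0)
mask = np.array([[0, 0], [1, 1]])       # two semantic classes
mean = np.array([0.0, 2.0])             # illustrative per-class statistics
std = np.array([1.0, 0.5])
noise = categorical_prior(mask, mean, std, rng)
assert noise.shape == mask.shape        # one noise value per pixel
```

Setting `class_std` to zero degenerates to a deterministic per-class initialization, which makes the mechanism easy to unit-test.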
https://arxiv.org/abs/2403.09638
At the core of portrait photography is the search for ideal lighting and viewpoint. The process often requires advanced knowledge in photography and an elaborate studio setup. In this work, we propose Holo-Relighting, a volumetric relighting method that is capable of synthesizing novel viewpoints, and novel lighting from a single image. Holo-Relighting leverages the pretrained 3D GAN (EG3D) to reconstruct geometry and appearance from an input portrait as a set of 3D-aware features. We design a relighting module conditioned on a given lighting to process these features, and predict a relit 3D representation in the form of a tri-plane, which can render to an arbitrary viewpoint through volume rendering. Besides viewpoint and lighting control, Holo-Relighting also takes the head pose as a condition to enable head-pose-dependent lighting effects. With these novel designs, Holo-Relighting can generate complex non-Lambertian lighting effects (e.g., specular highlights and cast shadows) without using any explicit physical lighting priors. We train Holo-Relighting with data captured with a light stage, and propose two data-rendering techniques to improve the data quality for training the volumetric relighting system. Through quantitative and qualitative experiments, we demonstrate Holo-Relighting can achieve state-of-the-art relighting quality with better photorealism, 3D consistency and controllability.
https://arxiv.org/abs/2403.09632
The capacitated location-routing problem involves determining the depots from a set of candidate capacitated depot locations and finding the required routes from the selected depots to serve a set of customers, while minimizing a cost function that includes the cost of opening the chosen depots, the fixed utilization cost per vehicle used, and the total cost (distance) of the routes. This paper presents a multi-population integrated framework in which a multi-depot edge assembly crossover generates promising offspring solutions from the perspective of both depot location and route edge assembly. The method includes an effective neighborhood-based local search, a feasibility-restoring procedure, and a diversification-oriented mutation. Of particular interest is the multi-population scheme, which organizes the population into multiple subpopulations based on depot configurations. Extensive experiments on 281 benchmark instances from the literature show that the algorithm performs remarkably well, improving 101 best-known results (new upper bounds) and matching 84 best-known results. Additional experiments are presented to gain insight into the role of the key elements of the algorithm.
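The three-term objective described above is straightforward to write down. A minimal sketch with an invented toy distance model (the benchmark instances use Euclidean distances and per-depot opening costs):

```python
# Hedged sketch of the capacitated location-routing objective:
# depot opening cost + fixed cost per vehicle (route) + total route distance.

def clrp_cost(open_depots, routes, depot_cost, vehicle_cost, route_dist):
    opening = sum(depot_cost[d] for d in open_depots)
    vehicles = vehicle_cost * len(routes)       # one vehicle per route
    distance = sum(route_dist(r) for r in routes)
    return opening + vehicles + distance

depot_cost = {0: 100.0, 1: 80.0}
# Each route starts and ends at its depot.
routes = [[0, "a", "b", 0], [1, "c", 1]]
dist = lambda r: 10.0 * (len(r) - 1)            # toy distance model
total = clrp_cost({0, 1}, routes, depot_cost, vehicle_cost=5.0, route_dist=dist)
assert total == 100.0 + 80.0 + 2 * 5.0 + (30.0 + 20.0)
```

The crossover and local search then explore which depots to open and how to assemble route edges so as to minimize exactly this quantity.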
https://arxiv.org/abs/2403.09361
Stain normalization algorithms aim to transform the color and intensity characteristics of a source multi-gigapixel histology image to match those of a target image, mitigating inconsistencies in the appearance of stains used to highlight cellular components in the images. We propose a new approach, StainFuser, which treats this problem as a style transfer task using a novel Conditional Latent Diffusion architecture, eliminating the need for handcrafted color components. With this method, we curate SPI-2M, the largest stain normalization dataset to date, comprising over 2 million histology images with neural style transfer for high-quality transformations. Trained on this data, StainFuser outperforms current state-of-the-art GAN and handcrafted methods in terms of the quality of normalized images. Additionally, compared to existing approaches, it improves the performance of nuclei instance segmentation and classification models when used as a test-time augmentation method on the challenging CoNIC dataset. Finally, we apply StainFuser to multi-gigapixel Whole Slide Images (WSIs) and demonstrate improved computational efficiency, image quality, and consistency across tiles over current methods.
https://arxiv.org/abs/2403.09302
Enterprises and organizations are faced with potential threats from insider employees that may lead to serious consequences. Previous studies on insider threat detection (ITD) mainly focus on detecting abnormal users or abnormal time periods (e.g., a week or a day). However, a user may have hundreds of thousands of activities in the log, and even within a day there may exist thousands of activities for a user, requiring a high investigation budget to verify abnormal users or activities given the detection results. On the other hand, existing works are mainly post-hoc methods rather than real-time detection, which cannot report insider threats in time before they cause loss. In this paper, we conduct the first study towards real-time ITD at the activity level, and present a fine-grained and efficient framework, LAN. Specifically, LAN simultaneously learns the temporal dependencies within an activity sequence and the relationships between activities across sequences with graph structure learning. Moreover, to mitigate the data imbalance problem in ITD, we propose a novel hybrid prediction loss, which integrates self-supervision signals from normal activities and supervision signals from abnormal activities into a unified loss for anomaly detection. We evaluate the performance of LAN on two widely used datasets, i.e., CERT r4.2 and CERT r5.2. Extensive and comparative experiments demonstrate the superiority of LAN, outperforming 9 state-of-the-art baselines by at least 9.92% and 6.35% in AUC for real-time ITD on CERT r4.2 and r5.2, respectively. Moreover, LAN can also be applied to post-hoc ITD, surpassing 8 competitive baselines by at least 7.70% and 4.03% in AUC on the two datasets. Finally, an ablation study, parameter analysis, and compatibility analysis evaluate the impact of each module and hyper-parameter in LAN.
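The hybrid-loss idea can be sketched abstractly: a self-supervised term computed on (abundant) normal activities plus a supervised term on the (scarce) labeled abnormal ones, blended with a weight. This is a generic sketch in the spirit of the description, with an invented reconstruction term and weighting; LAN's actual formulation differs in detail:

```python
import math

# Hedged sketch of a hybrid anomaly-detection loss: self-supervision on
# normal activities + supervision on labeled activities, combined by alpha.

def bce(p, y):
    # Binary cross-entropy for one prediction, clamped for stability.
    p = min(max(p, 1e-7), 1 - 1e-7)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def hybrid_loss(recon_errors_normal, scores, labels, alpha=0.5):
    # Self-supervised term: mean reconstruction error on normal activities.
    self_sup = sum(recon_errors_normal) / len(recon_errors_normal)
    # Supervised term: mean BCE on labeled (mostly abnormal) activities.
    supervised = sum(bce(p, y) for p, y in zip(scores, labels)) / len(labels)
    return alpha * self_sup + (1 - alpha) * supervised

loss = hybrid_loss([0.1, 0.2], scores=[0.9, 0.2], labels=[1, 0], alpha=0.5)
assert loss > 0
```

The weight lets the self-supervised signal dominate when labeled anomalies are rare, which is the imbalance scenario the abstract targets.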
https://arxiv.org/abs/2403.09209
A generative adversarial network (GAN) is a type of generative model that maps high-dimensional noise to samples in a target distribution. However, the noise dimension required in a GAN is not well understood. Previous approaches view a GAN as a mapping from one continuous distribution to another. In this paper, we propose to view a GAN as a discrete sampler instead. From this perspective, we build a connection between the minimum noise required and the bits needed to losslessly compress the images. Furthermore, to understand the behaviour of a GAN when the noise dimension is limited, we propose a divergence-entropy trade-off. This trade-off depicts the best divergence we can achieve when noise is limited. Like the rate-distortion trade-off, it can be solved numerically when the source distribution is known. Finally, we verify our theory with experiments on image generation.
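The compression connection rests on a standard information-theoretic fact: for a discrete target distribution, the bits needed to losslessly encode a sample are lower-bounded by its Shannon entropy. A minimal illustration of that bound (the paper's link from entropy to minimum noise dimension is more involved):

```python
import math

# Shannon entropy in bits of a discrete distribution: the lossless-coding
# lower bound that the discrete-sampler view ties to minimum noise.

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Four equally likely images need 2 bits of randomness to sample from.
assert entropy_bits([0.25] * 4) == 2.0
# A deterministic output distribution needs no noise at all.
assert entropy_bits([1.0]) == 0.0
```

Intuitively, a generator whose noise carries fewer bits than the target entropy cannot cover the target distribution, which is what the divergence-entropy trade-off quantifies.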
https://arxiv.org/abs/2403.09196
Systemic amyloidosis is a group of diseases characterized by the deposition of misfolded proteins in various organs and tissues, leading to progressive organ dysfunction and failure. Congo red stain is the gold standard chemical stain for the visualization of amyloid deposits in tissue sections, as it forms complexes with the misfolded proteins and shows a birefringence pattern under polarized light microscopy. However, Congo red staining is tedious and costly to perform, and prone to false diagnoses due to variations in the amount of amyloid, staining quality and expert interpretation through manual examination of tissue under a polarization microscope. Here, we report the first demonstration of virtual birefringence imaging and virtual Congo red staining of label-free human tissue to show that a single trained neural network can rapidly transform autofluorescence images of label-free tissue sections into brightfield and polarized light microscopy equivalent images, matching the histochemically stained versions of the same samples. We demonstrate the efficacy of our method with blind testing and pathologist evaluations on cardiac tissue where the virtually stained images agreed well with the histochemically stained ground truth images. Our virtually stained polarization and brightfield images highlight amyloid birefringence patterns in a consistent, reproducible manner while mitigating diagnostic challenges due to variations in the quality of chemical staining and manual imaging processes as part of the clinical workflow.
https://arxiv.org/abs/2403.09100
The launch of ChatGPT by OpenAI in November 2022 marked a pivotal moment for Artificial Intelligence, introducing Large Language Models (LLMs) to the mainstream and setting new records in user adoption. LLMs, particularly ChatGPT, trained on extensive internet data, demonstrate remarkable conversational capabilities across various domains, suggesting a significant impact on the workforce. However, these models are susceptible to errors - "hallucinations" and omissions, generating incorrect or incomplete information. This poses risks especially in contexts where accuracy is crucial, such as legal compliance, medicine or fine-grained process frameworks. There are both technical and human solutions to cope with this issue. This paper explores the human factors that enable users to detect errors in LLM outputs, a critical component in mitigating risks associated with their use in professional settings. Understanding these factors is essential for organizations aiming to leverage LLM technology efficiently, guiding targeted training and deployment strategies to enhance error detection by users. This approach not only aims to optimize the use of LLMs but also to prevent potential downstream issues stemming from reliance on inaccurate model responses. The research emphasizes the balance between technological advancement and human insight in maximizing the benefits of LLMs while minimizing the risks, particularly in areas where precision is paramount. This paper performs a systematic literature research on this research topic, analyses and synthesizes the findings, and outlines future research directions. The literature selection cut-off date is January 11th, 2024.
https://arxiv.org/abs/2403.09743
Radiation therapy is crucial in cancer treatment. Experienced experts typically iteratively generate high-quality dose distribution maps, forming the basis for excellent radiation therapy plans. Therefore, automated prediction of dose distribution maps is significant in expediting the treatment process and providing a better starting point for developing radiation therapy plans. With the remarkable results of diffusion models in predicting high-frequency regions of dose distribution maps, dose prediction methods based on diffusion models have been extensively studied. However, existing methods mainly utilize CNNs or Transformers as denoising networks. CNNs lack the capture of global receptive fields, resulting in suboptimal prediction performance. Transformers excel in global modeling but face quadratic complexity with image size, resulting in significant computational overhead. To tackle these challenges, we introduce a novel diffusion model, MD-Dose, based on the Mamba architecture for predicting radiation therapy dose distribution in thoracic cancer patients. In the forward process, MD-Dose adds Gaussian noise to dose distribution maps to obtain pure noise images. In the backward process, MD-Dose utilizes a noise predictor based on Mamba to predict the noise, ultimately outputting the dose distribution maps. Furthermore, we develop a Mamba encoder to extract structural information and integrate it into the noise predictor for localizing dose regions in the planning target volume (PTV) and organs at risk (OARs). Through extensive experiments on a dataset of 300 thoracic tumor patients, we showcase the superiority of MD-Dose in various metrics and time consumption.
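The forward process mentioned above is the standard variance-preserving diffusion corruption: Gaussian noise is mixed into the dose map so that it approaches pure noise at the final step. A minimal sketch, with an illustrative (not MD-Dose's) noise schedule:

```python
import numpy as np

# Hedged sketch of the diffusion forward process on a dose map:
# x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.

def forward_diffuse(x0, alpha_bar_t, eps):
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
dose_map = rng.random((4, 4))            # stand-in for a dose distribution map
eps = rng.standard_normal((4, 4))

early = forward_diffuse(dose_map, alpha_bar_t=0.99, eps=eps)   # mostly signal
late = forward_diffuse(dose_map, alpha_bar_t=1e-4, eps=eps)    # mostly noise
assert np.abs(early - dose_map).mean() < np.abs(late - dose_map).mean()
```

The backward process inverts this corruption step by step; MD-Dose's contribution is using a Mamba-based network (plus the structure-encoding branch) as the noise predictor in that inversion.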
https://arxiv.org/abs/2403.08479
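The forward process the abstract describes (adding Gaussian noise to a dose map until it becomes pure noise) is the standard DDPM forward diffusion. A minimal sketch follows; the linear beta schedule, number of steps, and map shape are illustrative assumptions, not MD-Dose's actual hyperparameters.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar):
    """Sample x_t ~ q(x_t | x_0) for a DDPM-style forward process:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = np.random.randn(*x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

# Illustrative linear beta schedule (assumed, not from the paper).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal-retention factor

dose_map = np.random.rand(128, 128)      # toy stand-in for a dose distribution map
x_T, _ = forward_diffuse(dose_map, T - 1, alpha_bar)
# At t = T-1 the signal coefficient sqrt(alpha_bar) is tiny, so x_T is
# essentially pure noise -- the starting point for the learned reverse process.
```

The reverse process would then apply the Mamba-based noise predictor at each step to recover the dose map from `x_T`.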
Migration of systems from on-site premises to the cloud has been a fundamental endeavor for many industrial institutions. A crucial component of such cloud migrations is the transition of databases to being hosted online. In this work, we consider the difficulties of this migration for SQL databases. While SQL is one of the prominent methods for storing database procedures, there is a plethora of different SQL dialects (e.g., MySQL, Postgres, etc.), which can complicate migrations when the on-premise SQL dialect differs from the dialect hosted on the cloud. Common cloud providers such as AWS and Azure offer tools to aid in translating between dialects in order to mitigate the majority of the difficulties. However, these tools do not successfully translate $100\%$ of the code. Consequently, software engineers must manually convert the remainder of the untranslated database. For large organizations, this task quickly becomes intractable, and so more innovative solutions are required. We consider this challenge a novel yet vital industrial research problem for any large corporation that is considering cloud migrations. Furthermore, we introduce potential avenues of research to tackle this challenge that have yielded promising preliminary results.
https://arxiv.org/abs/2403.08375
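The translation gap the abstract describes can be illustrated with a toy rule-based converter: mechanical dialect differences are rewritten automatically, while constructs outside the rule set are flagged for manual migration, which is exactly the residue that engineers must handle by hand. The MySQL-to-Postgres rules below are a small assumed subset, not any cloud provider's actual tool.

```python
import re

# Toy MySQL -> Postgres rewrite rules (illustrative, far from complete).
RULES = [
    (re.compile(r"`([^`]+)`"), r'"\1"'),                              # identifier quoting
    (re.compile(r"\bAUTO_INCREMENT\b", re.I), "GENERATED ALWAYS AS IDENTITY"),
    (re.compile(r"\bIFNULL\(", re.I), "COALESCE("),
]
# Constructs this toy converter cannot handle and must escalate to a human.
UNSUPPORTED = re.compile(r"\b(ENGINE\s*=|GROUP_CONCAT)\b", re.I)

def translate(stmt):
    """Return (translated_sql, needs_manual_review)."""
    if UNSUPPORTED.search(stmt):
        return stmt, True        # outside the rule set: left for manual conversion
    for pattern, replacement in RULES:
        stmt = pattern.sub(replacement, stmt)
    return stmt, False

sql, manual = translate("SELECT IFNULL(`name`, 'n/a') FROM `users`")
# -> SELECT COALESCE("name", 'n/a') FROM "users"   (manual is False)
```

Scaling this idea past what fixed rules can express — the untranslated remainder — is the research problem the abstract poses.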
Existing generative adversarial network (GAN) based conditional image generative models typically produce a fixed output for the same conditional input, which is unreasonable for highly subjective tasks such as large-mask image inpainting or style transfer. On the other hand, GAN-based diverse image generation methods require retraining/fine-tuning the network or designing complex noise injection functions, and are computationally expensive, task-specific, or struggle to generate high-quality results. Given that many deterministic conditional image generative models can produce high-quality yet fixed results, we raise an intriguing question: is it possible for pre-trained deterministic conditional image generative models to generate diverse results without changing network structures or parameters? To answer this question, we re-examine conditional image generation tasks from the perspective of adversarial attack and propose a simple and efficient plug-in, projected gradient descent (PGD)-like method for diverse and controllable image generation. The key idea is to attack the pre-trained deterministic generative models by adding a micro perturbation to the input condition. In this way, diverse results can be generated without any adjustment of network structures or fine-tuning of the pre-trained models. In addition, we can control the diverse results to be generated by specifying the attack direction according to a reference text or image. Our work opens the door to applying adversarial attack to low-level vision tasks, and experiments on various conditional image generation tasks demonstrate the effectiveness and superiority of the proposed method.
https://arxiv.org/abs/2403.08294
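The attack idea can be sketched with a toy differentiable generator: ascend the gradient of an output-diversity objective with respect to the input condition, projecting the perturbation into a small $\epsilon$-ball so the condition is only "micro"-perturbed. The frozen linear generator and the squared-distance objective below are stand-ins for a pre-trained model, not the paper's actual setup.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))   # frozen "pre-trained" linear generator weights

def generate(c):
    return W @ c                  # deterministic: the same condition -> same output

def pgd_diversify(c, steps=20, step_size=0.05, eps=0.1):
    """PGD-style ascent on ||g(c + delta) - g(c)||^2 w.r.t. delta, with an
    L_inf projection keeping the perturbation micro (|delta_i| <= eps)."""
    y_ref = generate(c)
    delta = rng.uniform(-eps, eps, size=c.shape)     # random start, as in PGD
    for _ in range(steps):
        # Analytic gradient of the squared distance for this linear toy model.
        grad = 2.0 * W.T @ (generate(c + delta) - y_ref)
        delta += step_size * np.sign(grad)           # ascent: push outputs apart
        delta = np.clip(delta, -eps, eps)            # project into the eps-ball
    return c + delta

c = rng.standard_normal(4)
c_adv = pgd_diversify(c)
# c_adv stays within eps of c, yet the generator output differs -- a second
# "diverse" result from the same frozen model.
```

Different random starts (or, per the abstract, an attack direction derived from a reference text or image) would steer the perturbation toward different diverse outputs.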
Navigating complex environments in Minecraft poses significant challenges for multi-agent systems due to the game's dynamic and unpredictable open-world setting. Agents need to interact with the environment and coordinate their actions with other agents to achieve common objectives. However, traditional approaches often struggle to efficiently manage inter-agent communication and task distribution, which are crucial for effective multi-agent navigation. Furthermore, processing and integrating multi-modal information (such as visual, textual, and auditory data) is essential for agents to fully comprehend their goals and navigate the environment successfully. To address this issue, we design the HAS framework to auto-organize groups of LLM-based agents to complete navigation tasks. In our approach, we devise a hierarchical auto-organizing navigation system, which is characterized by 1) a hierarchical system for multi-agent organization, ensuring centralized planning and decentralized execution; 2) an auto-organizing and intra-communication mechanism, enabling dynamic group adjustment under subtasks; 3) a multi-modal information platform, facilitating multi-modal perception to perform the three navigation tasks with one system. To assess organizational behavior, we design a series of navigation tasks in the Minecraft environment, including searching and exploring. We aim to develop embodied organizations that push the boundaries of embodied AI, moving it towards a more human-like organizational structure.
https://arxiv.org/abs/2403.08282
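The "centralized planning, decentralized execution" organization can be sketched as a planner that decomposes a task into subtasks and dynamically re-groups worker agents per subtask. The class names, the trivial task split, and the round-robin grouping below are assumptions for illustration only, not the HAS framework's actual API (which drives LLM-based agents).

```python
class Worker:
    def __init__(self, name):
        self.name = name

    def execute(self, subtask):
        # Decentralized execution: each worker acts on its assigned subtask.
        return f"{self.name} completed '{subtask}'"

class Planner:
    """Centralized planning: split the task, then auto-organize worker groups."""

    def __init__(self, workers):
        self.workers = workers

    def plan(self, task, n_groups):
        # Toy decomposition: one named subtask per group.
        return [f"{task}/part-{i}" for i in range(n_groups)]

    def run(self, task, n_groups):
        subtasks = self.plan(task, n_groups)
        results = []
        for i, sub in enumerate(subtasks):
            # Dynamic group adjustment: workers are re-assigned per subtask
            # (round-robin here; HAS adjusts groups based on the subtask itself).
            group = self.workers[i::n_groups]
            results.extend(w.execute(sub) for w in group)
        return results

team = Planner([Worker(f"agent-{j}") for j in range(4)])
logs = team.run("explore-village", n_groups=2)
# Four workers split into two groups; every worker reports exactly one result.
```

In the full framework, the planner and workers would exchange multi-modal observations through the shared information platform rather than plain strings.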