Large language models (LLMs) have significantly advanced audio processing through audio codecs that convert audio into discrete tokens, enabling the application of language modeling techniques to audio data. However, audio codecs often operate at high frame rates, resulting in slow training and inference, especially for autoregressive models. To address this challenge, we present the Low Frame-rate Speech Codec (LFSC): a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve high-quality audio compression with a 1.89 kbps bitrate and 21.5 frames per second. We demonstrate that our novel codec can make the inference of LLM-based text-to-speech models around three times faster while improving intelligibility and producing quality comparable to previous models.
大语言模型(LLMs)通过将音频转换为离散 tokens 的音频编解码器显著提高了音频处理。然而,音频编解码器通常以高帧率为操作,导致训练和推理缓慢,特别是对于自回归模型。为了解决这个问题,我们提出了低帧率语音编解码器(LFSC):一种利用有限标量量化和大语言模型中的对抗训练来获得高品质音频压缩的神经音频编解码器,具有1.89 kbps的比特速率和21.5帧每秒。我们证明了我们的新编解码器可以在不降低质量的情况下将基于LLM的文本到语音模型的推理速度提高三倍,同时提高可听度和产生与以前模型相当的质量。
https://arxiv.org/abs/2409.12117
To mitigate the susceptibility of neural networks to adversarial attacks, adversarial training has emerged as a prevalent and effective defense strategy. Intrinsically, this countermeasure incurs a trade-off, as it sacrifices the model's accuracy in processing normal samples. To reconcile the trade-off, we pioneer the incorporation of null-space projection into adversarial training and propose two innovative Null-space Projection based Adversarial Training(NPAT) algorithms tackling sample generation and gradient optimization, named Null-space Projected Data Augmentation (NPDA) and Null-space Projected Gradient Descent (NPGD), to search for an overarching optimal solutions, which enhance robustness with almost zero deterioration in generalization performance. Adversarial samples and perturbations are constrained within the null-space of the decision boundary utilizing a closed-form null-space projector, effectively mitigating threat of attack stemming from unreliable features. Subsequently, we conducted experiments on the CIFAR10 and SVHN datasets and reveal that our methodology can seamlessly combine with adversarial training methods and obtain comparable robustness while keeping generalization close to a high-accuracy model.
为了减轻神经网络对对抗攻击的易感性,对抗训练已成为一种普遍且有效的防御策略。本质上,这一措施牺牲了模型在处理正常样本时的准确性。为解决这一权衡,我们首创了将核空间投影引入对抗训练,并提出了两种新颖的核空间投影 based adversarial training(NPAT)算法,解决样本生成和梯度优化问题,名为Null-space Projected Data Augmentation(NPDA)和Null-space Projected Gradient Descent(NPGD),以寻找一个总的最优解决方案,该方案可以在几乎不损失泛化性能的情况下提高鲁棒性。对抗样本和扰动在决策边界的核空间内受到约束,有效地减轻了攻击来自不可靠特征所带来的威胁。随后,我们在CIFAR10和SVHN数据集上进行了实验,证实了我们的方法可以与对抗训练方法无缝结合,获得与高准确度模型相当的保护性能,同时保持泛化性能接近于高准确度模型。
https://arxiv.org/abs/2409.11754
Vision Based Navigation consists in utilizing cameras as precision sensors for GNC after extracting information from images. To enable the adoption of machine learning for space applications, one of obstacles is the demonstration that available training datasets are adequate to validate the algorithms. The objective of the study is to generate datasets of images and metadata suitable for training machine learning algorithms. Two use cases were selected and a robust methodology was developed to validate the datasets including the ground truth. The first use case is in-orbit rendezvous with a man-made object: a mockup of satellite ENVISAT. The second use case is a Lunar landing scenario. Datasets were produced from archival datasets (Chang'e 3), from the laboratory at DLR TRON facility and at Airbus Robotic laboratory, from SurRender software high fidelity image simulator using Model Capture and from Generative Adversarial Networks. The use case definition included the selection of algorithms as benchmark: an AI-based pose estimation algorithm and a dense optical flow algorithm were selected. Eventually it is demonstrated that datasets produced with SurRender and selected laboratory facilities are adequate to train machine learning algorithms.
基于视觉的导航是指在从图像中提取信息后,利用摄像头作为高精度传感器来验证惯导导航控制系统的算法的技术。为了使机器学习在空间应用中得到采用,一个障碍是证明所使用的训练数据集足够充分来验证算法。本研究旨在生成适合训练机器学习算法的图像和元数据的图像。选择了两个用例,并开发了一种包括地面真实值的 robust 方法来验证数据集。第一个用例是在轨会合与人工物体:卫星 ENVISAT 的模型对照。第二个用例是月球着陆场景。数据集来源于:(Chang'e 3)归档数据集,(DLR TRON)实验室和空客机器人实验室,SurRender 高保真度图像模拟软件使用 Model Capture 和 Generative Adversarial Networks。用例定义包括选择基准算法:选择了一种基于 AI 的姿态估计算法和一种基于密度的光流算法作为基准。最终证明,使用 SurRender 和所选实验室设施生成的数据集是足够充分来训练机器学习算法。
https://arxiv.org/abs/2409.11383
Generalist web agents have evolved rapidly and demonstrated remarkable potential. However, there are unprecedented safety risks associated with these them, which are nearly unexplored so far. In this work, we aim to narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a threat model that discusses the adversarial targets, constraints, and attack scenarios. Particularly, we consider two types of adversarial targets: stealing users' specific personally identifiable information (PII) or stealing the entire user request. To achieve these objectives, we propose a novel attack method, termed Environmental Injection Attack (EIA). This attack injects malicious content designed to adapt well to different environments where the agents operate, causing them to perform unintended actions. This work instantiates EIA specifically for the privacy scenario. It inserts malicious web elements alongside persuasive instructions that mislead web agents into leaking private information, and can further leverage CSS and JavaScript features to remain stealthy. We collect 177 actions steps that involve diverse PII categories on realistic websites from the Mind2Web dataset, and conduct extensive experiments using one of the most capable generalist web agent frameworks to date, SeeAct. The results demonstrate that EIA achieves up to 70% ASR in stealing users' specific PII. Stealing full user requests is more challenging, but a relaxed version of EIA can still achieve 16% ASR. Despite these concerning results, it is important to note that the attack can still be detectable through careful human inspection, highlighting a trade-off between high autonomy and security. This leads to our detailed discussion on the efficacy of EIA under different levels of human supervision as well as implications on defenses for generalist web agents.
通用网络代理商发展迅速,表现出巨大的潜力。然而,与这些代理商相关的空前安全风险尚未被探索。在这项工作中,我们旨在通过在对抗环境中研究通用网络代理的隐私风险来缩小这个差距。首先,我们提出了一个威胁模型,讨论了攻击目标、约束和攻击场景。特别是,我们考虑了两种类型的攻击目标:窃取用户特定个人可识别信息(PII)或窃取整个用户请求。为了实现这些目标,我们提出了一个名为环境注入攻击(EIA)的新攻击方法。这个攻击通过注入恶意内容来适应代理商操作的环境,导致它们执行意外的行动。这项工作特别为隐私场景进行了EIA。它附加了恶意网络元素,同时在说服性指令中误导网络代理,使其泄露私人信息,并且还可以进一步利用CSS和JavaScript功能保持隐身。我们从Mind2Web数据集中收集了177个涉及不同PII类别的实际网站的行动步骤,并使用目前最强大的通用网络代理框架之一——SeeAct,进行广泛的实验。结果表明,EIA在窃取用户特定PII方面可以达到70%的ASR。窃取完整用户请求更具挑战性,但一个放松版本的EIA仍可以实现16%的ASR。尽管这些令人担忧的结果,但需要注意的是,攻击仍然可以通过仔细的人检查检测出来,这揭示了高自主性与安全之间的权衡。这导致了我们关于EIA在不同水平人类监督下的效果以及对其防御措施的详细讨论。
https://arxiv.org/abs/2409.11295
Multi-robot collaboration for target tracking presents significant challenges in hazardous environments, including addressing robot failures, dynamic priority changes, and other unpredictable factors. Moreover, these challenges are increased in adversarial settings if the environment is unknown. In this paper, we propose a resilient and adaptive framework for multi-robot, multi-target tracking in environments with unknown sensing and communication danger zones. The damages posed by these zones are temporary, allowing robots to track targets while accepting the risk of entering dangerous areas. We formulate the problem as an optimization with soft chance constraints, enabling real-time adjustments to robot behavior based on varying types of dangers and failures. An adaptive replanning strategy is introduced, featuring different triggers to improve group performance. This approach allows for dynamic prioritization of target tracking and risk aversion or resilience, depending on evolving resources and real-time conditions. To validate the effectiveness of the proposed method, we benchmark and evaluate it across multiple scenarios in simulation and conduct several real-world experiments.
多机器人协同目标跟踪在具有未知感测和通信危险区域的环境中具有显著的挑战,包括解决机器人故障、动态优先级变化和其他不可预测的因素。此外,如果在未知环境中,这些挑战将变得更加严重。在本文中,我们提出了一个适用于未知感测和通信危险区域的多机器人多目标跟踪的弹性自适应框架。这些区域所造成的伤害是暂时的,允许机器人在接受进入危险区域的风险的同时跟踪目标。我们将问题转化为带有软概率约束的优化问题,以便根据不同类型的危险和故障进行实时调整机器人的行为。我们引入了一种自适应规划策略,包括不同的触发器以提高团队表现。这种方法允许根据不断变化资源和实时条件动态优先化目标跟踪和风险回避或弹性。为了验证所提出方法的有效性,我们在仿真中进行了多次基准测试和现实世界实验。
https://arxiv.org/abs/2409.11230
Whole Slide Images (WSIs) are critical for various clinical applications, including histopathological analysis. However, current deep learning approaches in this field predominantly focus on individual tumor types, limiting model generalization and scalability. This relatively narrow focus ultimately stems from the inherent heterogeneity in histopathology and the diverse morphological and molecular characteristics of different tumors. To this end, we propose a novel approach for multi-cohort WSI analysis, designed to leverage the diversity of different tumor types. We introduce a Cohort-Aware Attention module, enabling the capture of both shared and tumor-specific pathological patterns, enhancing cross-tumor generalization. Furthermore, we construct an adversarial cohort regularization mechanism to minimize cohort-specific biases through mutual information minimization. Additionally, we develop a hierarchical sample balancing strategy to mitigate cohort imbalances and promote unbiased learning. Together, these form a cohesive framework for unbiased multi-cohort WSI analysis. Extensive experiments on a uniquely constructed multi-cancer dataset demonstrate significant improvements in generalization, providing a scalable solution for WSI classification across diverse cancer types. Our code for the experiments is publicly available at <link>.
完整的幻灯片图像(WSIs)对于各种临床应用(包括病理学分析)至关重要。然而,该领域目前主要关注于个体肿瘤类型,这限制了模型的泛化能力和可扩展性。这种相对狭窄的焦点最终源于病理学异质性和不同肿瘤的形态学和分子特征的多样性。因此,我们提出了一个名为多期 WSI 分析的新方法,旨在利用不同肿瘤类型的多样性。我们引入了Cohort-Aware 注意力模块,使捕捉到共享和肿瘤特异性病理模式,提高跨肿瘤泛化能力。此外,我们通过相互信息最小化构建了敌对 cohort regularization 机制,以最小化 cohort-specific biases。此外,我们还开发了层次化的样本平衡策略,通过减少 cohort imbalances 促进无偏学习。这些方法共同构成了一个无偏多期 WSI 分析的完整框架。在构建独特多癌种数据集的实验中,我们证明了泛化能力的显著提高,为 WSI 分类提供了一种可扩展的解决方案。我们的实验代码公开可用,<链接>。
https://arxiv.org/abs/2409.11119
Contextual question-answering models are susceptible to adversarial perturbations to input context, commonly observed in real-world scenarios. These adversarial noises are designed to degrade the performance of the model by distorting the textual input. We introduce a unique dataset that incorporates seven distinct types of adversarial noise into the context, each applied at five different intensity levels on the SQuAD dataset. To quantify the robustness, we utilize robustness metrics providing a standardized measure for assessing model performance across varying noise types and levels. Experiments on transformer-based question-answering models reveal robustness vulnerabilities and important insights into the model's performance in realistic textual input.
上下文问题回答模型容易受到输入上下文中的对抗扰动的影响,这种扰动在现实场景中很常见。这些对抗噪声旨在通过扭曲文本输入来降低模型的性能。我们引入了一个独特的数据集,其中包含七种不同的对抗噪声类型,每种类型在不同的强度水平上应用于SQuAD数据集。为了衡量鲁棒性,我们利用 robustness 指标,提供对模型性能在不同噪声类型和水平上的标准化衡量。对于基于Transformer的问答模型的实验表明,鲁棒性存在漏洞,并且对于现实文本输入,模型在处理不同噪声类型和水平时的重要见解。
https://arxiv.org/abs/2409.10997
Creating and updating pixel art character sprites with many frames spanning different animations and poses takes time and can quickly become repetitive. However, that can be partially automated to allow artists to focus on more creative tasks. In this work, we concentrate on creating pixel art character sprites in a target pose from images of them facing other three directions. We present a novel approach to character generation by framing the problem as a missing data imputation task. Our proposed generative adversarial networks model receives the images of a character in all available domains and produces the image of the missing pose. We evaluated our approach in the scenarios with one, two, and three missing images, achieving similar or better results to the state-of-the-art when more images are available. We also evaluate the impact of the proposed changes to the base architecture.
创建和更新像素艺术角色纹理需要时间和耐心,而且可能会变得重复。然而,这可以部分自动化,以便艺术家可以集中精力完成更创造性的任务。在这项工作中,我们专注于从它们面对其他三个方向的图像中创建像素艺术角色纹理的目标姿势。我们提出了一种将问题视为缺失数据修复任务的新颖方法。我们提出的生成对抗网络模型接收所有可用领域的图像中角色的图像,并产生缺失姿势的图像。我们在有1个、2个和3个缺失图像的场景中评估了我们的方法,当有更多图像可用时,我们获得与最先进方法相似或更好的结果。我们还评估了所提出的对基本架构的改变对性能的影响。
https://arxiv.org/abs/2409.10721
This work introduces a framework to diagnose the strengths and shortcomings of Autonomous Vehicle (AV) collision avoidance technology with synthetic yet realistic potential collision scenarios adapted from real-world, collision-free data. Our framework generates counterfactual collisions with diverse crash properties, e.g., crash angle and velocity, between an adversary and a target vehicle by adding perturbations to the adversary's predicted trajectory from a learned AV behavior model. Our main contribution is to ground these adversarial perturbations in realistic behavior as defined through the lens of data-alignment in the behavior model's parameter space. Then, we cluster these synthetic counterfactuals to identify plausible and representative collision scenarios to form the basis of a test suite for downstream AV system evaluation. We demonstrate our framework using two state-of-the-art behavior prediction models as sources of realistic adversarial perturbations, and show that our scenario clustering evokes interpretable failure modes from a baseline AV policy under evaluation.
本文提出了一种诊断自动驾驶车辆(AV)碰撞避免技术优势和不足的框架,该框架使用从现实世界中无碰撞数据的合成且真实的碰撞场景来评估。我们的框架通过向敌方预测轨迹 learned AV 行为模型添加扰动来生成各种碰撞特性的反事实碰撞。我们主要的贡献是将这些敌方扰动与数据对齐的行为模型参数空间中的真实行为相联系。然后,我们将这些合成反事实进行聚类,以识别可能的碰撞场景,为下游 AV 系统评估提供测试套件。我们使用两个最先进的 AV 行为预测模型作为现实世界中现实敌方扰动的来源,并表明,我们的场景聚类从评估基准 AV 策略中激发可解释的失效模式。
https://arxiv.org/abs/2409.10669
This paper presents a diffusion-based recommender system that incorporates classifier-free guidance. Most current recommender systems provide recommendations using conventional methods such as collaborative or content-based filtering. Diffusion is a new approach to generative AI that improves on previous generative AI approaches such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). We incorporate diffusion in a recommender system that mirrors the sequence users take when browsing and rating items. Although a few current recommender systems incorporate diffusion, they do not incorporate classifier-free guidance, a new innovation in diffusion models as a whole. In this paper, we present a diffusion recommender system that augments the underlying recommender system model for improved performance and also incorporates classifier-free guidance. Our findings show improvements over state-of-the-art recommender systems for most metrics for several recommendation tasks on a variety of datasets. In particular, our approach demonstrates the potential to provide better recommendations when data is sparse.
本文介绍了一种基于扩散的推荐系统,该系统集成了分类器无关指导。大多数现有的推荐系统使用传统方法,如合作或内容过滤来进行推荐。扩散是一种新的生成人工智能方法,它在VAE和生成对抗网络等先前的生成人工智能方法上进行了改进。我们将扩散融入到一个反映用户在浏览和评分项目时所采取的序列的推荐系统中。尽管目前一些推荐系统已经包含了扩散,但它们并没有包含分类器无关指导,这是扩散模型整体的全新创新。在本文中,我们介绍了一种用于增强底层推荐系统模型的扩散推荐系统,以提高性能,并还集成了分类器无关指导。我们的研究结果表明,对于大多数数据集,基于扩散的推荐系统在多个推荐任务上均优于最先进的推荐系统。特别是,我们的方法在数据稀疏时提供了更好的推荐。
https://arxiv.org/abs/2409.10494
This paper presents a novel hybrid quantum generative model, the VAE-QWGAN, which combines the strengths of a classical Variational AutoEncoder (VAE) with a hybrid Quantum Wasserstein Generative Adversarial Network (QWGAN). The VAE-QWGAN integrates the VAE decoder and QGAN generator into a single quantum model with shared parameters, utilizing the VAE's encoder for latent vector sampling during training. To generate new data from the trained model at inference, input latent vectors are sampled from a Gaussian Mixture Model (GMM), learnt on the training latent vectors. This, in turn, enhances the diversity and quality of generated images. We evaluate the model's performance on MNIST/Fashion-MNIST datasets, and demonstrate improved quality and diversity of generated images compared to existing approaches.
本文提出了一种新颖的混合量子生成模型,即VAE-QWGAN,它将经典变分自编码器(VAE)和混合量子Wasserstein生成对抗网络(QWGAN)的优势相结合。VAE-QWGAN将VAE编码器和解码器集成到一个共享参数的量子模型中,在训练期间利用VAE的编码器进行潜在向量采样。为了在推理时生成新的数据,从训练后的模型中采样协方差矩阵高斯混合模型(GMM),学习来自训练latent向量。这反过来又增强了生成的图像的多样性和质量。我们在MNIST/Fashion-MNIST数据集上评估了该模型的性能,并证明了与现有方法相比,生成的图像具有更好的质量和多样性。
https://arxiv.org/abs/2409.10339
Verification and validation of autonomous driving (AD) systems and components is of increasing importance, as such technology increases in real-world prevalence. Safety-critical scenario generation is a key approach to robustify AD policies through closed-loop training. However, existing approaches for scenario generation rely on simplistic objectives, resulting in overly-aggressive or non-reactive adversarial behaviors. To generate diverse adversarial yet realistic scenarios, we propose SEAL, a scenario perturbation approach which leverages learned scoring functions and adversarial, human-like skills. SEAL-perturbed scenarios are more realistic than SOTA baselines, leading to improved ego task success across real-world, in-distribution, and out-of-distribution scenarios, of more than 20%. To facilitate future research, we release our code and tools: this https URL
验证和验证自动驾驶(AD)系统和组件的重要性日益增加,因为这种技术在现实世界的普及程度越来越高。通过闭环训练来增强AD政策的鲁棒性是生成关键方法。然而,现有的场景生成方法依赖于简单的目标,导致过于激进或非反应性 adversarial 行为。为了生成多样、真实的 adversarial 场景,我们提出了 SEAL,一种基于学习评分函数和 adversarial,人类似技能的场景扰动方法。SEAL-扰动的场景比现有的 SOTA 基线更真实,在现实世界、离散和分布式场景中,成功率提高了20%以上。为了促进未来的研究,我们发布了我们的代码和工具:这个链接。
https://arxiv.org/abs/2409.10320
Multi-modality image fusion aims to integrate complementary data information from different imaging modalities into a single image. Existing methods often generate either blurry fused images that lose fine-grained semantic information or unnatural fused images that appear perceptually cropped from the inputs. In this work, we propose a novel two-phase discriminative autoencoder framework, termed DAE-Fuse, that generates sharp and natural fused images. In the adversarial feature extraction phase, we introduce two discriminative blocks into the encoder-decoder architecture, providing an additional adversarial loss to better guide feature extraction by reconstructing the source images. While the two discriminative blocks are adapted in the attention-guided cross-modality fusion phase to distinguish the structural differences between the fused output and the source inputs, injecting more naturalness into the results. Extensive experiments on public infrared-visible, medical image fusion, and downstream object detection datasets demonstrate our method's superiority and generalizability in both quantitative and qualitative evaluations.
多模态图像融合的目的是将不同成像模态的互补数据信息集成到单个图像中。现有的方法通常生成模糊的融合图像,损失了细粒度语义信息,或生成了从输入看起来不自然的融合图像。在本文中,我们提出了一个名为DAE-Fuse的新颖二阶段判别自动编码器框架,它生成清晰和自然的融合图像。在 adversarial 特征提取阶段,我们将两个判别块引入到编码器-解码器架构中,为更好地指导通过重构源图像进行特征提取提供额外的对抗损失。虽然这两个判别块在注意力引导跨模态融合阶段中适应,以区分融合输出和源输入之间的结构差异,并在结果中增加自然性,但通过大量公开红外-可见医学图像融合和下游目标检测数据集的实验,证明了我们的方法在数量和质量评估方面的卓越性和普适性。
https://arxiv.org/abs/2409.10080
The deployment of embodied navigation agents in safety-critical environments raises concerns about their vulnerability to adversarial attacks on deep neural networks. However, current attack methods often lack practicality due to challenges in transitioning from the digital to the physical world, while existing physical attacks for object detection fail to achieve both multi-view effectiveness and naturalness. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches with learnable textures and opacity to objects. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which uses feedback from the navigation model to optimize the patch's texture. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, where opacity is refined after texture optimization. Experimental results show our adversarial patches reduce navigation success rates by about 40%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: [this https URL].
在将具有感知能力的智能体部署在关键环境中时,人们担心它们容易受到深度神经网络的对抗攻击。然而,由于从数字世界到物理世界的转换困难,当前的攻击方法通常缺乏实际应用。同时,现有的物体检测攻击方法也无法实现多视角效果和自然性。为了应对这个问题,我们提出了一种实用的攻击方法,通过在物体上附加具有可学习纹理和透明度的对抗补丁。具体来说,为了确保效果在不同的视点保持一致,我们基于物体感知的采样多视角优化策略,利用导航模型的反馈来优化补丁的纹理。为了使补丁对人类观察者不可见,我们引入了两个阶段的透明度优化机制,在纹理优化后进行透明度的优化。实验结果表明,我们的对抗补丁将导航成功率降低了约40%,超过了前方法的实用性、有效性和自然性。代码可在此处访问:[https:// this URL]。
https://arxiv.org/abs/2409.10071
Learning compact and meaningful latent space representations has been shown to be very useful in generative modeling tasks for visual data. One particular example is applying Vector Quantization (VQ) in variational autoencoders (VQ-VAEs, VQ-GANs, etc.), which has demonstrated state-of-the-art performance in many modern generative modeling applications. Quantizing the latent space has been justified by the assumption that the data themselves are inherently discrete in the latent space (like pixel values). In this paper, we propose an alternative representation of the latent space by relaxing the structural assumption than the VQ formulation. Specifically, we assume that the latent space can be approximated by a union of subspaces model corresponding to a dictionary-based representation under a sparsity constraint. The dictionary is learned/updated during the training process. We apply this approach to look at two models: Dictionary Learning Variational Autoencoders (DL-VAEs) and DL-VAEs with Generative Adversarial Networks (DL-GANs). We show empirically that our more latent space is more expressive and has leads to better representations than the VQ approach in terms of reconstruction quality at the expense of a small computational overhead for the latent space computation. Our results thus suggest that the true benefit of the VQ approach might not be from discretization of the latent space, but rather the lossy compression of the latent space. We confirm this hypothesis by showing that our sparse representations also address the codebook collapse issue as found common in VQ-family models.
学习紧凑且富有意义的潜在空间表示在生成建模任务中对于视觉数据非常有益。一个具体的例子是在变分自编码器(VQ-VAE,VQ-GAN等)中应用向量量化(VQ),已经在许多现代生成建模应用中证明了最先进的性能。将潜在空间量化为数据固有的离散性质(如像素值)的假设,为向量量化提供了正当理由。在本文中,我们提出了另一种潜在空间表示,通过放宽VQ公式的结构假设,将潜在空间建模为一个字典基于表示的子空间模型的并集。字典在训练过程中学习/更新。我们将这种方法应用于看两个模型:词典学习变分自编码器(DL-VAEs)和词典学习变分自编码器(DL-VAEs)与生成对抗网络(DL-GANs)。我们通过实验实证地证明,我们的更丰富的潜在空间具有更丰富的表达性,并且在一定程度上优于VQ方法在重构质量方面的表现,但代价是计算开销略小。因此,我们得出的结论是,VQ方法的真实益处可能不在于对潜在空间的离散化,而在于对潜在空间的损失压缩。我们证实了这一假设,通过实验发现我们的稀疏表示同样解决了VQ家族模型中常见代码book collapse问题。
https://arxiv.org/abs/2409.11184
The rapid development of large-scale text-to-speech (TTS) models has led to significant advancements in modeling diverse speaker prosody and voices. However, these models often face issues such as slow inference speeds, reliance on complex pre-trained neural codec representations, and difficulties in achieving naturalness and high similarity to reference speakers. To address these challenges, this work introduces StyleTTS-ZS, an efficient zero-shot TTS model that leverages distilled time-varying style diffusion to capture diverse speaker identities and prosodies. We propose a novel approach that represents human speech using input text and fixed-length time-varying discrete style codes to capture diverse prosodic variations, trained adversarially with multi-modal discriminators. A diffusion model is then built to sample this time-varying style code for efficient latent diffusion. Using classifier-free guidance, StyleTTS-ZS achieves high similarity to the reference speaker in the style diffusion process. Furthermore, to expedite sampling, the style diffusion model is distilled with perceptual loss using only 10k samples, maintaining speech quality and similarity while reducing inference speed by 90%. Our model surpasses previous state-of-the-art large-scale zero-shot TTS models in both naturalness and similarity, offering a 10-20 faster sampling speed, making it an attractive alternative for efficient large-scale zero-shot TTS systems. The audio demo, code and models are available at this https URL.
大规模文本转语音(TTS)模型的快速发展导致了对多样化说话人声和声音的建模取得了显著的进步。然而,这些模型通常面临诸如推理速度较慢、依赖于复杂的预训练神经编码表示以及难以实现自然度和与参考说话者的高相似度等问题。为解决这些挑战,本文介绍了一种有效的零散TTS模型——StyleTTS-ZS,该模型利用提取的时间变长风格扩散来捕捉多样说话人的身份和语调。我们提出了一种将人类语音表示为输入文本和固定长度的时间变长离散风格码的新颖方法,以捕捉多样语调变化,并通过多模态判别器进行 adversarial 训练。接下来,我们构建了一个扩散模型,用于抽样时间变长风格码以实现高效的潜在扩散。利用分类器无关的指导,StyleTTS-ZS在风格扩散过程中实现了与参考说话者的高相似度。此外,为了加速抽样,我们使用仅10k个样本的感知损失对风格扩散模型进行蒸馏,在保持语音质量和相似性的同时,将推理速度降低90%。我们的模型在自然性和相似性方面超过了之前的最佳大规模零散TTS模型,提供了10-20倍的更快抽样速度,成为高效大型零散TTS系统的有吸引力的新选择。音频演示、代码和模型都可以在这个https://url.com/处找到。
https://arxiv.org/abs/2409.10058
Omni-directional images have been increasingly used in various applications, including virtual reality and SNS (Social Networking Services). However, their availability is comparatively limited in contrast to normal field of view (NFoV) images, since specialized cameras are required to take omni-directional images. Consequently, several methods have been proposed based on generative adversarial networks (GAN) to synthesize omni-directional images, but these approaches have shown difficulties in training of the models, due to instability and/or significant time consumption in the training. To address these problems, this paper proposes a novel omni-directional image synthesis method, 2S-ODIS (Two-Stage Omni-Directional Image Synthesis), which generated high-quality omni-directional images but drastically reduced the training time. This was realized by utilizing the VQGAN (Vector Quantized GAN) model pre-trained on a large-scale NFoV image database such as ImageNet without fine-tuning. Since this pre-trained model does not represent distortions of omni-directional images in the equi-rectangular projection (ERP), it cannot be applied directly to the omni-directional image synthesis in ERP. Therefore, two-stage structure was adopted to first create a global coarse image in ERP and then refine the image by integrating multiple local NFoV images in the higher resolution to compensate the distortions in ERP, both of which are based on the pre-trained VQGAN model. As a result, the proposed method, 2S-ODIS, achieved the reduction of the training time from 14 days in OmniDreamer to four days in higher image quality.
越来越多的应用程序包括虚拟现实和社交网络服务(SNS)中使用全向量图像(Omni-directional images)。然而,与普通场视野(NFoV)图像相比,它们的可用性相对有限,因为需要专用相机才能拍摄全向量图像。因此,基于生成对抗网络(GAN)提出了几种方法来合成全向量图像,但这些方法在训练模型时遇到了困难,因为训练过程中存在不稳定和/或显著的延迟。为了应对这些问题,本文提出了一种新颖的全向量图像合成方法:2S-ODIS(两阶段全向量图像合成),它生成了高质量的全向量图像,但显著减少了训练时间。这是通过利用预训练的大型NFoV图像数据库ImageNet,而无需对模型进行微调来实现的。由于这个预训练模型没有在ERP上表示全向量图像的失真,因此它不能直接应用于ERP的全向量图像合成。因此,采用了两个阶段的结构,首先在ERP上创建全局粗图像,然后通过在更高分辨率中整合多个局部NFoV图像来优化图像,这都是基于预训练的VQGAN模型。因此,与OmniDreamer相比,所提出的方法2S-ODIS将训练时间从14天减少到4天,实现了高图像质量的全向量图像合成。
https://arxiv.org/abs/2409.09969
Traffic Sign Recognition (TSR) is crucial for safe and correct driving automation. Recent works revealed a general vulnerability of TSR models to physical-world adversarial attacks, which can be low-cost, highly deployable, and capable of causing severe attack effects such as hiding a critical traffic sign or spoofing a fake one. However, so far existing works generally only considered evaluating the attack effects on academic TSR models, leaving the impacts of such attacks on real-world commercial TSR systems largely unclear. In this paper, we conduct the first large-scale measurement of physical-world adversarial attacks against commercial TSR systems. Our testing results reveal that it is possible for existing attack works from academia to have highly reliable (100\%) attack success against certain commercial TSR system functionality, but such attack capabilities are not generalizable, leading to much lower-than-expected attack success rates overall. We find that one potential major factor is a spatial memorization design that commonly exists in today's commercial TSR systems. We design new attack success metrics that can mathematically model the impacts of such design on the TSR system-level attack success, and use them to revisit existing attacks. Through these efforts, we uncover 7 novel observations, some of which directly challenge the observations or claims in prior works due to the introduction of the new metrics.
交通信号识别(TSR)对于安全且正确的自动驾驶自动化至关重要。最近的工作揭示了TSR模型对物理世界攻击的普遍漏洞,这些攻击可能成本低廉、部署方便,甚至可能造成严重攻击效果,如隐藏关键交通标志或伪造虚假标志。然而,迄今为止,现有的工作只考虑评估学术TSR模型的攻击效果,而未对现实世界的商业TSR系统产生影响。在本文中,我们对商业TSR系统进行了首次针对物理世界攻击的大规模测量。我们的测试结果表明,现有的学术攻击作品在某些商业TSR系统的某些功能上具有高度可靠的(100\%)攻击成功率,但这种攻击能力并不具有普适性,导致整体攻击成功率远低于预期。我们发现,一个可能的主要因素是存在于当前商业TSR系统中的空间记忆设计。我们设计了一些可以数学地描述这种设计对TSR系统级攻击成功影响的新的攻击成功指标,并使用它们重新审视现有的攻击。通过这些努力,我们发现了7个新的观察结果,其中一些直接挑战了之前作品中的观察或声称,因为引入了新的指标。
https://arxiv.org/abs/2409.09860
In the age of powerful diffusion models such as DALL-E and Stable Diffusion, many in the digital art community have suffered style mimicry attacks due to fine-tuning these models on their works. The ability to mimic an artist's style via text-to-image diffusion models raises serious ethical issues, especially without explicit consent. Glaze, a tool that applies various ranges of perturbations to digital art, has shown significant success in preventing style mimicry attacks, at the cost of artifacts ranging from imperceptible noise to severe quality degradation. The release of Glaze has sparked further discussions regarding the effectiveness of similar protection methods. In this paper, we propose GLEAN- applying I2I generative networks to strip perturbations from Glazed images, evaluating the performance of style mimicry attacks before and after GLEAN on the results of Glaze. GLEAN aims to support and enhance Glaze by highlighting its limitations and encouraging further development.
在诸如DALL-E和Stable Diffusion等强大的扩散模型出现的时代,许多数字艺术领域的人由于在他们的作品上对这些模型进行微调而遭受了风格模仿攻击。通过文本转图像扩散模型模仿艺术家的风格,引发了严重的道德问题,尤其是没有明确授权的情况下。Glaze是一个应用各种范围扰动的数字艺术工具,已经在防止风格模仿攻击方面取得了显著的成功,代价是從微不可见的噪音到严重质量退化。Glaze的发布引发了关于类似保护方法的效性的进一步讨论。在本文中,我们提出了GLEAN,将I2I生成网络应用于Glazed图像,在GLEAN之前和之后比较了风格模仿攻击的效果。GLEAN旨在通过突出其局限性并鼓励进一步发展来支持和增强Glaze。
https://arxiv.org/abs/2409.10578
This paper presents a novel approach to training neural networks with formal safety guarantees using semidefinite programming (SDP) for verification. Our method focuses on verifying safety over large, high-dimensional input regions, addressing limitations of existing techniques that focus on adversarial robustness bounds. We introduce an ADMM-based training scheme for an accurate neural network classifier on the Adversarial Spheres dataset, achieving provably perfect recall with input dimensions up to $d=40$. This work advances the development of reliable neural network verification methods for high-dimensional systems, with potential applications in safe RL policies.
本文提出了一种使用半定规划(SDP)基于形式安全保证训练神经网络的新方法,用于验证。我们的方法重点关注验证大型、高维输入区域的完整性,解决了现有技术集中在对抗性鲁棒性上限的局限性。我们引入了一种基于ADMM的神经网络分类器训练方案,在Adversarial Spheres数据集上实现了具有可证明完美召回的输入维度达到$d=40$的情况。本研究为高维系统可靠神经网络验证方法的发展做出了贡献,这些方法的应用在安全强化学习策略中具有潜在意义。
https://arxiv.org/abs/2409.09687