The rapid development of AR/VR brings tremendous demand for 3D content. While the widely used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of human-computer interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content that reflects creators' ideas. Precise drawings from multiple views or strategic step-by-step drawing are often required to tackle the challenge, but these are not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling using only a single free-hand sketch, without multiple sketches or view information as input. Specifically, we introduce a lightweight generation network for efficient real-time inference and a structure-aware adversarial training approach with a Stroke Enhancement Module (SEM) that captures structural information to facilitate the learning of realistic and fine-detailed shape structures for high-fidelity performance. Extensive experiments demonstrate the effectiveness of our approach, with state-of-the-art (SOTA) performance on both synthetic and real datasets.
https://arxiv.org/abs/2309.13006
The reconstruction kernel in computed tomography (CT) generation determines the texture of the image. Consistency in reconstruction kernels is important, as the underlying CT texture can impact measurements during quantitative image analysis. Harmonization (i.e., kernel conversion) minimizes differences in measurements due to inconsistent reconstruction kernels. Existing methods investigate harmonization of CT scans within a single manufacturer or across multiple manufacturers. However, these methods require paired scans of hard and soft reconstruction kernels that are spatially and anatomically aligned. Additionally, a large number of models need to be trained across different kernel pairs within manufacturers. In this study, we adopt an unpaired image translation approach to investigate harmonization between and across reconstruction kernels from different manufacturers by constructing a multipath cycle generative adversarial network (GAN). We use hard and soft reconstruction kernels from the Siemens and GE vendors in the National Lung Screening Trial dataset. We use 50 scans from each reconstruction kernel and train a multipath cycle GAN. To evaluate the effect of harmonization on the reconstruction kernels, we harmonize 50 scans each from the Siemens hard kernel, GE soft kernel, and GE hard kernel to a reference Siemens soft kernel (B30f) and evaluate percent emphysema. We fit a linear model accounting for age, smoking status, sex, and vendor, and perform an analysis of variance (ANOVA) on the emphysema scores. Our approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status, and vendor on emphysema quantification.
https://arxiv.org/abs/2309.12953
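Percent emphysema, the outcome measure in the abstract above, is commonly computed as the fraction of lung voxels below a fixed attenuation threshold (the LAA-950 convention, i.e. below -950 HU). The paper's exact threshold and lung-masking pipeline are not stated in the abstract, so this NumPy sketch is an illustrative assumption:

```python
import numpy as np

def percent_emphysema(hu_volume, lung_mask, threshold=-950):
    """Percent of lung voxels below the HU threshold (LAA-950 convention)."""
    lung_voxels = hu_volume[lung_mask]
    if lung_voxels.size == 0:
        return 0.0
    return 100.0 * np.mean(lung_voxels < threshold)

# Toy 2x2 "volume": 2 of 4 lung voxels fall below -950 HU.
vol = np.array([[-960.0, -900.0], [-700.0, -980.0]])
mask = np.array([[True, True], [True, True]])
print(percent_emphysema(vol, mask))  # 50.0
```

A harmonization method then succeeds if this score stays stable when the same scan is converted between kernels.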
Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams. With the advances seen recently with deep neural networks, it has become a popular technology to activate and control small devices, such as voice assistants. Relying on such models for edge devices, however, can be challenging due to hardware constraints. Moreover, as adversarial attacks have increased against voice-based technologies, developing solutions robust to such attacks has become crucial. In this work, we propose VIC-KD, a robust distillation recipe for model compression and adversarial robustness. Using self-supervised speech representations, we show that imposing geometric priors to the latent representations of both Teacher and Student models leads to more robust target models. Experiments on the Google Speech Commands datasets show that the proposed methodology improves upon current state-of-the-art robust distillation methods, such as ARD and RSLAD, by 12% and 8% in robust accuracy, respectively.
https://arxiv.org/abs/2309.12914
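The abstract does not spell out the geometric priors imposed on the latent representations. If "VIC" follows the variance-invariance-covariance recipe popularized by VICReg (an assumption on our part, not confirmed by the abstract), the regularizer on teacher/student latents might look roughly like this hypothetical NumPy sketch:

```python
import numpy as np

def vic_regularizer(z_t, z_s, var_target=1.0, eps=1e-4):
    """VICReg-style terms on a batch of teacher (z_t) and student (z_s) latents."""
    inv = np.mean((z_t - z_s) ** 2)                   # invariance: match latents
    std = np.sqrt(z_s.var(axis=0) + eps)
    var = np.mean(np.maximum(0.0, var_target - std))  # variance: avoid collapse
    zc = z_s - z_s.mean(axis=0)
    cov = zc.T @ zc / (len(z_s) - 1)
    off = cov - np.diag(np.diag(cov))
    cov_pen = (off ** 2).sum() / z_s.shape[1]         # covariance: decorrelate dims
    return inv + var + cov_pen
```

With matched latents that already have unit variance and decorrelated dimensions, all three terms vanish; collapsed or mismatched latents are penalized.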
Few-shot learning has made impressive strides in addressing the crucial challenges of recognizing unknown samples from novel classes in target query sets and managing visual shifts between domains. However, existing techniques fall short when it comes to identifying target outliers under domain shifts by learning to reject pseudo-outliers from the source domain, resulting in an incomplete solution to both problems. To address these challenges comprehensively, we propose a novel approach called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) and introduce a meta-learning-based architecture named DAFOS-NET. During training, our model learns a shared and discriminative embedding space while creating a pseudo open-space decision boundary, given a fully-supervised source domain and a label-disjoint few-shot target domain. To enhance data density, we use a pair of conditional adversarial networks with tunable noise variances to augment both domains' closed and pseudo-open spaces. Furthermore, we propose a domain-specific batch-normalized class prototype alignment strategy to align both domains globally while ensuring class-discriminativeness through novel metric objectives. Our training approach ensures that DAFOS-NET can generalize well to new scenarios in the target domain. We present three benchmarks for DA-FSOS based on the Office-Home, mini-ImageNet/CUB, and DomainNet datasets and demonstrate the efficacy of DAFOS-NET through extensive experimentation.
https://arxiv.org/abs/2309.12814
Robot multimodal locomotion encompasses the ability to transition between walking and flying, representing a significant challenge in robotics. This work presents an approach that enables automatic smooth transitions between legged and aerial locomotion. Leveraging the concept of Adversarial Motion Priors, our method allows the robot to imitate motion datasets and accomplish the desired task without the need for complex reward functions. The robot learns walking patterns from human-like gaits and aerial locomotion patterns from motions obtained using trajectory optimization. Through this process, the robot adapts the locomotion scheme based on environmental feedback using reinforcement learning, with the spontaneous emergence of mode-switching behavior. The results highlight the potential for achieving multimodal locomotion in aerial humanoid robotics through automatic control of walking and flying modes, paving the way for applications in diverse domains such as search and rescue, surveillance, and exploration missions. This research contributes to advancing the capabilities of aerial humanoid robots in terms of versatile locomotion in various environments.
https://arxiv.org/abs/2309.12784
Barren plateaus are a central bottleneck in the scalability of variational quantum algorithms (VQAs), and are known to arise in various ways, from circuit depth and hardware noise to global observables. However, a caveat of most existing results is the requirement of t-design circuit assumptions that are typically not satisfied in practice. In this work, we loosen these assumptions altogether and derive tight upper and lower bounds on gradient concentration, for a large class of parameterized quantum circuits and arbitrary observables. By requiring only a couple of design choices that are constructive and easily verified, our results can readily be leveraged to rule out barren plateaus for explicit circuits and mixed observables, namely, observables containing a non-vanishing local term. This insight has direct implications for hybrid Quantum Generative Adversarial Networks (qGANs), a generative model that can be reformulated as a VQA with an observable composed of local and global terms. We prove that designing the discriminator appropriately leads to 1-local weights that stay constant in the number of qubits, regardless of discriminator depth. Combined with our first contribution, this implies that qGANs with shallow generators can be trained at scale without suffering from barren plateaus -- making them a promising candidate for applications in generative quantum machine learning. We demonstrate this result by training a qGAN to learn a 2D mixture of Gaussian distributions with up to 16 qubits, and provide numerical evidence that global contributions to the gradient, while initially exponentially small, may kick in substantially over the course of training.
https://arxiv.org/abs/2309.12681
As Machine Learning (ML) is increasingly used in solving various tasks in real-world applications, it is crucial to ensure that ML algorithms are robust to any potential worst-case noises, adversarial attacks, and highly unusual situations when they are designed. Studying ML robustness will significantly help in the design of ML algorithms. In this paper, we investigate ML robustness using adversarial training in centralized and decentralized environments, where ML training and testing are conducted in one or multiple computers. In the centralized environment, we achieve a test accuracy of 65.41% and 83.0% when classifying adversarial examples generated by Fast Gradient Sign Method and DeepFool, respectively. Comparing to existing studies, these results demonstrate an improvement of 18.41% for FGSM and 47% for DeepFool. In the decentralized environment, we study Federated learning (FL) robustness by using adversarial training with independent and identically distributed (IID) and non-IID data, respectively, where CIFAR-10 is used in this research. In the IID data case, our experimental results demonstrate that we can achieve such a robust accuracy that it is comparable to the one obtained in the centralized environment. Moreover, in the non-IID data case, the natural accuracy drops from 66.23% to 57.82%, and the robust accuracy decreases by 25% and 23.4% in C&W and Projected Gradient Descent (PGD) attacks, compared to the IID data case, respectively. We further propose an IID data-sharing approach, which allows for increasing the natural accuracy to 85.04% and the robust accuracy from 57% to 72% in C&W attacks and from 59% to 67% in PGD attacks.
https://arxiv.org/abs/2309.12593
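The Fast Gradient Sign Method used above to craft adversarial examples has a closed form for simple differentiable models. A minimal sketch for binary logistic regression (the weights here are hypothetical, not the paper's models):

```python
import numpy as np

def fgsm_logistic(x, y, w, b, eps):
    """FGSM: x' = x + eps * sign(dL/dx) for logistic regression with
    cross-entropy loss L, where dL/dx = (sigmoid(w.x + b) - y) * w."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability
    grad_x = (p - y) * w                    # gradient of the loss w.r.t. the input
    return x + eps * np.sign(grad_x)

w = np.array([2.0, -1.0]); b = 0.0
x = np.array([0.5, 0.5]); y = 1.0
x_adv = fgsm_logistic(x, y, w, b, eps=0.1)
print(x_adv)  # [0.4 0.6] -- each coordinate nudged against the true class
```

Adversarial training then mixes such perturbed inputs into the training batches, which is what yields the robust-accuracy comparisons reported above.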
In surveillance, accurately recognizing license plates is hindered by their often low quality and small dimensions, compromising recognition precision. Despite advancements in AI-based image super-resolution, methods like Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) still fall short in enhancing license plate images. This study leverages the cutting-edge diffusion model, which has consistently outperformed other deep learning techniques in image restoration. By training this model using a curated dataset of Saudi license plates, both in low and high resolutions, we discovered the diffusion model's superior efficacy. The method achieves a 12.55% and 37.32% improvement in Peak Signal-to-Noise Ratio (PSNR) over SwinIR and ESRGAN, respectively. Moreover, our method surpasses these techniques in terms of Structural Similarity Index (SSIM), registering a 4.89% and 17.66% improvement over SwinIR and ESRGAN, respectively. Furthermore, 92% of human evaluators preferred our images over those from other algorithms. In essence, this research presents a pioneering solution for license plate super-resolution, with tangible potential for surveillance systems.
https://arxiv.org/abs/2309.12506
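The PSNR figures reported above follow the standard definition, 10 log10(MAX^2 / MSE); a reference implementation for 8-bit images:

```python
import numpy as np

def psnr(ref, img, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(np.float64) - img.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.full((8, 8), 100.0)
noisy = ref + 10.0                  # constant error -> MSE = 100
print(round(psnr(ref, noisy), 2))   # 28.13
```

The percent improvements above are relative PSNR/SSIM gains over the baselines, not absolute dB values.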
Instruction-tuned Large Language Models (It-LLMs) have been exhibiting outstanding abilities to reason about the cognitive states, intentions, and reactions of all people involved, letting humans guide and comprehend day-to-day social interactions effectively. In fact, several multiple-choice question (MCQ) benchmarks have been proposed to construct solid assessments of the models' abilities. However, earlier works have demonstrated the presence of an inherent "order bias" in It-LLMs, posing challenges to appropriate evaluation. In this paper, we investigate It-LLMs' resilience towards a series of probing tests using four MCQ benchmarks. Introducing adversarial examples, we show a significant performance gap, mainly when varying the order of the choices, which reveals a selection bias and brings their reasoning abilities into discussion. Based on a correlation between first positions and model choices due to positional bias, we hypothesized the presence of structural heuristics in the decision-making process of the It-LLMs, strengthened by including significant examples in few-shot scenarios. Finally, by using the Chain-of-Thought (CoT) technique, we elicit the model to reason and mitigate the bias, obtaining more robust models.
https://arxiv.org/abs/2309.12481
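The order-bias probe described above can be illustrated with a toy harness that permutes the answer options and checks whether the selected answer *text* changes (the `model` callable here is a stand-in for illustration, not the paper's actual It-LLM interface):

```python
from itertools import permutations

def order_sensitivity(model, question, choices):
    """Fraction of choice-order permutations for which the model's selected
    answer text differs from its answer under the canonical order."""
    base = model(question, list(choices))
    flips = 0
    total = 0
    for perm in permutations(choices):
        total += 1
        if model(question, list(perm)) != base:
            flips += 1
    return flips / total

# Mock model that always picks the first listed choice (pure position bias).
first_picker = lambda q, ch: ch[0]
print(order_sensitivity(first_picker, "2+2?", ["3", "4", "5"]))  # 4 of 6 permutations flip
```

A perfectly order-invariant model would score 0.0 on this probe; a purely positional one scores high.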
Prompt Tuning is a popular parameter-efficient finetuning method for pre-trained large language models (PLMs). Recently, based on experiments with RoBERTa, it has been suggested that Prompt Tuning activates specific neurons in the transformer's feed-forward networks, that are highly predictive and selective for the given task. In this paper, we study the robustness of Prompt Tuning in relation to these "skill neurons", using RoBERTa and T5. We show that prompts tuned for a specific task are transferable to tasks of the same type but are not very robust to adversarial data, with higher robustness for T5 than RoBERTa. At the same time, we replicate the existence of skill neurons in RoBERTa and further show that skill neurons also seem to exist in T5. Interestingly, the skill neurons of T5 determined on non-adversarial data are also among the most predictive neurons on the adversarial data, which is not the case for RoBERTa. We conclude that higher adversarial robustness may be related to a model's ability to activate the relevant skill neurons on adversarial data.
https://arxiv.org/abs/2309.12263
Biomedical image datasets can be imbalanced due to the rarity of targeted diseases. Generative Adversarial Networks play a key role in addressing this imbalance by enabling the generation of synthetic images to augment datasets. It is important to generate synthetic images that incorporate a diverse range of features to accurately represent the distribution of features present in the training imagery. Furthermore, the absence of diverse features in synthetic images can degrade the performance of machine learning classifiers. The mode collapse problem impacts Generative Adversarial Networks' capacity to generate diversified images. Mode collapse comes in two varieties: intra-class and inter-class. In this paper, both varieties of the mode collapse problem are investigated, and their subsequent impact on the diversity of synthetic X-ray images is evaluated. This work contributes an empirical demonstration of the benefits of integrating the adaptive input-image normalization with the Deep Convolutional GAN and Auxiliary Classifier GAN to alleviate the mode collapse problems. Synthetically generated images are utilized for data augmentation and training a Vision Transformer model. The classification performance of the model is evaluated using accuracy, recall, and precision scores. Results demonstrate that the DCGAN and the ACGAN with adaptive input-image normalization outperform the DCGAN and ACGAN with un-normalized X-ray images as evidenced by the superior diversity scores and classification scores.
https://arxiv.org/abs/2309.12245
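The "adaptive input-image normalization" above is not detailed in the abstract; one plausible reading is a per-image rescaling (each X-ray normalized by its own intensity range rather than a dataset-wide constant). The sketch below is purely an illustrative assumption, not the paper's method:

```python
import numpy as np

def per_image_normalize(img, lo=-1.0, hi=1.0, eps=1e-8):
    """Rescale a single image to [lo, hi] using its own min/max, so each
    input adapts to its own dynamic range before entering the GAN."""
    mn, mx = img.min(), img.max()
    return lo + (hi - lo) * (img - mn) / (mx - mn + eps)

xray = np.array([[0.0, 64.0], [128.0, 255.0]])
print(per_image_normalize(xray))  # values spread across roughly [-1, 1]
```

Feeding such range-adapted images to DCGAN/ACGAN generators is what the paper evaluates against the un-normalized baseline.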
Text-guided image generation aims to generate desired images conditioned on given texts, while text-guided image manipulation refers to semantically editing parts of a given image based on specified texts. For these two similar tasks, the key point is to ensure image fidelity as well as semantic consistency. Many previous approaches require complex multi-stage generation and adversarial training, and struggle to provide a unified framework for both tasks. In this work, we propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. The proposed method accepts input from images or random noise, corresponding to these two different tasks, and, conditioned on the specific texts, a carefully designed mapping network that exploits the powerful generative capabilities of StyleGAN and the text-image representation capabilities of Contrastive Language-Image Pre-training (CLIP) generates images of up to $1024\times1024$ resolution, the highest currently attainable. Extensive experiments on the Multi-modal CelebA-HQ dataset demonstrate that our proposed method outperforms existing state-of-the-art methods on both text-guided generation and manipulation tasks.
https://arxiv.org/abs/2309.11923
Safety controllers are widely used to achieve safe reinforcement learning. Most methods that apply a safety controller construct it from handcrafted safety constraints. However, when the environment dynamics are sophisticated, handcrafted safety constraints become unavailable, so it is worthwhile to study constructing safety controllers with learning algorithms. We propose a three-stage architecture for safe reinforcement learning, namely the TU-Recovery Architecture. A safety critic and a recovery policy are learned before task training; together they form a safety controller that ensures safety during task training. We then describe a phenomenon induced by disagreement between the task policy and the recovery policy, called the adversarial phenomenon, which reduces learning efficiency and model performance. An auxiliary reward is proposed to mitigate the adversarial phenomenon while helping the task policy learn to recover from high-risk states. A series of experiments is conducted in a robot navigation environment. The experiments demonstrate that TU-Recovery outperforms its unconstrained counterpart in both reward gain and constraint violations during task training, and the auxiliary reward further improves TU-Recovery's reward-to-cost ratio by significantly reducing constraint violations.
https://arxiv.org/abs/2309.11907
We present a novel adversarial model for authentication systems that use gait patterns recorded by the inertial measurement unit (IMU) built into smartphones. The attack idea is inspired by and named after the concept of a dictionary attack on knowledge (PIN or password) based authentication systems. In particular, this work investigates whether it is possible to build a dictionary of IMUGait patterns and use it to launch an attack or find an imitator who can actively reproduce IMUGait patterns that match the target's IMUGait pattern. Nine physically and demographically diverse individuals walked at various levels of four predefined controllable and adaptable gait factors (speed, step length, step width, and thigh-lift), producing 178 unique IMUGait patterns. Each pattern attacked a wide variety of user authentication models. The deeper analysis of error rates (before and after the attack) challenges the belief that authentication systems based on IMUGait patterns are the most difficult to spoof; further research is needed on adversarial models and associated countermeasures.
https://arxiv.org/abs/2309.11766
Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, due to the unsolved adversarial robustness problem of vision models, MLLMs can have more severe safety and security risks by introducing the vision inputs. In this work, we study the adversarial robustness of Google's Bard, a competitive chatbot to ChatGPT that released its multimodal capability recently, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard to output wrong image descriptions with a 22% success rate based solely on transferability. We show that the adversarial examples can also attack other MLLMs, e.g., a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE Bot. Moreover, we identify two defense mechanisms of Bard, including face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding of the robustness of MLLMs and facilitate future research on defenses. Our code is available at this https URL.
https://arxiv.org/abs/2309.11751
Unsupervised domain adaptation (UDA) is an effective approach to handle the lack of annotations in the target domain for the semantic segmentation task. In this work, we consider a more practical UDA setting where the target domain contains sequential frames of the unlabeled videos which are easy to collect in practice. A recent study suggests self-supervised learning of the object motion from unlabeled videos with geometric constraints. We design a motion-guided domain adaptive semantic segmentation framework (MoDA), that utilizes self-supervised object motion to learn effective representations in the target domain. MoDA differs from previous methods that use temporal consistency regularization for the target domain frames. Instead, MoDA deals separately with the domain alignment on the foreground and background categories using different strategies. Specifically, MoDA contains foreground object discovery and foreground semantic mining to align the foreground domain gaps by taking the instance-level guidance from the object motion. Additionally, MoDA includes background adversarial training which contains a background category-specific discriminator to handle the background domain gaps. Experimental results on multiple benchmarks highlight the effectiveness of MoDA against existing approaches in the domain adaptive image segmentation and domain adaptive video segmentation. Moreover, MoDA is versatile and can be used in conjunction with existing state-of-the-art approaches to further improve performance.
https://arxiv.org/abs/2309.11711
As 3D human pose estimation can now be achieved with very high accuracy in the supervised learning scenario, tackling the case where 3D pose annotations are not available has received increasing attention. In particular, several methods have proposed to learn image representations in a self-supervised fashion so as to disentangle the appearance information from the pose one. The methods then only need a small amount of supervised data to train a pose regressor using the pose-related latent vector as input, as it should be free of appearance information. In this paper, we carry out an in-depth analysis to understand to what degree the state-of-the-art disentangled representation learning methods truly separate the appearance information from the pose one. First, we study disentanglement from the perspective of the self-supervised network, via diverse image synthesis experiments. Second, we investigate disentanglement with respect to the 3D pose regressor from an adversarial attack perspective. Specifically, we design an adversarial strategy focusing on generating natural appearance changes of the subject, against which we could expect a disentangled network to be robust. Altogether, our analyses show that disentanglement in the three state-of-the-art disentangled representation learning frameworks is far from complete, and that their pose codes contain significant appearance information. We believe that our approach provides a valuable testbed to evaluate the degree of disentanglement of pose from appearance in self-supervised 3D human pose estimation.
https://arxiv.org/abs/2309.11667
The prevalence of ubiquitous location-aware devices and mobile Internet enables us to collect massive individual-level trajectory dataset from users. Such trajectory big data bring new opportunities to human mobility research but also raise public concerns with regard to location privacy. In this work, we present the Conditional Adversarial Trajectory Synthesis (CATS), a deep-learning-based GeoAI methodological framework for privacy-preserving trajectory data generation and publication. CATS applies K-anonymity to the underlying spatiotemporal distributions of human movements, which provides a distributional-level strong privacy guarantee. By leveraging conditional adversarial training on K-anonymized human mobility matrices, trajectory global context learning using the attention-based mechanism, and recurrent bipartite graph matching of adjacent trajectory points, CATS is able to reconstruct trajectory topology from conditionally sampled locations and generate high-quality individual-level synthetic trajectory data, which can serve as supplements or alternatives to raw data for privacy-preserving trajectory data publication. The experiment results on over 90k GPS trajectories show that our method has a better performance in privacy preservation, spatiotemporal characteristic preservation, and downstream utility compared with baseline methods, which brings new insights into privacy-preserving human mobility research using generative AI techniques and explores data ethics issues in GIScience.
https://arxiv.org/abs/2309.11587
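The distributional K-anonymity idea above can be illustrated by suppressing spatiotemporal bins visited by fewer than K distinct users, so no published bin can be traced to a small group. The grid-cell/hour binning below is a simplifying assumption for illustration, not the paper's exact scheme:

```python
from collections import defaultdict

def k_anonymize_cells(user_points, k=5):
    """Keep only (cell, hour) bins visited by at least k distinct users,
    giving a distributional-level anonymity guarantee on the mobility data."""
    visitors = defaultdict(set)
    for user, cell, hour in user_points:
        visitors[(cell, hour)].add(user)
    return {bin_ for bin_, users in visitors.items() if len(users) >= k}

pts = [("u1", (3, 4), 8), ("u2", (3, 4), 8), ("u3", (3, 4), 8),
       ("u1", (9, 9), 23)]
print(k_anonymize_cells(pts, k=2))  # {((3, 4), 8)} -- the lone visit is suppressed
```

The generative model is then conditioned only on such K-anonymized distributions when sampling synthetic trajectories.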
Text-conditioned image generation models have recently achieved astonishing image quality and alignment results. Consequently, they are employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also produce unsafe content. As a contribution to the Adversarial Nibbler challenge, we distill a large set of over 1,000 potential adversarial inputs from existing safety benchmarks. Our analysis of the gathered prompts and corresponding images demonstrates the fragility of input filters and provides further insights into systematic safety issues in current generative image models.
https://arxiv.org/abs/2309.11575
Automatic Speech Recognition systems have been shown to be vulnerable to adversarial attacks that manipulate the command executed on the device. Recent research has focused on exploring methods to create such attacks, however, some issues relating to Over-The-Air (OTA) attacks have not been properly addressed. In our work, we examine the needed properties of robust attacks compatible with the OTA model, and we design a method of generating attacks with arbitrary such desired properties, namely the invariance to synchronization, and the robustness to filtering: this allows a Denial-of-Service (DoS) attack against ASR systems. We achieve these characteristics by constructing attacks in a modified frequency domain through an inverse Fourier transform. We evaluate our method on standard keyword classification tasks and analyze it in OTA, and we analyze the properties of the cross-domain attacks to explain the efficiency of the approach.
https://arxiv.org/abs/2309.11462
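Constructing a waveform directly in a modified frequency domain, as the paper describes, can be sketched with an inverse real FFT: placing energy only in selected frequency bins yields a time-domain signal that filters passing that band leave intact (the bin choices here are illustrative, not the paper's):

```python
import numpy as np

def band_limited_signal(n, low_bin, high_bin, seed=0):
    """Build a length-n time-domain signal by placing random complex energy
    only in frequency bins [low_bin, high_bin) and inverting with irfft."""
    rng = np.random.default_rng(seed)
    spec = np.zeros(n // 2 + 1, dtype=complex)
    idx = np.arange(low_bin, high_bin)
    spec[idx] = rng.standard_normal(idx.size) + 1j * rng.standard_normal(idx.size)
    return np.fft.irfft(spec, n)

x = band_limited_signal(256, low_bin=10, high_bin=40)
spec = np.fft.rfft(x)
# All energy sits in bins 10..39, so a filter keeping that band leaves x unchanged.
print(np.abs(spec[:10]).max(), np.abs(spec[10:40]).max())
```

Circularly shifting such a signal only rotates the phases of its spectrum, leaving the magnitudes untouched, which hints at the synchronization-invariance property the authors target.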