The rapid development of AR/VR brings tremendous demand for 3D content. While the widely used Computer-Aided Design (CAD) method requires a time-consuming and labor-intensive modeling process, sketch-based 3D modeling offers a potential solution as a natural form of human-computer interaction. However, the sparsity and ambiguity of sketches make it challenging to generate high-fidelity content reflecting creators' ideas. Precise drawing from multiple views or strategic step-by-step drawing is often required to tackle the challenge but is not friendly to novice users. In this work, we introduce a novel end-to-end approach, Deep3DSketch+, which performs 3D modeling from only a single free-hand sketch without requiring multiple sketches or view information. Specifically, we introduce a lightweight generation network for efficient real-time inference and a structure-aware adversarial training approach with a Stroke Enhancement Module (SEM) that captures structural information, facilitating the learning of realistic, finely detailed shape structures for high-fidelity performance. Extensive experiments demonstrate the effectiveness of our approach, with state-of-the-art (SOTA) performance on both synthetic and real datasets.
The reconstruction kernel in computed tomography (CT) determines the texture of the image. Consistency in reconstruction kernels is important, as the underlying CT texture can impact measurements during quantitative image analysis. Harmonization (i.e., kernel conversion) minimizes differences in measurements caused by inconsistent reconstruction kernels. Existing methods investigate harmonization of CT scans within a single manufacturer or across multiple manufacturers. However, these methods require paired scans of hard and soft reconstruction kernels that are spatially and anatomically aligned. Additionally, a large number of models must be trained across the different kernel pairs within each manufacturer. In this study, we adopt an unpaired image-translation approach to investigate harmonization between and across reconstruction kernels from different manufacturers by constructing a multipath cycle generative adversarial network (GAN). We use hard and soft reconstruction kernels from the Siemens and GE vendors in the National Lung Screening Trial dataset, training the multipath cycle GAN on 50 scans from each reconstruction kernel. To evaluate the effect of harmonization, we harmonize 50 scans each from the Siemens hard kernel, GE soft kernel, and GE hard kernel to a reference Siemens soft kernel (B30f) and evaluate percent emphysema. We fit a linear model with age, smoking status, sex, and vendor as covariates and perform an analysis of variance (ANOVA) on the emphysema scores. Our approach minimizes differences in emphysema measurement and highlights the impact of age, sex, smoking status, and vendor on emphysema quantification.
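Percent emphysema in studies like this is commonly quantified as LAA-950, the share of lung voxels below -950 Hounsfield units; a minimal sketch of that measure (the exact thresholding and lung-masking pipeline used in the study is an assumption):

```python
import numpy as np

def percent_emphysema(lung_hu: np.ndarray, threshold: float = -950.0) -> float:
    """Percent emphysema (LAA-950): share of lung voxels below the HU threshold."""
    lung_hu = np.asarray(lung_hu, dtype=float)
    return 100.0 * np.mean(lung_hu < threshold)

# toy "lung" with 2 of 8 voxels below -950 HU -> 25%
toy = np.array([-980.0, -960.0, -940.0, -900.0, -850.0, -800.0, -700.0, -600.0])
print(percent_emphysema(toy))  # → 25.0
```

Harmonization succeeds when this score computed on converted scans matches the score on the reference-kernel scans.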
Keyword spotting (KWS) refers to the task of identifying a set of predefined words in audio streams. With recent advances in deep neural networks, it has become a popular technology for activating and controlling small devices, such as voice assistants. Relying on such models on edge devices, however, can be challenging due to hardware constraints. Moreover, as adversarial attacks against voice-based technologies have increased, developing solutions robust to such attacks has become crucial. In this work, we propose VIC-KD, a robust distillation recipe for model compression and adversarial robustness. Using self-supervised speech representations, we show that imposing geometric priors on the latent representations of both Teacher and Student models leads to more robust target models. Experiments on the Google Speech Commands datasets show that the proposed methodology improves upon current state-of-the-art robust distillation methods, such as ARD and RSLAD, by 12% and 8% in robust accuracy, respectively.
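The abstract does not spell out how the geometric priors are imposed; if "VIC" follows the variance-invariance-covariance pattern popularized by VICReg, the regularizer on teacher and student embeddings might look roughly like this sketch (the term weighting and the exact formulation in VIC-KD are assumptions):

```python
import numpy as np

def vic_terms(z_s: np.ndarray, z_t: np.ndarray, eps: float = 1e-4):
    """VICReg-style terms between student (z_s) and teacher (z_t) embeddings.
    Shapes: (batch, dim)."""
    # invariance: mean squared distance between paired embeddings
    inv = np.mean((z_s - z_t) ** 2)
    # variance: hinge pushing each student dimension's std above 1
    std = np.sqrt(z_s.var(axis=0) + eps)
    var = np.mean(np.maximum(0.0, 1.0 - std))
    # covariance: off-diagonal energy of the student covariance matrix
    zc = z_s - z_s.mean(axis=0)
    cov = (zc.T @ zc) / (len(z_s) - 1)
    off = (cov ** 2).sum() - (np.diag(cov) ** 2).sum()
    cov_term = off / z_s.shape[1]
    return inv, var, cov_term

rng = np.random.default_rng(0)
z_t = rng.normal(size=(64, 16))
inv, var, cov_term = vic_terms(z_t, z_t)
print(inv)  # → 0.0 (identical embeddings give zero invariance loss)
```

In a distillation setup the weighted sum of these terms would be added to the usual task and distillation losses.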
Few-shot learning has made impressive strides in addressing the crucial challenges of recognizing unknown samples from novel classes in target query sets and managing visual shifts between domains. However, existing techniques fall short when it comes to identifying target outliers under domain shift by learning to reject pseudo-outliers from the source domain, resulting in an incomplete solution to both problems. To address these challenges comprehensively, we propose a novel approach called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) and introduce a meta-learning-based architecture named DAFOS-NET. During training, our model learns a shared and discriminative embedding space while creating a pseudo open-space decision boundary, given a fully supervised source domain and a label-disjoint few-shot target domain. To enhance data density, we use a pair of conditional adversarial networks with tunable noise variances to augment both domains' closed and pseudo-open spaces. Furthermore, we propose a domain-specific, batch-normalized class-prototype alignment strategy to align both domains globally while ensuring class discriminativeness through novel metric objectives. Our training approach ensures that DAFOS-NET can generalize well to new scenarios in the target domain. We present three benchmarks for DA-FSOS based on the Office-Home, mini-ImageNet/CUB, and DomainNet datasets and demonstrate the efficacy of DAFOS-NET through extensive experimentation.
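DAFOS-NET's batch-normalized prototype alignment is not detailed in the abstract, but strategies of this kind build on per-class prototypes in the embedding space; a generic sketch of prototype computation and nearest-prototype classification (the feature values are illustrative only):

```python
import numpy as np

def class_prototypes(feats, labels, n_classes):
    """Mean embedding per class (the prototypes used by prototypical few-shot heads)."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])

def nearest_prototype(query, protos):
    """Predict the class whose prototype is closest in Euclidean distance."""
    d = np.linalg.norm(query[:, None, :] - protos[None, :, :], axis=-1)
    return d.argmin(axis=1)

feats = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [0.8, 1.0]])
labels = np.array([0, 0, 1, 1])
protos = class_prototypes(feats, labels, 2)
preds = nearest_prototype(np.array([[0.1, 0.1], [0.9, 0.9]]), protos)
print(preds)  # → [0 1]
```

Alignment strategies then pull the source- and target-domain prototypes of the same class toward each other while keeping different classes separated.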
Robot multimodal locomotion encompasses the ability to transition between walking and flying, representing a significant challenge in robotics. This work presents an approach that enables automatic smooth transitions between legged and aerial locomotion. Leveraging the concept of Adversarial Motion Priors, our method allows the robot to imitate motion datasets and accomplish the desired task without the need for complex reward functions. The robot learns walking patterns from human-like gaits and aerial locomotion patterns from motions obtained using trajectory optimization. Through this process, the robot adapts the locomotion scheme based on environmental feedback using reinforcement learning, with the spontaneous emergence of mode-switching behavior. The results highlight the potential for achieving multimodal locomotion in aerial humanoid robotics through automatic control of walking and flying modes, paving the way for applications in diverse domains such as search and rescue, surveillance, and exploration missions. This research contributes to advancing the capabilities of aerial humanoid robots in terms of versatile locomotion in various environments.
Barren plateaus are a central bottleneck in the scalability of variational quantum algorithms (VQAs), and are known to arise in various ways, from circuit depth and hardware noise to global observables. However, a caveat of most existing results is the requirement of t-design circuit assumptions that are typically not satisfied in practice. In this work, we loosen these assumptions altogether and derive tight upper and lower bounds on gradient concentration, for a large class of parameterized quantum circuits and arbitrary observables. By requiring only a couple of design choices that are constructive and easily verified, our results can readily be leveraged to rule out barren plateaus for explicit circuits and mixed observables, namely, observables containing a non-vanishing local term. This insight has direct implications for hybrid Quantum Generative Adversarial Networks (qGANs), a generative model that can be reformulated as a VQA with an observable composed of local and global terms. We prove that designing the discriminator appropriately leads to 1-local weights that stay constant in the number of qubits, regardless of discriminator depth. Combined with our first contribution, this implies that qGANs with shallow generators can be trained at scale without suffering from barren plateaus -- making them a promising candidate for applications in generative quantum machine learning. We demonstrate this result by training a qGAN to learn a 2D mixture of Gaussian distributions with up to 16 qubits, and provide numerical evidence that global contributions to the gradient, while initially exponentially small, may kick in substantially over the course of training.
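For context, a barren plateau is usually defined through exponential concentration of the cost gradient; the bounds derived in the paper are bounds on exactly this kind of variance. In standard notation (a common textbook statement, not the paper's exact formulation):

```latex
\operatorname{Var}_{\boldsymbol{\theta}}\!\left[\partial_{\theta_k} C(\boldsymbol{\theta})\right] \le F(n),
\qquad F(n) \in \mathcal{O}\!\left(b^{-n}\right) \ \text{for some } b > 1,
```

where $C$ is the cost function of the VQA and $n$ the number of qubits; gradients then concentrate exponentially around zero, making gradient-based training infeasible at scale.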
As Machine Learning (ML) is increasingly used to solve various tasks in real-world applications, it is crucial to ensure that ML algorithms are designed to be robust to worst-case noise, adversarial attacks, and highly unusual situations. Studying ML robustness will significantly help in the design of ML algorithms. In this paper, we investigate ML robustness using adversarial training in centralized and decentralized environments, where ML training and testing are conducted on one or multiple computers. In the centralized environment, we achieve test accuracies of 65.41% and 83.0% when classifying adversarial examples generated by the Fast Gradient Sign Method (FGSM) and DeepFool, respectively. Compared to existing studies, these results demonstrate improvements of 18.41% for FGSM and 47% for DeepFool. In the decentralized environment, we study Federated Learning (FL) robustness using adversarial training with independent and identically distributed (IID) and non-IID data, respectively, with CIFAR-10 as the dataset. In the IID case, our experimental results show a robust accuracy comparable to that obtained in the centralized environment. In the non-IID case, the natural accuracy drops from 66.23% to 57.82%, and the robust accuracy decreases by 25% and 23.4% under C&W and Projected Gradient Descent (PGD) attacks, respectively, compared to the IID case. We further propose an IID data-sharing approach that raises the natural accuracy to 85.04% and the robust accuracy from 57% to 72% under C&W attacks and from 59% to 67% under PGD attacks.
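FGSM, one of the attacks evaluated above, perturbs an input by one step along the sign of the loss gradient; a minimal sketch with an analytic gradient for a toy linear loss (in practice the gradient is obtained by backpropagation through the trained model):

```python
import numpy as np

def fgsm(x, grad, eps):
    """Fast Gradient Sign Method: one-step perturbation along the sign of the
    loss gradient, clipped back to the valid input range [0, 1]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

# toy linear "model": loss = w . x, so the input gradient is just w
x = np.array([0.5, 0.5, 0.5])
w = np.array([1.0, -2.0, 0.0])
print(fgsm(x, w, eps=0.1))  # → [0.6 0.4 0.5]
```

Adversarial training then mixes such perturbed examples into each training batch.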
In surveillance, license plate recognition is often hindered by low image quality and small plate dimensions, compromising recognition precision. Despite advances in AI-based image super-resolution, methods such as Convolutional Neural Networks (CNNs) and Generative Adversarial Networks (GANs) still fall short in enhancing license plate images. This study leverages the cutting-edge diffusion model, which has consistently outperformed other deep learning techniques in image restoration. By training this model on a curated dataset of Saudi license plates at both low and high resolutions, we demonstrate the diffusion model's superior efficacy. The method achieves 12.55% and 37.32% improvements in Peak Signal-to-Noise Ratio (PSNR) over SwinIR and ESRGAN, respectively. Moreover, our method surpasses these techniques in Structural Similarity Index (SSIM), registering 4.89% and 17.66% improvements over SwinIR and ESRGAN, respectively. Furthermore, 92% of human evaluators preferred our images over those from the other algorithms. In essence, this research presents a pioneering solution for license plate super-resolution, with tangible potential for surveillance systems.
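The PSNR figures above compare reconstructions against ground truth; PSNR itself is a simple function of the mean squared error. A minimal sketch:

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """Peak Signal-to-Noise Ratio in dB between a reference and a test image."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((4, 4))
test = ref + 10.0                 # constant error of 10: MSE = 100
print(round(psnr(ref, test), 2))  # → 28.13
```

Higher is better; SSIM complements PSNR by also weighing local structure rather than pixel error alone.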
Instruction-tuned Large Language Models (It-LLMs) have exhibited outstanding abilities to reason about the cognitive states, intentions, and reactions of all people involved, letting humans guide and comprehend day-to-day social interactions effectively. Indeed, several multiple-choice question (MCQ) benchmarks have been proposed to construct solid assessments of the models' abilities. However, earlier works demonstrate the presence of an inherent "order bias" in It-LLMs, posing challenges to appropriate evaluation. In this paper, we investigate It-LLMs' resilience through a series of probing tests on four MCQ benchmarks. Introducing adversarial examples, we show a significant performance gap, mainly when the order of the choices is varied, which reveals a selection bias and calls the models' reasoning abilities into question. Observing a correlation between first positions and model choices due to positional bias, we hypothesize the presence of structural heuristics in the decision-making process of It-LLMs, strengthened by the inclusion of significant examples in few-shot scenarios. Finally, using the Chain-of-Thought (CoT) technique, we elicit the model to reason, mitigating the bias and yielding more robust models.
Prompt Tuning is a popular parameter-efficient finetuning method for pre-trained large language models (PLMs). Recently, based on experiments with RoBERTa, it has been suggested that Prompt Tuning activates specific neurons in the transformer's feed-forward networks that are highly predictive and selective for the given task. In this paper, we study the robustness of Prompt Tuning in relation to these "skill neurons", using RoBERTa and T5. We show that prompts tuned for a specific task are transferable to tasks of the same type but are not very robust to adversarial data, with higher robustness for T5 than RoBERTa. At the same time, we replicate the existence of skill neurons in RoBERTa and further show that skill neurons also seem to exist in T5. Interestingly, the skill neurons of T5 determined on non-adversarial data are also among the most predictive neurons on the adversarial data, which is not the case for RoBERTa. We conclude that higher adversarial robustness may be related to a model's ability to activate the relevant skill neurons on adversarial data.
Biomedical image datasets can be imbalanced due to the rarity of targeted diseases. Generative Adversarial Networks play a key role in addressing this imbalance by enabling the generation of synthetic images to augment datasets. It is important to generate synthetic images that incorporate a diverse range of features to accurately represent the distribution of features present in the training imagery. Furthermore, the absence of diverse features in synthetic images can degrade the performance of machine learning classifiers. The mode collapse problem impacts Generative Adversarial Networks' capacity to generate diversified images. Mode collapse comes in two varieties: intra-class and inter-class. In this paper, both varieties of the mode collapse problem are investigated, and their subsequent impact on the diversity of synthetic X-ray images is evaluated. This work contributes an empirical demonstration of the benefits of integrating the adaptive input-image normalization with the Deep Convolutional GAN and Auxiliary Classifier GAN to alleviate the mode collapse problems. Synthetically generated images are utilized for data augmentation and training a Vision Transformer model. The classification performance of the model is evaluated using accuracy, recall, and precision scores. Results demonstrate that the DCGAN and the ACGAN with adaptive input-image normalization outperform the DCGAN and ACGAN with un-normalized X-ray images as evidenced by the superior diversity scores and classification scores.
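The paper's adaptive input-image normalization is not specified in the abstract; per-image min-max scaling is the simplest input-normalization baseline that such a method adapts, sketched here:

```python
import numpy as np

def normalize_image(img, eps=1e-8):
    """Per-image min-max normalization to [0, 1] (a common input-normalization
    choice; the paper's 'adaptive' variant is not specified here)."""
    img = img.astype(float)
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + eps)

x = np.array([[0.0, 128.0], [255.0, 64.0]])
y = normalize_image(x)
print(y.min(), round(y.max(), 6))  # → 0.0 1.0
```

Bringing every X-ray into a common intensity range before it reaches the discriminator removes one spurious axis along which a GAN can collapse.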
Text-guided image generation aims to generate desired images conditioned on given texts, while text-guided image manipulation refers to semantically editing parts of a given image based on specified texts. For these two similar tasks, the key is to ensure image fidelity as well as semantic consistency. Many previous approaches require complex multi-stage generation and adversarial training, and struggle to provide a unified framework for both tasks. In this work, we propose TextCLIP, a unified framework for text-guided image generation and manipulation without adversarial training. The proposed method accepts input from images or random noise, corresponding to the two tasks, and, conditioned on the specified texts, a carefully designed mapping network that exploits the powerful generative capabilities of StyleGAN and the text-image representation capabilities of Contrastive Language-Image Pre-training (CLIP) generates images at up to $1024\times1024$ resolution, the highest currently achievable for these tasks. Extensive experiments on the Multi-modal CelebA-HQ dataset demonstrate that our proposed method outperforms existing state-of-the-art methods on both text-guided generation and manipulation tasks.
Safety controllers are widely used to achieve safe reinforcement learning. Most methods that apply a safety controller construct it from handcrafted safety constraints. However, when the environment dynamics are sophisticated, handcrafted safety constraints become unavailable, so it is worth researching the construction of safety controllers by learning algorithms. We propose a three-stage architecture for safe reinforcement learning, namely the TU-Recovery Architecture. A safety critic and a recovery policy are learned before task training; together they form a safety controller that ensures safety during task training. We then describe a phenomenon induced by disagreement between the task policy and the recovery policy, called the adversarial phenomenon, which reduces learning efficiency and model performance. An auxiliary reward is proposed to mitigate the adversarial phenomenon while helping the task policy learn to recover from high-risk states. A series of experiments is conducted in a robot navigation environment. They demonstrate that TU-Recovery outperforms its unconstrained counterpart in both reward gained and constraint violations during task training, and that the auxiliary reward further improves TU-Recovery's reward-to-cost ratio by significantly reducing constraint violations.
We present a novel adversarial model for authentication systems that use gait patterns recorded by the inertial measurement unit (IMU) built into smartphones. The attack idea is inspired by and named after the concept of a dictionary attack on knowledge (PIN or password) based authentication systems. In particular, this work investigates whether it is possible to build a dictionary of IMUGait patterns and use it to launch an attack or find an imitator who can actively reproduce IMUGait patterns that match the target's IMUGait pattern. Nine physically and demographically diverse individuals walked at various levels of four predefined controllable and adaptable gait factors (speed, step length, step width, and thigh-lift), producing 178 unique IMUGait patterns. Each pattern attacked a wide variety of user authentication models. The deeper analysis of error rates (before and after the attack) challenges the belief that authentication systems based on IMUGait patterns are the most difficult to spoof; further research is needed on adversarial models and associated countermeasures.
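Error rates for authentication systems of this kind are typically reported as false accept and false reject rates over genuine and impostor score distributions; a minimal sketch with hypothetical similarity scores (the single high impostor score models a successful imitation attempt):

```python
import numpy as np

def far_frr(genuine, impostor, thr):
    """False accept rate (impostors above threshold) and false reject rate
    (genuine users at/below threshold) for a similarity-score authenticator."""
    far = float(np.mean(np.asarray(impostor) > thr))
    frr = float(np.mean(np.asarray(genuine) <= thr))
    return far, frr

genuine = [0.9, 0.8, 0.85, 0.95]
impostor = [0.2, 0.4, 0.3, 0.86]   # one dictionary/imitation attack succeeds
far, frr = far_frr(genuine, impostor, thr=0.5)
print(far, frr)  # → 0.25 0.0
```

Comparing these rates before and after injecting attacker-generated IMUGait patterns into the impostor set is what quantifies the attack's impact.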
Multimodal Large Language Models (MLLMs) that integrate text and other modalities (especially vision) have achieved unprecedented performance in various multimodal tasks. However, because the adversarial robustness problem of vision models remains unsolved, introducing vision inputs can expose MLLMs to more severe safety and security risks. In this work, we study the adversarial robustness of Google's Bard, a chatbot competitive with ChatGPT that recently released its multimodal capability, to better understand the vulnerabilities of commercial MLLMs. By attacking white-box surrogate vision encoders or MLLMs, the generated adversarial examples can mislead Bard into outputting wrong image descriptions with a 22% success rate based solely on transferability. We show that the adversarial examples can also attack other MLLMs, e.g., with a 26% attack success rate against Bing Chat and an 86% attack success rate against ERNIE Bot. Moreover, we identify two defense mechanisms of Bard, including face detection and toxicity detection of images. We design corresponding attacks to evade these defenses, demonstrating that the current defenses of Bard are also vulnerable. We hope this work can deepen our understanding of the robustness of MLLMs and facilitate future research on defenses. Our code is available at this https URL.
Unsupervised domain adaptation (UDA) is an effective approach to handle the lack of annotations in the target domain for the semantic segmentation task. In this work, we consider a more practical UDA setting where the target domain contains sequential frames of the unlabeled videos which are easy to collect in practice. A recent study suggests self-supervised learning of the object motion from unlabeled videos with geometric constraints. We design a motion-guided domain adaptive semantic segmentation framework (MoDA), that utilizes self-supervised object motion to learn effective representations in the target domain. MoDA differs from previous methods that use temporal consistency regularization for the target domain frames. Instead, MoDA deals separately with the domain alignment on the foreground and background categories using different strategies. Specifically, MoDA contains foreground object discovery and foreground semantic mining to align the foreground domain gaps by taking the instance-level guidance from the object motion. Additionally, MoDA includes background adversarial training which contains a background category-specific discriminator to handle the background domain gaps. Experimental results on multiple benchmarks highlight the effectiveness of MoDA against existing approaches in the domain adaptive image segmentation and domain adaptive video segmentation. Moreover, MoDA is versatile and can be used in conjunction with existing state-of-the-art approaches to further improve performance.
As 3D human pose estimation can now be achieved with very high accuracy in the supervised learning scenario, tackling the case where 3D pose annotations are not available has received increasing attention. In particular, several methods have proposed to learn image representations in a self-supervised fashion so as to disentangle the appearance information from the pose information. The methods then only need a small amount of supervised data to train a pose regressor using the pose-related latent vector as input, as it should be free of appearance information. In this paper, we carry out in-depth analysis to understand to what degree the state-of-the-art disentangled representation learning methods truly separate the appearance information from the pose information. First, we study disentanglement from the perspective of the self-supervised network, via diverse image synthesis experiments. Second, we investigate disentanglement with respect to the 3D pose regressor from an adversarial attack perspective. Specifically, we design an adversarial strategy focused on generating natural appearance changes of the subject, against which a disentangled network should be robust. Altogether, our analyses show that disentanglement in the three state-of-the-art disentangled representation learning frameworks is far from complete, and that their pose codes contain significant appearance information. We believe that our approach provides a valuable testbed to evaluate the degree of disentanglement of pose from appearance in self-supervised 3D human pose estimation.
The prevalence of ubiquitous location-aware devices and mobile Internet enables us to collect massive individual-level trajectory dataset from users. Such trajectory big data bring new opportunities to human mobility research but also raise public concerns with regard to location privacy. In this work, we present the Conditional Adversarial Trajectory Synthesis (CATS), a deep-learning-based GeoAI methodological framework for privacy-preserving trajectory data generation and publication. CATS applies K-anonymity to the underlying spatiotemporal distributions of human movements, which provides a distributional-level strong privacy guarantee. By leveraging conditional adversarial training on K-anonymized human mobility matrices, trajectory global context learning using the attention-based mechanism, and recurrent bipartite graph matching of adjacent trajectory points, CATS is able to reconstruct trajectory topology from conditionally sampled locations and generate high-quality individual-level synthetic trajectory data, which can serve as supplements or alternatives to raw data for privacy-preserving trajectory data publication. The experiment results on over 90k GPS trajectories show that our method has a better performance in privacy preservation, spatiotemporal characteristic preservation, and downstream utility compared with baseline methods, which brings new insights into privacy-preserving human mobility research using generative AI techniques and explores data ethics issues in GIScience.
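K-anonymity on spatiotemporal distributions can be illustrated by suppressing aggregate cells visited by fewer than K individuals; a toy sketch (the actual mechanism in CATS operates on K-anonymized human mobility matrices and may differ in detail):

```python
import numpy as np

def k_anonymize_counts(counts, k):
    """Suppress (region, time) cells with fewer than k visitors, one common way
    to enforce a distribution-level K-anonymity guarantee on mobility data."""
    out = counts.copy()
    out[out < k] = 0
    return out

counts = np.array([[12, 3], [0, 7]])   # visits per (region, time) cell
print(k_anonymize_counts(counts, k=5))
# cells with < 5 visitors are zeroed: [[12 0] [0 7]]
```

Synthetic trajectories are then sampled conditionally from the anonymized distribution, so no individual contributes an identifiable cell.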
Text-conditioned image generation models have recently achieved astonishing image quality and alignment results. Consequently, they are employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the web, they also produce unsafe content. As a contribution to the Adversarial Nibbler challenge, we distill a large set of over 1,000 potential adversarial inputs from existing safety benchmarks. Our analysis of the gathered prompts and corresponding images demonstrates the fragility of input filters and provides further insights into systematic safety issues in current generative image models.
Automatic Speech Recognition (ASR) systems have been shown to be vulnerable to adversarial attacks that manipulate the command executed on the device. Recent research has focused on methods to create such attacks; however, some issues relating to Over-The-Air (OTA) attacks have not been properly addressed. In our work, we examine the properties required of robust attacks compatible with the OTA model, and we design a method of generating attacks with arbitrary such desired properties, namely invariance to synchronization and robustness to filtering: this enables a Denial-of-Service (DoS) attack against ASR systems. We achieve these characteristics by constructing attacks in a modified frequency domain through an inverse Fourier transform. We evaluate our method on standard keyword classification tasks, analyze it in the OTA setting, and analyze the properties of the cross-domain attacks to explain the efficiency of the approach.
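The generic spectrum-to-waveform step underlying such frequency-domain constructions can be sketched with an inverse real FFT; placing energy only in chosen bins makes the resulting waveform's power spectrum predictable under filtering (this is only the generic building block, not the paper's actual attack):

```python
import numpy as np

def attack_from_spectrum(magnitudes, n):
    """Build a time-domain signal from a designed magnitude spectrum via the
    inverse real FFT (zero phase for simplicity). The signal's power spectrum
    is then fixed by construction, so its behavior under band-pass filtering
    and time shifts is predictable."""
    spectrum = magnitudes.astype(complex)
    return np.fft.irfft(spectrum, n=n)

n = 64
mags = np.zeros(n // 2 + 1)
mags[4] = 1.0                        # all energy in one frequency bin
x = attack_from_spectrum(mags, n)
# round-trip check: the synthesized signal's spectrum matches the design
print(round(np.abs(np.fft.rfft(x))[4], 6))  # → 1.0
```

A real OTA attack would additionally shape the spectrum to survive the room's transfer function and the device's microphone response.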