Recent advancements in text-to-image (T2I) generation have witnessed a shift from adapting text to fixed backgrounds to creating images around text. Traditional approaches are often limited to generating layouts within static images for effective text placement. Our proposed approach, TextCenGen, introduces dynamic adaptation of blank regions for text-friendly image generation, emphasizing text-centric design and visual harmony. Our method employs force-directed attention guidance in T2I models to generate images that strategically reserve whitespace for pre-defined text areas, even for text or icons placed at golden-ratio positions. Observing how cross-attention maps affect object placement, we detect and repel conflicting objects using a force-directed graph approach, combined with a Spatial Excluding Cross-Attention Constraint that keeps attention smooth in whitespace areas. On this novel graphic-design task, experiments indicate that TextCenGen outperforms existing methods, yielding more harmonious compositions. Furthermore, our method significantly improves T2I model outputs on our specially collected prompt datasets covering varied text positions. These results demonstrate the efficacy of TextCenGen in creating more harmonious and integrated text-image compositions.
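A minimal sketch of the force-directed idea, assuming a toy 2D layout: each conflicting object (e.g., a cross-attention centroid) is pushed directly away from the center of the reserved text region by an inverse-square repulsion term, a standard force-directed-graph update. The function name, coordinates, and `strength` constant are illustrative, not from the paper.

```python
import math

def repel_from_text_box(obj_center, box_center, strength=0.01):
    """Push an object's centroid directly away from the reserved
    text region's center; the force decays with squared distance,
    as in a force-directed graph layout."""
    dx = obj_center[0] - box_center[0]
    dy = obj_center[1] - box_center[1]
    dist = math.hypot(dx, dy) or 1e-6          # avoid division by zero
    force = strength / (dist * dist)
    return (obj_center[0] + force * dx / dist,
            obj_center[1] + force * dy / dist)

# An object sitting to the right of the text box is pushed further right,
# freeing the reserved whitespace.
new_c = repel_from_text_box((0.6, 0.5), (0.5, 0.5))
```

In the actual method this displacement would steer attention maps during sampling rather than move objects directly; the sketch only captures the repulsion geometry.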
https://arxiv.org/abs/2404.11824
Quantization lowers memory usage, computational requirements, and latency by using fewer bits to represent model weights and activations. In this work, we investigate the generalization properties of quantized neural networks, a characteristic that has received little attention despite its implications for model performance. First, we develop a theoretical model for quantization in neural networks and demonstrate how quantization functions as a form of regularization. Second, motivated by recent work connecting the sharpness of the loss landscape and generalization, we derive an approximate bound for the generalization of quantized models conditioned on the amount of quantization noise. We then validate our hypothesis by experimenting with over 2000 convolutional and transformer-based models trained on the CIFAR-10, CIFAR-100, and ImageNet datasets.
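As a hedged illustration of the setup (not the paper's exact quantizer), a uniform symmetric quantizer maps each weight to the nearest representable level, and the resulting round-off behaves as bounded additive noise — the quantity the derived generalization bound is conditioned on:

```python
def quantize(w, num_bits=8, w_max=1.0):
    """Uniform symmetric quantizer: snap a weight to the nearest of
    2**num_bits evenly spaced levels on [-w_max, w_max]."""
    step = 2 * w_max / (2 ** num_bits - 1)
    return step * round(w / step)

weights = [0.137, -0.554, 0.999, -0.021]
# Round-off acts as additive noise bounded by half a quantization step;
# fewer bits -> larger step -> stronger noise (regularization).
noise = [abs(w - quantize(w, num_bits=4)) for w in weights]
step4 = 2 * 1.0 / (2 ** 4 - 1)
```

With 4 bits every entry of `noise` stays below `step4 / 2`, which is the per-weight noise magnitude a bound of this kind would take as input.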
https://arxiv.org/abs/2404.11769
The emergence of attention-based transformer models has led to their extensive use in various tasks, due to their superior generalization and transfer properties. Recent research has demonstrated that such models, when prompted appropriately, are excellent for few-shot inference. However, such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work, we examine the effectiveness of prompting a transformer decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is not only to achieve strong performance on novel categories with limited examples, but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions. Additionally, we introduce a unidirectional causal attention mechanism between the novel prompts, learned with limited examples, and the base prompts, learned with abundant data. This mechanism enriches the novel prompts without deteriorating the base-class performance. Overall, this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets, COCO-$20^i$ and Pascal-$5^i$, without the need for test-time optimization (or transduction). Furthermore, test-time optimization leveraging unlabelled test data can be used to improve the prompts, which we refer to as transductive prompt tuning.
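The unidirectional causal attention can be pictured as a block-structured mask over the concatenated [base | novel] prompt tokens. This sketch (the function name and the 1 = allowed / 0 = blocked convention are ours, not the paper's) blocks base-to-novel attention while letting novel prompts read the base ones:

```python
def unidirectional_mask(n_base, n_novel):
    """Attention mask over [base | novel] prompt tokens: novel prompts
    may attend to base prompts (enriching themselves with abundant-data
    knowledge), but base prompts never attend to novel ones, so
    base-class behavior is left untouched."""
    n = n_base + n_novel
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_base:
                mask[i][j] = 1 if j < n_base else 0   # base -> base only
            else:
                mask[i][j] = 1                         # novel -> everything
    return mask

m = unidirectional_mask(n_base=2, n_novel=1)
```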
https://arxiv.org/abs/2404.11732
Endotracheal intubation (ETI) is an emergency procedure performed in civilian and combat casualty care settings to establish an airway. Objective and automated assessment of ETI skills is essential for the training and certification of healthcare providers. However, the current approach is based on manual feedback by an expert, which is subjective, time- and resource-intensive, and prone to poor inter-rater reliability and halo effects. This work proposes a framework to evaluate ETI skills using single- and multi-view videos. The framework consists of two stages. First, a 2D convolutional autoencoder (AE) and a pre-trained self-supervision network extract features from videos. Second, a 1D convolutional network enhanced with a cross-view attention module takes the features from the AE as input and outputs predictions for skill evaluation. The ETI datasets were collected in two phases. In the first phase, ETI is performed by two subject cohorts: Experts and Novices. In the second phase, novice subjects perform ETI under time pressure, and the outcome is either Successful or Unsuccessful. A third dataset of videos from a single head-mounted camera for Experts and Novices is also analyzed. The study achieved an accuracy of 100% in identifying Expert/Novice trials in the initial phase. In the second phase, the model showed 85% accuracy in classifying Successful/Unsuccessful procedures. Using head-mounted cameras alone, the model showed 96% accuracy on Expert/Novice classification while maintaining 85% accuracy in classifying successful and unsuccessful procedures. In addition, GradCAMs are presented to explain the differences between Expert and Novice behavior and between Successful and Unsuccessful trials. The approach offers a reliable and objective method for automated assessment of ETI skills.
https://arxiv.org/abs/2404.11727
We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts mechanism utilized in large language models (LLMs), MoA distributes the generation workload between two attention pathways: a personalized branch and a non-personalized prior branch. MoA is designed to retain the original model's prior by fixing its attention layers in the prior branch, while minimally intervening in the generation process with the personalized branch that learns to embed subjects in the layout and context generated by the prior branch. A novel routing mechanism manages the distribution of pixels in each layer across these branches to optimize the blend of personalized and generic content creation. Once trained, MoA facilitates the creation of high-quality, personalized images featuring multiple subjects with compositions and interactions as diverse as those generated by the original model. Crucially, MoA enhances the distinction between the model's pre-existing capability and the newly augmented personalized intervention, thereby offering a more disentangled subject-context control that was previously unattainable. Project page: this https URL
https://arxiv.org/abs/2404.11565
How does the internal computation of a machine learning model transform inputs into predictions? In this paper, we introduce a task called component modeling that aims to address this question. The goal of component modeling is to decompose an ML model's prediction in terms of its components -- simple functions (e.g., convolution filters, attention heads) that are the "building blocks" of model computation. We focus on a special case of this task, component attribution, where the goal is to estimate the counterfactual impact of individual components on a given prediction. We then present COAR, a scalable algorithm for estimating component attributions; we demonstrate its effectiveness across models, datasets, and modalities. Finally, we show that component attributions estimated with COAR directly enable model editing across five tasks, namely: fixing model errors, ``forgetting'' specific classes, boosting subpopulation robustness, localizing backdoor attacks, and improving robustness to typographic attacks. We provide code for COAR at this https URL .
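A toy illustration of component attribution: COAR itself fits a scalable linear surrogate over many random ablations, but the counterfactual quantity it estimates can be shown by brute force on a stand-in additive "model" whose components are simple functions (the model, components, and input here are all invented for illustration):

```python
def toy_model(x, components, disabled=frozenset()):
    """A stand-in 'model' whose output is a sum of simple component
    functions; in a real network the components would be convolution
    filters or attention heads."""
    return sum(f(x) for i, f in enumerate(components) if i not in disabled)

components = [lambda x: 2 * x, lambda x: -x, lambda x: 0.5]
x = 3.0

# Component attribution: the counterfactual impact of ablating each
# component on the prediction for x.
full = toy_model(x, components)
attributions = [full - toy_model(x, components, disabled={i})
                for i in range(len(components))]
```

For this additive toy, each attribution is exactly the component's own contribution; COAR's point is estimating such counterfactuals efficiently when components interact inside a deep network.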
https://arxiv.org/abs/2404.11534
Robotic systems are becoming pervasive and are being adopted in increasingly many domains, such as manufacturing, healthcare, and space exploration. To this end, software engineering has emerged as a crucial discipline for building maintainable and reusable robotic systems. Robotics software engineering research has received increasing attention, fostering autonomy as a fundamental goal. However, robotics developers still struggle to achieve this goal, given that simulation cannot realistically emulate real-world phenomena. Robots also need to operate in unpredictable and uncontrollable environments, which requires safe and trustworthy self-adaptation capabilities implemented in software. Typical techniques to address these challenges are runtime verification, field-based testing, and mitigation techniques that enable fail-safe solutions. However, there is no clear guidance on architecting ROS-based systems to enable and facilitate runtime verification and field-based testing. This paper aims to fill this gap by providing guidelines that can help developers and QA teams when developing, verifying, or testing their robots in the field. These guidelines are carefully tailored to address the challenges and requirements of testing robotic systems in real-world scenarios. We conducted a literature review on studies addressing runtime verification and field-based testing for robotic systems, mined ROS-based application repositories, and validated the applicability, clarity, and usefulness of the guidelines via two questionnaires with 55 responses. We contribute 20 guidelines formulated for researchers and practitioners in robotic software engineering. Finally, we map our guidelines to the open challenges in runtime verification and field-based testing for ROS-based systems, and we outline promising research directions in the field.
https://arxiv.org/abs/2404.11498
Accurate wind speed forecasting is pivotal to the security of grid dispatching and the application of wind power. However, owing to the nonlinear and non-stationary nature of wind speed series, short-term forecasting is extremely challenging. This paper therefore proposes a short-term wind speed forecasting model based on an attention-enhanced gated recurrent unit network (AtGRU) with an error-correction strategy. The model uses AtGRU as the preliminary predictor and a GRU as the error corrector. First, singular spectrum analysis (SSA) is applied to the historical wind speed series to reduce noise. The denoised historical series is then used to train the predictor, whose predictions inevitably contain errors. The sequence of these errors, processed by variational mode decomposition (VMD), is used to train the error corrector. The final forecast is the sum of the predictor's forecast and the error correction. The proposed SSA-AtGRU-VMD-GRU model outperforms the compared models in three case studies on Woodburn, St. Thomas, and Santa Cruz, indicating that the model markedly improves the accuracy of wind speed forecasts.
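The predictor-plus-corrector structure can be sketched on toy numbers (the "predictor" stands in for AtGRU and the "corrector" for the GRU trained on VMD-processed errors; here the corrector is reduced to the mean historical error purely for illustration):

```python
# Toy series: preliminary forecasts carry a systematic error that the
# corrector learns from the error sequence.
actual    = [5.0, 5.6, 6.1, 6.7]
predicted = [4.6, 5.1, 5.9, 6.2]            # preliminary predictor output
errors    = [a - p for a, p in zip(actual, predicted)]

# Stand-in error corrector: forecast the next error as the mean of the
# observed error sequence (the real corrector is a GRU on VMD modes).
next_pred      = 7.0
next_error_est = sum(errors) / len(errors)
final_forecast = next_pred + next_error_est  # predictor + corrector
```

The final forecast is the sum of the two stages, mirroring the paper's pipeline.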
https://arxiv.org/abs/2404.11422
Recent diffusion probabilistic models (DPMs) in the field of pansharpening have been gradually gaining attention and have achieved state-of-the-art (SOTA) performance. In this paper, we identify shortcomings in directly applying DPMs to the task of pansharpening as an inverse problem: 1) initiating sampling directly from Gaussian noise neglects the low-resolution multispectral image (LRMS) as a prior; 2) low sampling efficiency often necessitates a higher number of sampling steps. We first reformulate pansharpening into the stochastic differential equation (SDE) form of an inverse problem. Building upon this, we propose a Schrödinger bridge (SB) matching method that addresses both issues, and we design an efficient deep neural network architecture tailored for the proposed SB matching. In comparison to the well-established DL regression-based framework and the recent DPM framework, our method demonstrates SOTA performance with fewer sampling steps. Moreover, we discuss the relationship between SB matching and other methods based on SDEs and ordinary differential equations (ODEs), as well as its connection with optimal transport. Code will be available.
https://arxiv.org/abs/2404.11416
Human motion understanding is a fundamental task with diverse practical applications, facilitated by the availability of large-scale motion capture datasets. Recent studies focus on text-motion tasks, such as text-based motion generation, editing and question answering. In this study, we introduce the novel task of text-based human motion grounding (THMG), aimed at precisely localizing temporal segments corresponding to given textual descriptions within untrimmed motion sequences. Capturing global temporal information is crucial for the THMG task. However, transformer-based models that rely on global temporal self-attention face challenges when handling long untrimmed sequences due to the quadratic computational cost. We address these challenges by proposing Text-controlled Motion Mamba (TM-Mamba), a unified model that integrates temporal global context, language query control, and spatial graph topology with only linear memory cost. The core of the model is a text-controlled selection mechanism which dynamically incorporates global temporal information based on text query. The model is further enhanced to be topology-aware through the integration of relational embeddings. For evaluation, we introduce BABEL-Grounding, the first text-motion dataset that provides detailed textual descriptions of human actions along with their corresponding temporal segments. Extensive evaluations demonstrate the effectiveness of TM-Mamba on BABEL-Grounding.
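Scoring a THMG prediction comes down to comparing a predicted temporal segment against the annotated one; temporal intersection-over-union is the standard way to do this (the abstract does not spell out its metric, so this is a generic sketch, not necessarily the paper's exact protocol):

```python
def temporal_iou(pred, gt):
    """Overlap score between a predicted temporal segment and a
    ground-truth one, each given as (start_frame, end_frame).
    A standard metric for judging temporal grounding quality."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

score = temporal_iou((10, 40), (20, 50))   # 20 frames overlap, 40 union
```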
https://arxiv.org/abs/2404.11375
Aerial unperching of multirotors has received little attention, as opposed to perching, which has been investigated to extend operation time. This study presents a new aerial robot capable of both perching on and unperching from a ferromagnetic surface autonomously during flight, and a switching controller that avoids rotor saturation and mitigates overshoot during the transition between free flight and perching. To enable stable perching and unperching maneuvers on/from a vertical surface, a lightweight ($\approx$ $1$ \si{kg}), fully actuated tiltrotor that can hover at a $90^\circ$ pitch angle is first developed. We design a perching/unperching module composed of a single servomotor and a magnet, which is then mounted on the tiltrotor. A switching controller including exclusive control modes for transitions between free flight and perching is proposed. Lastly, we propose a simple yet effective strategy to ensure robust perching in the presence of measurement and control errors and to avoid collisions with the perching site immediately after unperching. We validate the proposed framework in experiments where the tiltrotor successfully performs perching and unperching on/from a vertical surface during flight. We further show the effectiveness of the proposed transition mode in the switching controller through ablation studies in which large overshoot and even collision with the perching site occur. To the best of the authors' knowledge, this work presents the first autonomous aerial unperching framework using a fully actuated tiltrotor.
https://arxiv.org/abs/2404.11310
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration, but overlook the modeling of close interactions. In this work, we tackle the task of reconstructing closely interactive humans from a monocular video. The main challenge of this task comes from insufficient visual information caused by depth ambiguity and severe inter-person occlusion. In view of this, we propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information. This is based on the observation that human interaction follows specific patterns dictated by social proxemics. Specifically, we first design a latent representation based on a Vector Quantised-Variational AutoEncoder (VQ-VAE) to model human interaction. A proxemics- and physics-guided diffusion model is then introduced to denoise the initial distribution. We design the diffusion model with dual branches, each branch representing one individual, such that the interaction can be modeled via cross-attention. With the learned priors of the VQ-VAE and physical constraints as additional information, our proposed approach is capable of estimating accurate poses that are also plausible under proxemics and physics. Experimental results on Hi4D, 3DPW, and CHI3D demonstrate that our method outperforms existing approaches. The code is available at \url{this https URL}.
https://arxiv.org/abs/2404.11291
In short video and live broadcasts, speech, singing voice, and background music often overlap and obscure each other. This complexity creates difficulties in structuring and recognizing the audio content, which may impair subsequent ASR and music understanding applications. This paper proposes a multi-task audio source separation (MTASS) based ASR model called JRSV, which Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing background music. The CTC/attention hybrid recognition module recognizes both tracks. Online distillation is proposed to improve the robustness of recognition further. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV can significantly improve recognition accuracy on each track of the mixed audio.
https://arxiv.org/abs/2404.11275
Transformer-based models have achieved remarkable results in low-level vision tasks, including image super-resolution (SR). However, early Transformer-based approaches that rely on self-attention within non-overlapping windows encounter challenges in acquiring global information. To activate more input pixels globally, hybrid attention models have been proposed. Moreover, training by solely minimizing pixel-wise RGB losses, such as L1, has been found inadequate for capturing essential high-frequency details. This paper presents two contributions: i) we introduce convolutional non-local sparse attention (NLSA) blocks to extend the hybrid transformer architecture and further enhance its receptive field; ii) we employ wavelet losses to train Transformer models, improving quantitative and subjective performance. While wavelet losses have been explored previously, showing their power in training Transformer-based SR models is novel. Our experimental results demonstrate that the proposed model provides state-of-the-art PSNR results as well as superior visual performance across various benchmark datasets.
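The idea behind a wavelet loss is to penalize errors in wavelet subbands rather than raw pixels, so high-frequency detail is weighted explicitly. A minimal 1D sketch with a one-level Haar transform (the paper operates on 2D images with its own wavelet and weighting choices; this only shows the principle):

```python
def haar_1d(signal):
    """One-level 1D Haar transform: pairwise averages (low-pass) and
    pairwise differences (high-pass, the high-frequency detail a
    wavelet loss emphasizes). Assumes even length."""
    lo = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    hi = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return lo, hi

def wavelet_l1(pred, target):
    """L1 loss computed on Haar subbands instead of raw samples."""
    pl, ph = haar_1d(pred)
    tl, th = haar_1d(target)
    return sum(abs(a - b) for a, b in zip(pl + ph, tl + th))

loss = wavelet_l1([1.0, 3.0, 2.0, 2.0], [1.0, 1.0, 2.0, 2.0])
```

Note how the high-pass difference between prediction and target contributes to the loss even where the pixel averages agree.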
https://arxiv.org/abs/2404.11273
Recently, backdoor attacks have posed a serious security threat to the training process of deep neural networks (DNNs). The attacked model behaves normally on benign samples but outputs a specific result when the trigger is present. However, compared with the rapid progress of backdoor attacks, existing defenses struggle to deal with these threats effectively or require benign samples to work, which may be unavailable in real scenarios. In this paper, we find that poisoned samples and benign samples can be distinguished by prediction entropy. This inspires us to propose a novel dual-network training framework: The Victim and The Beneficiary (V&B), which exploits a poisoned model to train a clean model without extra benign samples. Firstly, we sacrifice the Victim network, training it on suspicious samples so that it becomes a powerful poisoned-sample detector. Secondly, we train the Beneficiary network on the credible samples selected by the Victim to inhibit backdoor injection. Thirdly, a semi-supervised suppression strategy is adopted to erase potential backdoors and improve model performance. Furthermore, to better suppress missed poisoned samples, we propose a strong data augmentation method, AttentionMix, which works well with our proposed V&B framework. Extensive experiments on two widely used datasets against 6 state-of-the-art attacks demonstrate that our framework is effective in preventing backdoor injection and robust to various attacks while maintaining performance on benign samples. Our code is available at this https URL.
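The separating signal is ordinary Shannon entropy over a model's softmax output. A sketch of the statistic (the thresholding and how the Victim uses it are the paper's contribution; the probability vectors below are invented examples):

```python
import math

def prediction_entropy(probs):
    """Shannon entropy of a softmax output. The paper's observation is
    that poisoned samples yield markedly different prediction entropy
    than benign ones, letting the Victim network flag suspicious
    samples without any extra benign data."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = prediction_entropy([0.97, 0.01, 0.01, 0.01])  # near-certain
uniform   = prediction_entropy([0.25, 0.25, 0.25, 0.25])  # maximal entropy
```

Entropy is maximized (log of the class count) for a uniform prediction and near zero for a confident one, so a threshold on it partitions samples into suspicious and credible sets.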
https://arxiv.org/abs/2404.11265
Large Language Models (LLMs) are trained on large-scale web data, which makes it difficult to grasp the contribution of each text. This poses the risk of leaking inappropriate data such as benchmarks, personal information, and copyrighted texts included in the training data. Membership Inference Attacks (MIAs), which determine whether a given text is included in a model's training data, have been attracting attention. Previous studies of MIAs revealed that likelihood-based classification is effective for detecting leaks in LLMs. However, the existing methods cannot be applied to some proprietary models like ChatGPT or Claude 3 because the likelihood is unavailable to the user. In this study, we propose a Sampling-based Pseudo-Likelihood (\textbf{SPL}) method for MIA (\textbf{SaMIA}) that calculates SPL using only the text generated by an LLM to detect leaks. SaMIA treats the target text as the reference text and multiple outputs from the LLM as text samples, calculates the degree of $n$-gram match as the SPL, and determines the membership of the text in the training data. Even without likelihoods, SaMIA performed on par with existing likelihood-based methods.
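The SPL statistic is just an $n$-gram match between the reference text and model samples, averaged over samples. A sketch under simplifying assumptions (whitespace tokenization, bigrams, and a plain hit ratio; the paper's exact matching score and tokenization may differ):

```python
def ngram_overlap(reference, sample, n=2):
    """Fraction of the reference's n-grams that also appear in one
    model-generated sample."""
    ref_tokens, smp_tokens = reference.split(), sample.split()
    ref_ngrams = [tuple(ref_tokens[i:i + n])
                  for i in range(len(ref_tokens) - n + 1)]
    smp_ngrams = {tuple(smp_tokens[i:i + n])
                  for i in range(len(smp_tokens) - n + 1)}
    hits = sum(1 for g in ref_ngrams if g in smp_ngrams)
    return hits / len(ref_ngrams) if ref_ngrams else 0.0

def spl(reference, samples, n=2):
    """Sampling-based pseudo-likelihood: mean n-gram match over samples.
    A high SPL suggests the reference leaked into training data."""
    return sum(ngram_overlap(reference, s, n) for s in samples) / len(samples)

score = spl("the cat sat on the mat",
            ["the cat sat on a mat", "a dog sat on the mat"])
```

No token likelihoods are needed — only generated text — which is what makes the attack applicable to closed APIs.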
https://arxiv.org/abs/2404.11262
Gesture recognition based on surface electromyography (sEMG) has been gaining importance in many 3D interactive scenes. However, sEMG is easily influenced by various forms of noise in real-world environments, leading to challenges in providing long-term stable interactions through sEMG. Existing methods often struggle to enhance model noise resilience through various predefined data augmentation techniques. In this work, we revisit the problem from a short-term enhancement perspective to improve precision and robustness against various common noisy scenarios, with learnable denoising that exploits sEMG's intrinsic pattern information and sliding-window attention. We propose a Short Term Enhancement Module (STEM) which can be easily integrated with various models. STEM offers several benefits: 1) learnable denoising, enabling noise reduction without manual data augmentation; 2) scalability, being adaptable to various models; and 3) cost-effectiveness, achieving short-term enhancement through minimal weight sharing in an efficient attention mechanism. In particular, we incorporate STEM into a transformer, creating the Short Term Enhanced Transformer (STET). Compared with the best-competing approaches, the impact of noise on STET is reduced by more than 20%. We also report promising results on both classification and regression datasets and demonstrate that STEM generalizes across different gesture recognition tasks.
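Sliding-window attention restricts each frame to a local temporal neighborhood, which is what keeps the enhancement "short-term" and cheap. A sketch of just the mask (the 1/0 convention, window size, and function name are illustrative; the real STEM also learns denoising weights shared across windows):

```python
def sliding_window_mask(seq_len, window):
    """Local attention mask: each sEMG frame attends only to frames
    within +/- `window` steps, so attention cost stays linear in the
    sequence length instead of quadratic."""
    return [[1 if abs(i - j) <= window else 0 for j in range(seq_len)]
            for i in range(seq_len)]

m = sliding_window_mask(seq_len=5, window=1)
```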
https://arxiv.org/abs/2404.11213
Reliable prediction of vehicle trajectories at signalized intersections is crucial to urban traffic management and autonomous driving systems. However, it presents unique challenges, due to the complex roadway layout at intersections, involvement of traffic signal controls, and interactions among different types of road users. To address these issues, we present in this paper a novel model called Knowledge-Informed Generative Adversarial Network (KI-GAN), which integrates both traffic signal information and multi-vehicle interactions to predict vehicle trajectories accurately. Additionally, we propose a specialized attention pooling method that accounts for vehicle orientation and proximity at intersections. Based on the SinD dataset, our KI-GAN model is able to achieve an Average Displacement Error (ADE) of 0.05 and a Final Displacement Error (FDE) of 0.12 for a 6-second observation and 6-second prediction cycle. When the prediction window is extended to 9 seconds, the ADE and FDE values are further reduced to 0.11 and 0.26, respectively. These results demonstrate the effectiveness of the proposed KI-GAN model in vehicle trajectory prediction under complex scenarios at signalized intersections, which represents a significant advancement in the target field.
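The reported ADE and FDE are standard trajectory-prediction metrics and are easy to state precisely: mean Euclidean displacement over all predicted timesteps, and displacement at the final timestep. A minimal sketch on an invented 2D trajectory:

```python
import math

def ade_fde(pred, gt):
    """Average Displacement Error: mean Euclidean distance between
    predicted and ground-truth positions over all timesteps.
    Final Displacement Error: the distance at the last timestep."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]

pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
gt   = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = ade_fde(pred, gt)
```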
https://arxiv.org/abs/2404.11181
Constructing vectorized high-definition maps from surround-view cameras has garnered significant attention in recent years. However, the multi-stage sequential workflow commonly employed in prevailing approaches often leads to the loss of early-stage information, particularly in perspective-view features. Usually, such loss manifests as missing instances or mismatched shapes in the final bird's-eye-view predictions. To address this concern, we propose a novel approach, namely \textbf{HybriMap}, which effectively exploits clues from hybrid features to ensure the delivery of valuable information. Specifically, we design the Dual Enhancement Module to enable both explicit integration and implicit modification under the guidance of hybrid features. Additionally, the perspective keypoints are utilized as supervision, further directing the feature enhancement process. Extensive experiments conducted on existing benchmarks have demonstrated the state-of-the-art performance of our proposed approach.
https://arxiv.org/abs/2404.11155
ICD (International Classification of Diseases) coding involves assigning ICD codes to patient visits based on their medical notes. It is a challenging multi-label text classification problem due to noisy medical document inputs. Recent advancements in automated ICD coding have enhanced performance by integrating additional data and knowledge bases with the encoding of medical notes and codes. However, most of them ignore the code hierarchy, leading to improper code assignments. To address these problems, we propose a novel framework based on associated and hierarchical code description distillation (AHDD) for better code representation learning and avoidance of improper code assignment. Specifically, we leverage the code descriptions and the hierarchical structure inherent to the ICD codes; the code descriptions also inform the attention layer and the output layer. Experimental results on the benchmark dataset show the superiority of the proposed framework over several state-of-the-art baselines.
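Part of the hierarchy such a framework exploits can be read off the code strings themselves. A simplified sketch for ICD-10-style dotted codes, where characters after the dot refine the category (real ICD hierarchies also group categories into blocks and chapters, which this ignores):

```python
def icd_parents(code):
    """Ancestor codes implied by an ICD-10-style code string: peel off
    one trailing character at a time down to the three-character
    category, e.g. 'E11.21' -> ['E11.2', 'E11']. An assignment is
    hierarchy-consistent only if it respects these ancestors."""
    parents = []
    while "." in code:
        code = code[:-1].rstrip(".")
        parents.append(code)
    return parents
```

A hierarchy-aware coder can use such parent chains to, for example, penalize predicting a subcategory whose category the model rejects.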
https://arxiv.org/abs/2404.11132