Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground-truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodiment datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate the key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at this https URL
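As a concrete illustration of the objective described above, here is a minimal, hypothetical sketch of joint latent forward/inverse dynamics training; the encoder, dimensions, and stop-gradient on the target are illustrative assumptions, not the authors' implementation:

```python
# Hypothetical sketch of a DynaMo-style objective: an encoder maps frames to
# latents, an inverse-dynamics head infers a latent "action" from consecutive
# latents, and a forward model predicts the next latent from the current one
# plus that inferred action. Module choices and sizes are assumptions.
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    def __init__(self, obs_dim=3 * 64 * 64, z_dim=128, a_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(obs_dim, z_dim))
        self.inverse = nn.Linear(2 * z_dim, a_dim)            # (z_t, z_{t+1}) -> latent action
        self.forward_model = nn.Linear(z_dim + a_dim, z_dim)  # (z_t, a_t) -> z_{t+1}

    def loss(self, frames):                       # frames: (B, T, C, H, W), no ground-truth actions
        B, T = frames.shape[:2]
        z = self.encoder(frames.reshape(B * T, *frames.shape[2:])).reshape(B, T, -1)
        z_t, z_next = z[:, :-1], z[:, 1:]
        a = self.inverse(torch.cat([z_t, z_next], dim=-1))
        z_pred = self.forward_model(torch.cat([z_t, a], dim=-1))
        # Predict the next frame in latent space; stopping gradients on the target
        # is one common anti-collapse choice, assumed here for illustration.
        return ((z_pred - z_next.detach()) ** 2).mean()

model = LatentDynamics()
demo_frames = torch.randn(4, 10, 3, 64, 64)       # toy stand-in for expert demonstrations
print(model.loss(demo_frames).item())
```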
https://arxiv.org/abs/2409.12192
We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at \url{this https URL}.
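To make the dynamic-resolution idea concrete, the toy function below maps an arbitrary image resolution to a variable visual-token count; the patch size of 14 and the 2x2 token merge are assumptions for illustration, not confirmed Qwen2-VL settings:

```python
# A minimal sketch of "dynamic resolution -> variable token count": keep the
# image's native aspect ratio, round each side up to the patch grid, and derive
# the number of visual tokens from the resulting grid after token merging.
import math

def num_visual_tokens(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    grid_h = max(merge, math.ceil(height / patch))
    grid_w = max(merge, math.ceil(width / patch))
    # Round the grid up so it stays divisible by the merge factor.
    grid_h += (-grid_h) % merge
    grid_w += (-grid_w) % merge
    return (grid_h * grid_w) // (merge * merge)

for h, w in [(224, 224), (448, 1344), (1080, 1920)]:
    print(f"{h}x{w} image -> {num_visual_tokens(h, w)} visual tokens")
```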
https://arxiv.org/abs/2409.12191
In tackling the challenge of Multi-Document Summarization (MDS), numerous methods have been proposed, spanning both extractive and abstractive summarization techniques. However, each approach has its own limitations, making it less effective to rely solely on either one. An emerging and promising strategy involves a synergistic fusion of extractive and abstractive summarization methods. Despite the plethora of studies in this domain, research on the combined methodology remains scarce, particularly in the context of Vietnamese language processing. This paper presents a novel Vietnamese MDS framework leveraging a two-component pipeline architecture that integrates extractive and abstractive techniques. The first component employs an extractive approach to identify key sentences within each document. This is achieved by a modification of the pre-trained BERT network, which derives semantically meaningful phrase embeddings using siamese and triplet network structures. The second component utilizes the VBD-LLaMA2-7B-50b model for abstractive summarization, ultimately generating the final summary document. Our proposed framework demonstrates strong performance, attaining a ROUGE-2 score of 39.6% on the VN-MDS dataset and outperforming the state-of-the-art baselines.
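A hedged sketch of such an extract-then-abstract pipeline follows; the embedding model, the centroid-similarity selection heuristic, and the generic summarizer are stand-ins, not the paper's Vietnamese SBERT variant or the VBD-LLaMA2-7B-50b model:

```python
# Two-stage pipeline sketch: (1) pick key sentences per document with sentence
# embeddings, (2) feed the selected sentences to an abstractive summarizer.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

def extract_key_sentences(docs, top_k=3, encoder_name="all-MiniLM-L6-v2"):
    encoder = SentenceTransformer(encoder_name)
    selected = []
    for doc in docs:
        sents = [s.strip() for s in doc.split(".") if s.strip()]
        emb = encoder.encode(sents, normalize_embeddings=True)
        centroid = emb.mean(axis=0, keepdims=True)
        scores = (emb @ centroid.T).ravel()              # cosine similarity to the centroid
        keep = np.argsort(-scores)[:top_k]
        selected.extend(sents[i] for i in sorted(keep))  # keep original sentence order
    return selected

def abstractive_summary(sentences, generator_name="facebook/bart-large-cnn"):
    summarizer = pipeline("summarization", model=generator_name)
    return summarizer(" ".join(sentences), max_length=120, min_length=30)[0]["summary_text"]

# docs = [open(path).read() for path in document_paths]
# print(abstractive_summary(extract_key_sentences(docs)))
```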
https://arxiv.org/abs/2409.12134
Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from reconstructed speech for verification, but face limitations such as separate training processes for the watermark and the codec, and insufficient cross-modal information integration, leading to reduced watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both the imperceptibility and the extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper feature integration of the watermark and the speech, reducing the impact of quantization noise on the watermark. Experimental results show that WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in watermark extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.
https://arxiv.org/abs/2409.12121
Understanding how humans process visual information is one of the crucial steps toward unraveling the underlying mechanisms of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task: given fMRI data recorded in response to visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit the ability of modern LDMs to effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generation. Specifically, inspired by the two-streams hypothesis, which suggests that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.
https://arxiv.org/abs/2409.12099
As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to combine VLMs for downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks, requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture-of-soft-prompts learning method incorporating a routing module. This module is able to capture a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure that the router selects prompts based on their similarity to hard prompt templates, which both retains knowledge from the hard prompts and improves selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text features and the hard-prompt-encoded text features. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating evident improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at \url{https://anonymous.4open.science/r/mocoop-6387}
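The snippet below sketches, under stated assumptions, how a router could mix soft prompts while gating the selection by similarity to hard-template embeddings; it is an illustration of the idea, not the authors' code:

```python
# Routing over a mixture of soft prompts: each instance's image feature scores
# the prompts, and the gate adds the instance's similarity to frozen hard-template
# embeddings so that selection stays anchored to the hand-written templates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftPromptRouter(nn.Module):
    def __init__(self, n_prompts=4, prompt_len=8, dim=512, hard_template_emb=None):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, prompt_len, dim) * 0.02)
        self.router = nn.Linear(dim, n_prompts)
        # Frozen text embeddings of the manually designed (hard) templates, one per group.
        self.register_buffer("hard_emb", hard_template_emb
                             if hard_template_emb is not None else torch.randn(n_prompts, dim))

    def forward(self, image_feat):                        # image_feat: (B, dim)
        route_logits = self.router(image_feat)            # (B, n_prompts)
        # Gate: similarity between the instance and each hard-template embedding.
        gate = F.normalize(image_feat, dim=-1) @ F.normalize(self.hard_emb, dim=-1).T
        weights = F.softmax(route_logits + gate, dim=-1)  # similarity-gated routing
        return torch.einsum("bn,nld->bld", weights, self.prompts)  # per-instance prompt mix

router = SoftPromptRouter()
print(router(torch.randn(2, 512)).shape)                  # torch.Size([2, 8, 512])
```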
https://arxiv.org/abs/2409.12011
Vector embeddings derived from large language models (LLMs) show promise in capturing latent information from the literature. Interestingly, these can be integrated into material embeddings, potentially useful for data-driven predictions of materials properties. We investigate the extent to which LLM-derived vectors capture the desired information and their potential to provide insights into material properties without additional training. Our findings indicate that, although LLMs can be used to generate representations reflecting certain property information, extracting the embeddings requires identifying the optimal contextual clues and appropriate comparators. Despite this restriction, it appears that LLMs still have the potential to be useful in generating meaningful materials-science representations.
https://arxiv.org/abs/2409.11971
Personalization plays a critical role in numerous language tasks and applications, since users with the same requirements may prefer diverse outputs based on their individual interests. This has led to the development of various personalized approaches aimed at adapting large language models (LLMs) to generate customized outputs aligned with user preferences. Some of them involve fine-tuning a unique personalized LLM for each user, which is too expensive for widespread application. Alternative approaches introduce personalization information in a plug-and-play manner by retrieving the user's relevant historical texts as demonstrations. However, this retrieval-based strategy may break the continuity of the user history and fail to capture the user's overall styles and patterns, hence leading to sub-optimal performance. To address these challenges, we propose a novel personalized LLM model, \ours{}. It constructs a user-specific embedding for each individual by modeling all her historical contexts through a lightweight plug-in user embedder module. By attaching this embedding to the task input, LLMs can better understand and capture user habits and preferences, thereby producing more personalized outputs without tuning their own parameters. Extensive experiments on various tasks in the language model personalization (LaMP) benchmark demonstrate that the proposed model significantly outperforms existing personalized LLM approaches.
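A minimal sketch of the plug-in user-embedder idea follows, using GPT-2 only as a convenient stand-in backbone; the mean-pooling module and the single prepended user token are illustrative assumptions rather than the paper's exact architecture:

```python
# Pool a user's historical texts into one embedding and prepend it to the
# (frozen) LLM's input embeddings so the LLM itself is never fine-tuned.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModelForCausalLM.from_pretrained("gpt2")
for p in llm.parameters():
    p.requires_grad = False                      # the LLM's own parameters stay frozen

class UserEmbedder(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)          # lightweight trainable plug-in

    def forward(self, history_texts):
        pooled = []
        for text in history_texts:
            ids = tok(text, return_tensors="pt").input_ids
            token_emb = llm.get_input_embeddings()(ids)       # (1, L, dim)
            pooled.append(token_emb.mean(dim=1))              # mean-pool each history item
        return self.proj(torch.stack(pooled).mean(dim=0))     # (1, dim) user embedding

embedder = UserEmbedder()
user_vec = embedder(["my past review ...", "another post ..."])
task_ids = tok("Write a headline for this article:", return_tensors="pt").input_ids
task_emb = llm.get_input_embeddings()(task_ids)
inputs_embeds = torch.cat([user_vec.unsqueeze(1), task_emb], dim=1)  # prepend user token
print(llm(inputs_embeds=inputs_embeds).logits.shape)
```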
https://arxiv.org/abs/2409.11901
Over the past decade, there has been steady progress in enhancing face recognition algorithms by leveraging advanced machine learning methods. The loss function plays a pivotal, often game-changing role in addressing face verification problems. These loss functions have mainly explored variations in intra-class or inter-class separation. This research examines the natural phenomenon of facial symmetry in the face verification problem. The symmetry between the left and right hemi-faces has been widely used in many research areas in recent decades. This paper adopts this simple approach judiciously by splitting the face image vertically into two halves. Under the assumption that the natural phenomenon of facial symmetry can enhance face verification methodology, we hypothesize that the two output embedding vectors of the split faces must project close to each other in the output embedding space. Inspired by this concept, we penalize the network based on the disparity between the embeddings of the symmetrical pair of split faces. The symmetrical loss has the potential to minimize minor asymmetric features due to facial expression and lighting conditions, hence significantly increasing the inter-class variance and leading to more reliable face embeddings. This loss function propels any network to outperform its baseline performance across all existing network architectures and configurations, enabling us to achieve SoTA results.
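The snippet below is a minimal sketch of such a symmetrical loss, assuming a generic face encoder and a squared-distance penalty between the embeddings of mirrored half-faces; it would be added to a standard verification loss rather than replace it:

```python
# Split each face vertically, embed both halves with the same backbone, and
# penalize the distance between the two embeddings. Backbone and penalty are
# illustrative choices, not the paper's exact network.
import torch
import torch.nn as nn

backbone = nn.Sequential(                      # stand-in for any face encoder
    nn.Flatten(), nn.Linear(3 * 112 * 56, 256)
)

def symmetry_loss(face_batch):                 # face_batch: (B, 3, 112, 112)
    left, right = face_batch[..., :56], face_batch[..., 56:]
    right = torch.flip(right, dims=[-1])       # mirror so both halves share orientation
    z_left = nn.functional.normalize(backbone(left), dim=-1)
    z_right = nn.functional.normalize(backbone(right), dim=-1)
    return ((z_left - z_right) ** 2).sum(dim=-1).mean()

faces = torch.randn(8, 3, 112, 112)
print(symmetry_loss(faces).item())             # added to an ArcFace/softmax verification term
```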
https://arxiv.org/abs/2409.11816
Identification of suspects based on partial and smudged fingerprints, commonly referred to as fingermarks or latent fingerprints, presents a significant challenge in the field of fingerprint recognition. Although fixed-length embeddings have shown effectiveness in recognising rolled and slap fingerprints, the methods for matching latent fingerprints have primarily centred around local minutiae-based embeddings, failing to fully exploit global representations for matching purposes. Consequently, enhancing latent fingerprints becomes critical to ensuring robust identification for forensic investigations. Current approaches often prioritise restoring ridge patterns, overlooking the finer details crucial for accurate fingerprint recognition. To address this, we propose a novel approach that uses generative adversarial networks (GANs) to redefine Latent Fingerprint Enhancement (LFE) through a structured approach to fingerprint generation. By directly optimising the minutiae information during the generation process, the model produces enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth instances. This leads to a significant improvement in identification performance. Our framework integrates minutiae locations and orientation fields, ensuring the preservation of both local and structural fingerprint features. Extensive evaluations conducted on two publicly available datasets demonstrate our method's dominance over existing state-of-the-art techniques, highlighting its potential to significantly enhance latent fingerprint recognition accuracy in forensic applications.
https://arxiv.org/abs/2409.11802
Achieving human-like memory recall in artificial systems remains a challenging frontier in computer vision. Humans demonstrate a remarkable ability to recall images after a single exposure, even after being shown thousands of images. However, this capacity diminishes significantly when confronted with non-natural stimuli such as random textures. In this paper, we present a method inspired by human memory processes to bridge this gap between artificial and biological memory systems. Our approach focuses on encoding images to mimic the high-level information retained by the human brain, rather than storing raw pixel data. By adding noise to images before encoding, we introduce variability akin to the non-deterministic nature of human memory encoding. Leveraging the embedding layers of pre-trained models, we explore how different architectures encode images and their impact on memory recall. Our method achieves impressive results, with 97% accuracy on natural images and near-random performance (52%) on textures. We provide insights into the encoding process and its implications for machine learning memory systems, shedding light on the parallels between human and artificial intelligence memory mechanisms.
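A hedged sketch of the memorize-then-recall setup follows; the ResNet-18 backbone, noise level, and cosine-similarity threshold are assumptions chosen for illustration, not the paper's configuration:

```python
# Add noise before encoding with a frozen pretrained embedding, store the
# embeddings, and test recognition by nearest-neighbour lookup.
import torch
import torchvision.models as models

encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()               # keep the penultimate features
encoder.eval()

@torch.no_grad()
def memorize(images, noise_std=0.1):
    noisy = images + noise_std * torch.randn_like(images)   # non-deterministic encoding
    return torch.nn.functional.normalize(encoder(noisy), dim=-1)

@torch.no_grad()
def recall(query, memory_bank, threshold=0.9):
    q = torch.nn.functional.normalize(encoder(query), dim=-1)
    sims = q @ memory_bank.T                                 # cosine similarity to memories
    return sims.max(dim=-1).values > threshold               # "seen before" decision

seen = torch.rand(16, 3, 224, 224)
bank = memorize(seen)
print(recall(seen[:4], bank))                                # expected: mostly True
```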
https://arxiv.org/abs/2409.11750
Multi-camera perception methods in Bird's-Eye-View (BEV) have gained wide application in autonomous driving. However, due to the differences between roadside and vehicle-side scenarios, a multi-camera BEV solution for the roadside is currently lacking. This paper systematically analyzes the key challenges in multi-camera BEV perception for roadside scenarios compared to vehicle-side ones. These challenges include the diversity in camera poses, the uncertainty in camera numbers, the sparsity of perception regions, and the ambiguity in orientation angles. In response, we introduce RopeBEV, the first dense multi-camera BEV approach. RopeBEV introduces BEV augmentation to address the training imbalance caused by diverse camera poses. By incorporating CamMask and ROIMask (Region of Interest Mask), it supports variable camera numbers and sparse perception, respectively. Finally, camera rotation embedding is utilized to resolve orientation ambiguity. Our method ranks 1st on the real-world highway dataset RoScenes and demonstrates its practical value on a private urban dataset covering more than 50 intersections and 600 cameras.
https://arxiv.org/abs/2409.11706
Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation. Building on the success of text-based LLMs, recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance in automatic speech recognition (ASR) and automatic speech translation (AST). In this work, we propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM. The Speech-LLM model consists of a speech encoder and an encoder-decoder Megatron-T5. By first decoding speech to generate ASR transcripts and subsequently using these transcripts along with the encoded speech for prompting, we guide the speech translation in a two-step process akin to chain-of-thought (CoT) prompting. Low-rank adaptation (LoRA) is used to adapt the T5 LLM and shows superior performance to full model fine-tuning. Experimental results show that the proposed CoT prompting significantly improves AST performance, achieving an average increase of 2.4 BLEU points across 6 En->X or X->En AST tasks compared to speech prompting alone. Additionally, compared to a related CoT prediction method that predicts a concatenated sequence of ASR and AST transcripts, our method performs better by an average of 2 BLEU points.
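As a rough, hedged analogue of the two-step CoT prompting (using Whisper as a stand-in rather than the paper's speech encoder plus Megatron-T5 with LoRA), one could first transcribe and then condition the translation pass on that transcript:

```python
# Step 1: decode speech into an ASR transcript. Step 2: run the X->En translation
# pass while conditioning the decoder on that transcript via the initial prompt.
# This only approximates the paper's setup, where encoded speech and the ASR
# transcript jointly prompt an encoder-decoder LLM.
import whisper

model = whisper.load_model("small")

def cot_translate(audio_path):
    asr = model.transcribe(audio_path, task="transcribe")["text"]
    ast = model.transcribe(audio_path, task="translate",
                           initial_prompt=f"Transcript: {asr}")["text"]
    return asr, ast

# asr_text, translation = cot_translate("sample.wav")
```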
https://arxiv.org/abs/2409.11538
Federated learning (FL) has rapidly evolved as a promising paradigm that enables collaborative model training across distributed participants without exchanging their local data. Despite its broad applications in fields such as computer vision, graph learning, and natural language processing, the development of a data projection model that can be effectively used to visualize data in the context of FL is crucial yet remains heavily under-explored. Neighbor embedding (NE) is an essential technique for visualizing complex high-dimensional data, but collaboratively learning a joint NE model is difficult. The key challenge lies in the objective function, as effective visualization algorithms like NE require computing loss functions among pairs of data. In this paper, we introduce \textsc{FedNE}, a novel approach that integrates the \textsc{FedAvg} framework with the contrastive NE technique, without requiring any shareable data. To address the lack of inter-client repulsion, which is crucial for alignment in the global embedding space, we develop a surrogate loss function that each client learns and shares with the others. Additionally, we propose a data-mixing strategy to augment the local data, aiming to mitigate the problems of invisible neighbors and false neighbors introduced by the local $k$NN graphs. We conduct comprehensive experiments on both synthetic and real-world datasets. The results demonstrate that our \textsc{FedNE} can effectively preserve the neighborhood data structures and enhance the alignment in the global embedding space compared to several baseline methods.
https://arxiv.org/abs/2409.11509
Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute), and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it is coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2 m) rather than a position described using natural language (e.g., "next to me"). To address these gaps, we present ELSA, a spatially aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open-vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio, using contrastive learning. ELSA is competitive with the state of the art for both semantic retrieval and 3D source localization. In particular, ELSA achieves a mean audio-to-text and text-to-audio R@1 that is 2.8% above the baseline, and reduces the mean absolute error in 3D source localization by 11.6° relative to the baseline.
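The following is a minimal sketch of the symmetric contrastive objective such a model could use, with placeholder precomputed features standing in for ELSA's audio and text towers:

```python
# Symmetric InfoNCE loss: matched (spatial audio, spatial caption) pairs sit on
# the diagonal of the similarity matrix and are pulled together from both sides.
import torch
import torch.nn.functional as F

def clip_style_loss(audio_emb, text_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

audio_emb = torch.randn(32, 512)   # e.g. features of "a lion roaring behind the listener"
text_emb = torch.randn(32, 512)    # matching spatial captions, row-aligned with the audio
print(clip_style_loss(audio_emb, text_emb).item())
```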
https://arxiv.org/abs/2409.11369
The deployment of multimodal large language models (MLLMs) has demonstrated remarkable success in engaging in conversations involving visual inputs, thanks to the superior power of large language models (LLMs). These MLLMs are typically built on an LLM, with an image encoder that processes images into the token embedding space of the LLM. However, the integration of the visual modality has introduced a unique vulnerability: the MLLM becomes susceptible to malicious visual inputs and prone to generating sensitive or harmful responses, even though the LLM has been trained on textual datasets to align with human values. In this paper, we first raise the question: "Do MLLMs possess safety awareness against malicious image inputs?" We find that after adding a principle that specifies the safety requirement into the input of the MLLM, the model's safety awareness is boosted. This phenomenon verifies the existence of the MLLM's safety awareness against image inputs; it is merely weakened by the modality gap. We then introduce a simple yet effective technique termed CoCA, which amplifies the safety awareness of the MLLM by calibrating its output distribution. Our proposed strategy helps the model reclaim its original safety awareness without losing its original capabilities. We verify the effectiveness of our approach on both multimodal safety and understanding benchmarks.
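A hedged, text-only sketch of the calibration idea is below: it compares next-token logits with and without a prepended safety principle and amplifies the difference; the scaling factor and prompt wording are assumptions, and the image input of a real MLLM is omitted for brevity:

```python
# Shift the output distribution toward the safety-conditioned behaviour at
# decoding time by amplifying the logit difference induced by the principle.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

SAFETY_PRINCIPLE = "You must refuse requests that could cause harm.\n"

tok = AutoTokenizer.from_pretrained("gpt2")          # small text-only stand-in for an MLLM
lm = AutoModelForCausalLM.from_pretrained("gpt2")

@torch.no_grad()
def calibrated_next_token_logits(user_prompt, alpha=1.5):
    plain = tok(user_prompt, return_tensors="pt").input_ids
    guided = tok(SAFETY_PRINCIPLE + user_prompt, return_tensors="pt").input_ids
    logits_plain = lm(plain).logits[:, -1, :]
    logits_guided = lm(guided).logits[:, -1, :]
    return logits_plain + alpha * (logits_guided - logits_plain)

print(calibrated_next_token_logits("Describe the image above.").shape)
```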
https://arxiv.org/abs/2409.11365
Numerous methods have been proposed to adapt a pre-trained foundational CLIP model for few-shot classification. As CLIP is trained on a large corpus, it generalises well through adaptation to few-shot classification. In this work, we analyse the intra-modal overlap in the image space in terms of embedding representation. Our analysis shows that, due to contrastive learning, embeddings from the CLIP model exhibit a large overlap between the cosine-similarity distributions of paired and unpaired examples in the image space, affecting the performance of few-shot, training-free classification methods that rely on similarity in the image space for their predictions. To tackle intra-modal overlap, we propose to train a lightweight adapter on a generic set of samples from the Google Open Images dataset, demonstrating that this improves accuracy for few-shot, training-free classification. We validate our contribution through extensive empirical analysis and demonstrate that reducing the intra-modal overlap leads to a) improved performance on a number of standard datasets, b) increased robustness to distribution shift, and c) higher feature variance, rendering the features more discriminative for downstream tasks.
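Below is a small sketch of a lightweight residual adapter over frozen CLIP image features; the residual blend ratio and the idea of training it on generic Open Images samples are illustrative assumptions rather than the paper's exact recipe:

```python
# A residual linear adapter on top of frozen CLIP image features, intended to
# reshape the image-space similarity distributions of paired vs. unpaired examples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageAdapter(nn.Module):
    def __init__(self, dim=512, alpha=0.5):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.alpha = alpha

    def forward(self, clip_feat):
        adapted = self.fc(clip_feat)
        out = self.alpha * adapted + (1 - self.alpha) * clip_feat   # residual blend
        return F.normalize(out, dim=-1)

adapter = ImageAdapter()
feats = F.normalize(torch.randn(8, 512), dim=-1)   # stand-in for frozen CLIP image features
print(adapter(feats).shape)
```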
https://arxiv.org/abs/2409.11338
Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these values follow the well-known formula for variance in statistics and provide an efficient sequential computation method. Then, by observing embeddings from intermediate layers of several Transformer models, we found a strong trade-off relationship between the norm and the variance: as the mean embedding becomes closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models. Furthermore, when the sets of token embeddings are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance. We found experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance relatively decreases, and the within-cluster variance relatively increases. These results are consistent with existing studies on the anisotropy of the embedding spaces across layers.
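The decomposition can be checked numerically; the sketch below computes the total, within-cluster, and between-cluster variances for a toy set of token embeddings and confirms that the first equals the sum of the other two:

```python
# Total variance of all contextualized embeddings = within-cluster variance
# (around each token's mean) + between-cluster variance (of the token means
# around the global mean), with clusters weighted by their sizes.
import numpy as np

def variance_decomposition(embeddings_by_token):
    all_emb = np.concatenate(list(embeddings_by_token.values()))
    n, mu = len(all_emb), all_emb.mean(axis=0)
    total = ((all_emb - mu) ** 2).sum(axis=1).mean()
    within = sum(((e - e.mean(axis=0)) ** 2).sum(axis=1).sum()
                 for e in embeddings_by_token.values()) / n
    between = sum(len(e) * ((e.mean(axis=0) - mu) ** 2).sum()
                  for e in embeddings_by_token.values()) / n
    return total, within, between

rng = np.random.default_rng(0)
emb = {"bank": rng.normal(1.0, 1.0, (50, 16)), "river": rng.normal(-1.0, 0.5, (30, 16))}
total, within, between = variance_decomposition(emb)
print(round(total, 4), round(within + between, 4))   # the two numbers match
```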
https://arxiv.org/abs/2409.11253
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target re-identification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially under challenging tracking conditions such as object deformation and blurring. To address the above-mentioned issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in a sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embeddings based on adjacent-frame cooperation. The trajectory embedding is then propagated through a temporal detection refinement module to mine salient target locations in the temporal domain. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that our STCMOT sets a new state-of-the-art performance in the MOTA and IDF1 metrics. The source code is released at this https URL.
https://arxiv.org/abs/2409.11234
We present SuperCoder2.0, an advanced autonomous system designed to enhance software development through artificial intelligence. The system combines an AI-native development approach with intelligent agents to enable fully autonomous coding. Key focus areas include a retry mechanism with error-output traceback, comprehensive code rewriting and replacement using Abstract Syntax Tree (ast) parsing to minimize linting issues, a code embedding technique for retrieval-augmented generation, and a focus on localizing methods for problem-solving rather than identifying specific line numbers. The methodology employs a three-step hierarchical search space reduction approach for code base navigation and bug localization: (1) utilizing Retrieval Augmented Generation (RAG) and a Repository File Level Map to identify candidate files, (2) narrowing down to the most relevant files using a File Level Schematic Map, and (3) extracting 'relevant locations' within these files. Code editing is performed through a two-part module comprising CodeGeneration and CodeEditing, which generates multiple solutions at different temperature values and replaces entire methods or classes to maintain code integrity. A feedback loop executes repository-level test cases to validate and refine solutions. Experiments conducted on the SWE-bench Lite dataset demonstrate SuperCoder2.0's effectiveness, achieving correct file localization in 84.33% of cases within the top 5 candidates and successfully resolving 34% of test instances. This performance places SuperCoder2.0 fourth globally on the SWE-bench leaderboard. The system's ability to handle diverse repositories and problem types highlights its potential as a versatile tool for autonomous software development. Future work will focus on refining the code editing process and exploring advanced embedding models for improved natural language to code mapping.
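A hedged sketch of the first localization step (candidate-file shortlisting by embedding similarity) is shown below; the sentence-embedding model, the truncation length, and the top-k value are stand-ins, not SuperCoder2.0's actual configuration:

```python
# Embed repository files and an issue description, then shortlist candidate
# files by cosine similarity before any finer, method-level analysis.
import pathlib
import numpy as np
from sentence_transformers import SentenceTransformer

def shortlist_files(repo_root, issue_text, top_k=5, model_name="all-MiniLM-L6-v2"):
    encoder = SentenceTransformer(model_name)
    paths = list(pathlib.Path(repo_root).rglob("*.py"))
    docs = [p.read_text(errors="ignore")[:4000] for p in paths]   # truncate long files
    file_emb = encoder.encode(docs, normalize_embeddings=True)
    issue_emb = encoder.encode([issue_text], normalize_embeddings=True)
    scores = (file_emb @ issue_emb.T).ravel()
    ranked = np.argsort(-scores)[:top_k]
    return [(str(paths[i]), float(scores[i])) for i in ranked]

# print(shortlist_files("path/to/repo", "TypeError raised when parsing empty config"))
```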
https://arxiv.org/abs/2409.11190