Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce Lexicon-based EmbeddiNgS (LENS), the first lexicon-based embeddings leveraging LLMs to achieve competitive performance on these tasks. To address the inherent tokenization redundancy and the unidirectional-attention limitations of traditional causal LLMs, LENS consolidates the vocabulary space through token-embedding clustering, and investigates bidirectional attention and various pooling strategies. Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, in which semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations that match the sizes of their dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).
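A minimal sketch of the lexicon-embedding idea above, under stated assumptions: the LM-head token embeddings are clustered with k-means so that each output dimension corresponds to one token cluster, and per-token logits are max-pooled into their cluster's dimension. The shapes, the max-pooling, and the log-saturation activation are illustrative choices, not the authors' released implementation.

```python
import torch
from sklearn.cluster import KMeans

# Toy stand-ins: a 1,000-token vocabulary with 64-d output embeddings,
# consolidated into 128 clusters -> one embedding dimension per cluster.
vocab_size, hidden, n_clusters = 1000, 64, 128

token_emb = torch.randn(vocab_size, hidden)            # stand-in LM-head rows
labels = KMeans(n_clusters=n_clusters, n_init=10).fit(token_emb.numpy()).labels_
cluster_ids = torch.tensor(labels, dtype=torch.long)   # token -> cluster id

def lens_embed(hidden_state: torch.Tensor) -> torch.Tensor:
    """Map a pooled hidden state to an n_clusters-dim lexicon embedding."""
    logits = hidden_state @ token_emb.T                # per-token relevance
    out = torch.full((n_clusters,), float("-inf"))
    # Max-pool logits of tokens that share a cluster into one dimension.
    out = out.scatter_reduce(0, cluster_ids, logits, reduce="amax")
    return torch.log1p(torch.relu(out))                # sparse, non-negative

print(lens_embed(torch.randn(hidden)).shape)           # torch.Size([128])
```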
https://arxiv.org/abs/2501.09749
With the increased use of the internet and social networks for online discussion, the spread of toxic and inappropriate content on social networking sites has also increased. Several studies have been conducted in different languages; however, there is comparatively little work on inappropriate content identification for South Asian languages using deep learning techniques. In Urdu, spellings are not unique, and people commonly write the same word with different spellings; mixing in other languages, such as English, makes the text even more challenging, and limited research is available on processing such language with state-of-the-art algorithms. Using an attention layer with a deep learning model can help handle long-term dependencies and increase its efficiency. To explore the effect of the attention layer, this study proposes an attention-based bidirectional GRU hybrid model for identifying inappropriate content in Urdu Unicode text. Four baseline deep learning models (LSTM, Bi-LSTM, GRU, and TCN) are used to compare the performance of the proposed model. The models are compared in terms of evaluation metrics, dataset size, and the impact of the word embedding layer, using pre-trained Urdu word2Vec embeddings. Our proposed model, BiGRU-A, outperformed all other baseline models, yielding 84% accuracy without the pre-trained word2Vec layer. From our experiments, we establish that the attention layer improves the model's efficiency and that pre-trained word2Vec embeddings do not work well with an inappropriate-content dataset.
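A hedged sketch of what an attention-based bidirectional GRU classifier of this kind typically looks like in PyTorch; the vocabulary size, layer widths, and binary output are placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    """Bi-GRU whose hidden states are pooled by a learned attention layer."""
    def __init__(self, vocab_size=30000, emb_dim=128, hid=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # trained from scratch
        self.gru = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * hid, 1)               # scores each time step
        self.out = nn.Linear(2 * hid, n_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        h, _ = self.gru(self.emb(token_ids))           # (batch, seq, 2*hid)
        w = torch.softmax(self.att(h), dim=1)          # attention weights
        ctx = (w * h).sum(dim=1)                       # weighted-sum pooling
        return self.out(ctx)

model = BiGRUAttention()
logits = model(torch.randint(0, 30000, (8, 50)))       # 8 comments, 50 tokens
print(logits.shape)                                    # torch.Size([8, 2])
```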
https://arxiv.org/abs/2501.09722
Metric learning projects samples into an embedded space, where similarities and dissimilarities are quantified based on their learned representations. However, existing methods often rely on label-guided representation learning, where representations of different modalities, such as audio and visual data, are aligned based on annotated labels. This approach tends to underutilize latent complex features and potential relationships inherent in the distributions of audio and visual data that are not directly tied to the labels, resulting in suboptimal performance in audio-visual embedding learning. To address this issue, we propose a novel architecture that integrates cross-modal triplet loss with progressive self-distillation. Our method enhances representation learning by leveraging inherent distributions and dynamically refining soft audio-visual alignments -- probabilistic alignments between audio and visual data that capture the inherent relationships beyond explicit labels. Specifically, the model distills audio-visual distribution-based knowledge from annotated labels in a subset of each batch. This self-distilled knowledge is used to progressively refine the soft audio-visual alignments, enabling the model to exploit structure in the data beyond the explicit labels and to better capture latent cross-modal relationships, improving overall audio-visual embedding performance.
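A compact sketch of a cross-modal triplet objective consistent with the description above: audio anchors are pulled toward visual embeddings that share a label and pushed away from the hardest non-matching visual embedding. The margin, mining rule, and batch shapes are illustrative assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_triplet(audio, visual, labels, margin=0.2):
    """Triplet loss where anchors and positives/negatives come from
    different modalities (audio anchors vs. visual candidates)."""
    a = F.normalize(audio, dim=1)
    v = F.normalize(visual, dim=1)
    sim = a @ v.T                                  # (B, B) cosine similarities
    same = labels[:, None] == labels[None, :]      # positive-pair mask
    pos = (sim * same).sum(1) / same.sum(1)        # mean positive similarity
    neg = sim.masked_fill(same, float("-inf")).max(1).values  # hardest negative
    return F.relu(neg - pos + margin).mean()

audio = torch.randn(16, 256)                       # per-clip audio embeddings
visual = torch.randn(16, 256)                      # paired visual embeddings
labels = torch.randint(0, 4, (16,))                # class annotations
print(cross_modal_triplet(audio, visual, labels))
```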
https://arxiv.org/abs/2501.09608
Meshes are used to represent complex objects in high-fidelity physics simulators across a variety of domains, such as radar sensing and aerodynamics. There is growing interest in using neural networks to accelerate physics simulations, and also a growing body of work on applying neural networks directly to irregular mesh data. Since multiple mesh topologies can represent the same object, mesh augmentation is typically required to handle topological variation when training neural networks. Due to the sensitivity of physics simulators to small changes in mesh shape, it is challenging to use these augmentations when training neural network-based physics simulators. In this work, we show that variations in mesh topology can significantly reduce the performance of neural network simulators. We evaluate whether pretraining can be used to address this issue, and find that employing an established autoencoder pretraining technique with graph embedding models reduces the sensitivity of neural network simulators to variations in mesh topology. Finally, we highlight future research directions that may further reduce neural simulator sensitivity to mesh topology.
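A toy sketch, under stated assumptions, of autoencoder-style pretraining on mesh data: a single message-passing layer embeds vertices and an MLP decoder reconstructs their coordinates, so the learned embedding depends less on any particular triangulation. The layer sizes, random adjacency, and MSE loss are illustrative, not the paper's setup.

```python
import torch
import torch.nn as nn

class MeshAutoencoder(nn.Module):
    """One message-passing layer + MLP decoder over vertex coordinates."""
    def __init__(self, dim=32):
        super().__init__()
        self.enc = nn.Linear(3, dim)
        self.msg = nn.Linear(dim, dim)
        self.dec = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, xyz, adj):                    # adj: row-normalized (N, N)
        h = torch.relu(self.enc(xyz))
        h = torch.relu(h + adj @ self.msg(h))       # aggregate neighbor features
        return self.dec(h), h                       # reconstruction, embedding

n = 200                                             # vertices of one toy mesh
xyz = torch.randn(n, 3)
adj = torch.rand(n, n).round()                      # random stand-in "edges"
adj = adj / adj.sum(1, keepdim=True).clamp(min=1)   # row-normalize

model = MeshAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):                                  # a few pretraining steps
    recon, _ = model(xyz, adj)
    loss = nn.functional.mse_loss(recon, xyz)
    opt.zero_grad(); loss.backward(); opt.step()
print(loss.item())
```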
https://arxiv.org/abs/2501.09597
Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data. Methods: Our approach has two key components. First, few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs. Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks. Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at this https URL.
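One plausible reading of the few-shot modality-alignment step, sketched here as a ridge-regularized least-squares map from a handful of image embeddings to their paired text embeddings; the closed-form solve and all dimensions are our illustrative assumptions, not necessarily Surg-FTDA's procedure.

```python
import torch

d_img, d_txt, k = 512, 512, 32                 # embedding dims, few-shot size

img = torch.randn(k, d_img)                    # k annotated image embeddings
txt = torch.randn(k, d_txt)                    # their paired text embeddings

# Least-squares linear map W such that img @ W ~= txt (ridge for stability,
# since k << d_img makes the plain normal equations singular).
lam = 1e-2
W = torch.linalg.solve(img.T @ img + lam * torch.eye(d_img), img.T @ txt)

def align(image_embeddings: torch.Tensor) -> torch.Tensor:
    """Project image embeddings into the text space the decoder was trained on."""
    return image_embeddings @ W

test_img = torch.randn(5, d_img)
print(align(test_img).shape)                   # torch.Size([5, 512]) -> decoder
```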
https://arxiv.org/abs/2501.09555
The meanings and relationships of words shift over time. This phenomenon is referred to as semantic shift. Research focused on understanding how semantic shifts occur over multiple time periods is essential for gaining a detailed understanding of semantic shifts. However, detecting change points only between adjacent time periods is insufficient for analyzing detailed semantic shifts, and using BERT-based methods to examine word sense proportions incurs a high computational cost. To address those issues, we propose a simple yet intuitive framework for analyzing how semantic shifts occur over multiple time periods by leveraging a similarity matrix between the embeddings of the same word through time. We compute a diachronic word similarity matrix using fast and lightweight word embeddings across arbitrary time periods, enabling deeper analysis of continuous semantic shifts. Moreover, by clustering the similarity matrices for different words, we can categorize words that exhibit similar behavior of semantic shift in an unsupervised manner.
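The matrix construction lends itself to a short sketch with made-up embeddings: for each word, compute cosine similarities between its per-period embeddings, then cluster the flattened upper triangles to group words with similar shift patterns. KMeans and the toy vocabulary are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
T, d = 6, 50                                  # time periods, embedding dim
words = {w: rng.normal(size=(T, d)) for w in ["gay", "cell", "web", "mouse"]}

def similarity_matrix(embs):                  # (T, d) per-period embeddings
    """T x T cosine similarities of one word with itself across periods."""
    e = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return e @ e.T

mats = {w: similarity_matrix(e) for w, e in words.items()}
X = np.stack([m[np.triu_indices(T, k=1)] for m in mats.values()])  # features
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(dict(zip(mats, labels)))                # words grouped by shift pattern
```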
https://arxiv.org/abs/2501.09538
The success of VLMs often relies on the dynamic high-resolution schema that adaptively augments the input images to multiple crops, so that the details of the images can be retained. However, such approaches result in a large number of redundant visual tokens, thus significantly reducing the efficiency of the VLMs. To improve the VLMs' efficiency without introducing extra training costs, many research works have been proposed to reduce the visual tokens by filtering out uninformative visual tokens or aggregating their information. Some approaches reduce the visual tokens according to the self-attention of VLMs, which is biased and results in inaccurate responses. Token reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are non-salient in the image. In this work, we first conduct experiments showing that the original text embeddings are aligned with the visual tokens, without bias toward the tailed visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically leverages the effectiveness of visual saliency and text-to-image similarity in the pre-LLM layers to select the visual tokens that are informative. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration performance, especially when the reduction rate is sufficiently large.
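A hedged sketch of the selection rule described above: each visual token is scored by a mixture of visual saliency and its best text-to-image similarity, and the top-k tokens are kept. The paper makes the mixture self-adaptive; the fixed mixing weight alpha and all shapes here are simplifications.

```python
import torch

def select_visual_tokens(vis, txt, saliency, keep=64, alpha=0.5):
    """vis: (N, d) visual tokens, txt: (M, d) text embeddings,
    saliency: (N,) visual-saliency scores; returns the kept tokens."""
    v = torch.nn.functional.normalize(vis, dim=1)
    t = torch.nn.functional.normalize(txt, dim=1)
    text_score = (v @ t.T).max(dim=1).values        # best match to any query token
    sal = (saliency - saliency.min()) / (saliency.max() - saliency.min() + 1e-6)
    score = alpha * sal + (1 - alpha) * text_score  # cross-modality mixture
    idx = score.topk(keep).indices.sort().values    # keep original token order
    return vis[idx]

vis = torch.randn(576, 1024)                        # e.g. 24x24 image patches
txt = torch.randn(12, 1024)                         # question token embeddings
saliency = torch.rand(576)                          # e.g. CLS-attention scores
print(select_visual_tokens(vis, txt, saliency).shape)  # torch.Size([64, 1024])
```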
https://arxiv.org/abs/2501.09532
In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at this https URL.
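A minimal sketch of a reward-free trajectory encoder in the spirit described above: states and actions are concatenated per step and summarized by a GRU into one latent vector. The architecture and dimensions are illustrative assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """Embed a state-action trajectory into a single latent vector."""
    def __init__(self, state_dim=8, action_dim=2, latent=32):
        super().__init__()
        self.gru = nn.GRU(state_dim + action_dim, latent, batch_first=True)

    def forward(self, states, actions):            # (B, T, s), (B, T, a)
        _, h = self.gru(torch.cat([states, actions], dim=-1))
        return h.squeeze(0)                        # (B, latent); no rewards used

enc = TrajectoryEncoder()
z = enc(torch.randn(4, 100, 8), torch.randn(4, 100, 2))
print(z.shape)                                     # torch.Size([4, 32])
# The latent z can then feed imitation, classification, clustering, or regression.
```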
https://arxiv.org/abs/2501.09327
Short text classification has gained significant attention in the information age due to its prevalence and real-world applications. Recent advancements in graph learning combined with contrastive learning have shown promising results in addressing the challenges of semantic sparsity and limited labeled data in short text classification. However, existing models have certain limitations. They rely on explicit data augmentation techniques to generate contrastive views, resulting in semantic corruption and noise. Additionally, these models only focus on learning the intrinsic consistency between the generated views, neglecting valuable discriminative information from other potential views. To address these issues, we propose a Simple graph contrastive learning framework for Short Text Classification (SimSTC). Our approach involves performing graph learning on multiple text-related component graphs to obtain multi-view text embeddings. Subsequently, we directly apply contrastive learning on these embeddings. Notably, our method eliminates the need for data augmentation operations to generate contrastive views while still leveraging the benefits of multi-view contrastive learning. Despite its simplicity, our model achieves outstanding performance, surpassing large language models on various datasets.
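A minimal sketch of augmentation-free multi-view contrastive learning in the spirit of SimSTC: embeddings of the same texts obtained from two different component graphs are treated as positives in a symmetric InfoNCE loss. The random view embeddings and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def multi_view_info_nce(view_a, view_b, tau=0.1):
    """InfoNCE between two views of the same batch of short texts;
    matching rows are positives, all other rows are negatives."""
    a = F.normalize(view_a, dim=1)
    b = F.normalize(view_b, dim=1)
    logits = a @ b.T / tau                       # (B, B) similarity matrix
    targets = torch.arange(a.size(0))            # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

word_view = torch.randn(32, 128)    # text embeddings from, e.g., a word graph
pos_view = torch.randn(32, 128)     # same texts from another component graph
print(multi_view_info_nce(word_view, pos_view))
```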
https://arxiv.org/abs/2501.09219
Large Language Models (LLMs) have revolutionized artificial intelligence (AI) by enabling human-like text generation and natural language understanding. However, their reliance on static training data limits their ability to respond to dynamic, real-time queries, resulting in outdated or inaccurate outputs. Retrieval-Augmented Generation (RAG) has emerged as a solution, enhancing LLMs by integrating real-time data retrieval to provide contextually relevant and up-to-date responses. Despite its promise, traditional RAG systems are constrained by static workflows and lack the adaptability required for multi-step reasoning and complex task management. Agentic Retrieval-Augmented Generation (Agentic RAG) transcends these limitations by embedding autonomous AI agents into the RAG pipeline. These agents leverage agentic design patterns (reflection, planning, tool use, and multi-agent collaboration) to dynamically manage retrieval strategies, iteratively refine contextual understanding, and adapt workflows to meet complex task requirements. This integration enables Agentic RAG systems to deliver unparalleled flexibility, scalability, and context awareness across diverse applications. This survey provides a comprehensive exploration of Agentic RAG, beginning with its foundational principles and the evolution of RAG paradigms. It presents a detailed taxonomy of Agentic RAG architectures, highlights key applications in industries such as healthcare, finance, and education, and examines practical implementation strategies. Additionally, it addresses challenges in scaling these systems, ensuring ethical decision making, and optimizing performance for real-world applications, while providing detailed insights into frameworks and tools for implementing Agentic RAG.
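A toy sketch of the agentic control flow described (tool use, reflection, and query re-planning around retrieval), with the retriever and LLM stubbed out so only the loop structure is shown; no real framework, model, or API is implied.

```python
def retrieve(query: str) -> list[str]:          # stub vector-store lookup
    corpus = {"rag": "RAG augments LLMs with retrieved context.",
              "agent": "Agents plan, call tools, and reflect on results."}
    return [v for k, v in corpus.items() if k in query.lower()]

def llm(prompt: str) -> str:                    # stub model call
    return "INSUFFICIENT" if "context: []" in prompt else "Answer grounded in context."

def agentic_rag(question: str, max_steps: int = 3) -> str:
    query = question
    for _ in range(max_steps):
        docs = retrieve(query)                           # tool use
        answer = llm(f"question: {question} context: {docs}")
        if answer != "INSUFFICIENT":                     # reflection step
            return answer
        query = llm(f"rewrite query for retrieval: {question}")  # planning step
    return "Could not ground an answer."

print(agentic_rag("How do agents extend RAG?"))
```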
https://arxiv.org/abs/2501.09136
Foundation models (FMs) have shown transformative potential in radiology by performing diverse, complex tasks across imaging modalities. Here, we developed CT-FM, a large-scale 3D image-based pre-trained model designed explicitly for various radiological tasks. CT-FM was pre-trained using 148,000 computed tomography (CT) scans from the Imaging Data Commons through label-agnostic contrastive learning. We evaluated CT-FM across four categories of tasks, namely, whole-body and tumor segmentation, head CT triage, medical image retrieval, and semantic understanding, showing superior performance against state-of-the-art models. Beyond quantitative success, CT-FM demonstrated the ability to cluster regions anatomically and identify similar anatomical and structural concepts across scans. Furthermore, it remained robust across test-retest settings and indicated reasonable salient regions attached to its embeddings. This study demonstrates the value of large-scale medical imaging foundation models and by open-sourcing the model weights, code, and data, aims to support more adaptable, reliable, and interpretable AI solutions in radiology.
https://arxiv.org/abs/2501.09001
3D scene generation has garnered growing attention in recent years and has made significant progress. Generating 4D cities is more challenging than 3D scenes due to the presence of structurally complex, visually diverse objects like buildings and vehicles, and heightened human sensitivity to distortions in urban environments. To tackle these issues, we propose CityDreamer4D, a compositional generative model specifically tailored for generating unbounded 4D cities. Our main insights are 1) 4D city generation should separate dynamic objects (e.g., vehicles) from static scenes (e.g., buildings and roads), and 2) all objects in the 4D scene should be composed of different types of neural fields for buildings, vehicles, and background stuff. Specifically, we propose Traffic Scenario Generator and Unbounded Layout Generator to produce dynamic traffic scenarios and static city layouts using a highly compact BEV representation. Objects in 4D cities are generated by combining stuff-oriented and instance-oriented neural fields for background stuff, buildings, and vehicles. To suit the distinct characteristics of background stuff and instances, the neural fields employ customized generative hash grids and periodic positional embeddings as scene parameterizations. Furthermore, we offer a comprehensive suite of datasets for city generation, including OSM, GoogleEarth, and CityTopia. The OSM dataset provides a variety of real-world city layouts, while the Google Earth and CityTopia datasets deliver large-scale, high-quality city imagery complete with 3D instance annotations. Leveraging its compositional design, CityDreamer4D supports a range of downstream applications, such as instance editing, city stylization, and urban simulation, while delivering state-of-the-art performance in generating realistic 4D cities.
https://arxiv.org/abs/2501.08983
In this paper, we propose a novel cross-attention-based generative adversarial network (GAN) for the challenging person image generation task. Cross-attention is a novel and intuitive multi-modal fusion method in which an attention/correlation matrix is calculated between two feature maps of different modalities. Specifically, we propose the novel XingGAN (or CrossingGAN), which consists of two generation branches that capture the person's appearance and shape, respectively. Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement. This has not been considered by any other existing GAN-based image generation work. To further learn the long-range correlations between different person poses at different scales and sub-regions, we propose two novel multi-scale cross-attention blocks. To tackle the issue of independent correlation computations within the cross-attention mechanism leading to noisy and ambiguous attention weights, which hinder performance improvements, we propose a module called enhanced attention (EA). Lastly, we introduce a novel densely connected co-attention module to fuse appearance and shape features at different stages effectively. Extensive experiments on two public datasets demonstrate that the proposed method outperforms current GAN-based methods and performs on par with diffusion-based methods. However, our method is significantly faster than diffusion-based methods in both training and inference.
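A hedged sketch of a generic cross-attention block of the kind described, where shape features query appearance features and the attention/correlation matrix between the two modalities is returned; the dimensions and head count are illustrative, and this is not the paper's exact block.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Shape features query appearance features (or vice versa)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, shape_feat, appear_feat):
        # The attention/correlation matrix between the two modalities is
        # computed inside MultiheadAttention from Q=shape, K=V=appearance.
        updated, attn_weights = self.attn(shape_feat, appear_feat, appear_feat)
        return self.norm(shape_feat + updated), attn_weights

B, HW, C = 2, 64 * 32, 128                     # flattened 64x32 feature maps
shape_feat, appear_feat = torch.randn(B, HW, C), torch.randn(B, HW, C)
out, attn = CrossAttentionBlock()(shape_feat, appear_feat)
print(out.shape, attn.shape)                   # (2, 2048, 128) (2, 2048, 2048)
```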
https://arxiv.org/abs/2501.08900
UV map estimation is used in computer vision for detailed analysis of human posture or activity. Previous methods assign pixels to body model vertices by comparing pixel descriptors independently, without enforcing global coherence or plausibility in the UV map. We propose Pose-Constrained Continuous Surface Embeddings (PC-CSE), which integrates estimated 2D human pose into the pixel-to-vertex assignment process. The pose provides global anatomical constraints, ensuring that UV maps remain coherent while preserving local precision. Evaluation on DensePose COCO demonstrates consistent improvement, regardless of the chosen 2D human pose model. Whole-body poses offer better constraints by incorporating additional details about the hands and feet. Conditioning UV maps with human pose reduces invalid mappings and enhances anatomical plausibility. In addition, we highlight inconsistencies in the ground-truth annotations.
https://arxiv.org/abs/2501.08815
Analyzing large-scale datasets, especially involving complex and high-dimensional data like images, is particularly challenging. While self-supervised learning (SSL) has proven effective for learning representations from unlabelled data, it typically focuses on flat, non-hierarchical structures, missing the multi-level relationships present in many real-world datasets. Hierarchical clustering (HC) can uncover these relationships by organizing data into a tree-like structure, but it often relies on rigid similarity metrics that struggle to capture the complexity of diverse data types. To address these limitations, we envision $\texttt{InfoHier}$, a framework that combines SSL with HC to jointly learn robust latent representations and hierarchical structures. This approach leverages SSL to provide adaptive representations, enhancing HC's ability to capture complex patterns. Simultaneously, it integrates HC loss to refine SSL training, resulting in representations that are more attuned to the underlying information hierarchy. $\texttt{InfoHier}$ has the potential to improve the expressiveness and performance of both clustering and representation learning, offering significant benefits for data analysis, management, and information retrieval.
https://arxiv.org/abs/2501.08717
Unsupervised representation learning has significantly advanced various machine learning tasks. In the computer vision domain, state-of-the-art approaches utilize transformations like random crop and color jitter to achieve invariant representations, embedding semantically the same inputs despite transformations. However, this can degrade performance in tasks requiring precise features, such as localization or flower classification. To address this, recent research incorporates equivariant representation learning, which captures transformation-sensitive information. However, current methods depend on transformation labels and thus struggle with interdependency and complex transformations. We propose Self-supervised Transformation Learning (STL), replacing transformation labels with transformation representations derived from image pairs. The proposed method ensures transformation representation is image-invariant and learns corresponding equivariant transformations, enhancing performance without increased batch complexity. We demonstrate the approach's effectiveness across diverse classification and detection tasks, outperforming existing methods in 7 out of 11 benchmarks and excelling in detection. By integrating complex transformations like AugMix, unusable by prior equivariant methods, this approach enhances performance across tasks, underscoring its adaptability and resilience. Additionally, its compatibility with various base models highlights its flexibility and broad applicability. The code is available at this https URL.
https://arxiv.org/abs/2501.08712
The scarcity of speaker-annotated far-field speech presents a significant challenge in developing high-performance far-field speaker verification (SV) systems. While data augmentation using large-scale near-field speech has been a common strategy to address this limitation, the mismatch in acoustic environments between near-field and far-field speech significantly hinders the improvement of far-field SV effectiveness. In this paper, we propose an adaptive speech augmentation approach leveraging NaturalSpeech3, a pre-trained foundation text-to-speech (TTS) model, to convert near-field speech into far-field speech by incorporating far-field acoustic ambient noise for data augmentation. Specifically, we utilize FACodec from NaturalSpeech3 to decompose the speech waveform into distinct embedding subspaces (content, prosody, speaker, and residual acoustic-detail embeddings) and reconstruct the speech waveform from these disentangled representations. In our method, the prosody, content, and residual embeddings of far-field speech are combined with speaker embeddings from near-field speech to generate augmented pseudo far-field speech that maintains the speaker identity of the out-domain near-field speech while preserving the acoustic environment of the in-domain far-field speech. This approach not only serves as an effective strategy for augmenting training data for far-field speaker verification but also extends to cross-data augmentation for enrollment and test speech in evaluation sets. Experimental results on FFSVC demonstrate that the adaptive data augmentation method significantly outperforms traditional approaches, such as random noise addition and reverberation, as well as other competitive data augmentation strategies.
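The recombination step can be sketched compactly. The encode/decode functions below are hypothetical stand-ins for FACodec's interface, used only to show which embeddings come from which domain; consult the NaturalSpeech3 release for the real API.

```python
import torch

def encode(waveform):
    """Hypothetical FACodec-style decomposition into four subspaces."""
    torch.manual_seed(int(waveform.sum().abs() * 1e3) % 2**31)  # toy determinism
    return {k: torch.randn(1, 64) for k in ("content", "prosody", "speaker", "residual")}

def decode(parts):
    """Hypothetical reconstruction from disentangled embeddings (vocoder stand-in)."""
    return torch.cat(list(parts.values()), dim=1)

near = encode(torch.randn(16000))   # near-field utterance (supplies identity)
far = encode(torch.randn(16000))    # far-field utterance (supplies acoustics)

# Keep far-field content/prosody/residual, swap in the near-field speaker:
pseudo_far = decode({"content": far["content"],
                     "prosody": far["prosody"],
                     "speaker": near["speaker"],      # out-domain identity
                     "residual": far["residual"]})    # in-domain acoustics
print(pseudo_far.shape)
```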
https://arxiv.org/abs/2501.08691
This paper proposes the ViT Token Constraint and Multi-scale Memory bank (TCMM) method to address patch noise and feature inconsistency in unsupervised person re-identification. Many excellent methods use ViT features to obtain pseudo labels and clustering prototypes, then train the model with contrastive learning. However, ViT processes images by performing patch embedding, which inevitably introduces noise in patches and may compromise the performance of the re-identification model. On the other hand, previous memory-bank-based contrastive methods may lead to data inconsistency due to the limitation of batch size. Furthermore, existing pseudo-label methods often discard outlier samples that are difficult to cluster. This sacrifices the potential value of outlier samples, leading to limited model diversity and robustness. This paper introduces the ViT Token Constraint to mitigate the damage caused by patch noise to the ViT architecture. The proposed Multi-scale Memory enhances the exploration of outlier samples and maintains feature consistency. Experimental results demonstrate that our system achieves state-of-the-art performance on common benchmarks. The project is available at this https URL.
https://arxiv.org/abs/2501.09044
This paper introduces a novel approach to enhance the performance of Gaussian Shading, a prevalent watermarking technique, by integrating the Exact Diffusion Inversion via Coupled Transformations (EDICT) framework. While Gaussian Shading traditionally embeds watermarks in a noise latent space, followed by iterative denoising for image generation and noise addition for watermark recovery, its inversion process is not exact, leading to potential watermark distortion. We propose to leverage EDICT's ability to derive exact inverse mappings to refine this process. Our method involves duplicating the watermark-infused noisy latent and employing a reciprocal, alternating denoising and noising scheme between the two latents, facilitated by EDICT. This allows for a more precise reconstruction of both the image and the embedded watermark. Empirical evaluation on standard datasets demonstrates that our integrated approach yields a slight, yet statistically significant improvement in watermark recovery fidelity. These results highlight the potential of EDICT to enhance existing diffusion-based watermarking techniques by providing a more accurate and robust inversion mechanism. To the best of our knowledge, this is the first work to explore the synergy between EDICT and Gaussian Shading for digital watermarking, opening new avenues for research in robust and high-fidelity watermark embedding and extraction.
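A schematic sketch of EDICT's coupled, exactly invertible update with a toy noise predictor: each step is affine in one latent given the other, so denoising can be undone algebraically, which is what makes the recovered watermark latent faithful. The coefficients follow the general EDICT form but are illustrative constants here, not a real diffusion schedule.

```python
import torch

def eps(z, t):                       # toy noise predictor standing in for a UNet
    return torch.tanh(z) * (0.1 * t)

def edict_denoise_step(x, y, t, a=0.98, b=0.02, p=0.93):
    x_i = a * x + b * eps(y, t)      # each update is affine given the other latent
    y_i = a * y + b * eps(x_i, t)
    x_n = p * x_i + (1 - p) * y_i    # mixing keeps the two latents close
    y_n = p * y_i + (1 - p) * x_n
    return x_n, y_n

def edict_noise_step(x_n, y_n, t, a=0.98, b=0.02, p=0.93):
    y_i = (y_n - (1 - p) * x_n) / p  # exact algebraic inverse of the mixing
    x_i = (x_n - (1 - p) * y_i) / p
    y = (y_i - b * eps(x_i, t)) / a  # exact inverse of the affine updates
    x = (x_i - b * eps(y, t)) / a
    return x, y

x0 = torch.randn(4, 4)               # watermark-infused noisy latent
x, y = x0.clone(), x0.clone()        # duplicated latent, as in the method
for t in range(10, 0, -1):
    x, y = edict_denoise_step(x, y, t)
for t in range(1, 11):
    x, y = edict_noise_step(x, y, t)
print(torch.allclose(x, x0, atol=1e-5), torch.allclose(y, x0, atol=1e-5))
```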
https://arxiv.org/abs/2501.08604
We propose GOTLoc, a robust localization method capable of operating even in outdoor environments where GPS signals are unavailable. The method achieves this robust localization by leveraging comparisons between scene graphs generated from text descriptions and maps. Existing text-based localization studies typically represent maps as point clouds and identify the most similar scenes by comparing embeddings of text and point cloud data. However, point cloud maps have limited scalability as it is impractical to pre-generate maps for all outdoor spaces. Furthermore, their large data size makes it challenging to store and utilize them directly on actual robots. To address these issues, GOTLoc leverages compact data structures, such as scene graphs, to store spatial information, enabling individual robots to carry and utilize large amounts of map data. Additionally, by utilizing publicly available map data, such as OpenStreetMap, which provides global information on outdoor spaces, we eliminate the need for additional effort to create custom map data. For performance evaluation, we utilized the KITTI360Pose dataset in conjunction with corresponding OpenStreetMap data to compare the proposed method with existing approaches. Our results demonstrate that the proposed method achieves accuracy comparable to algorithms relying on point cloud maps. Moreover, in city-scale tests, GOTLoc required significantly less storage compared to point cloud-based methods and completed overall processing within a few seconds, validating its applicability to real-world robotics. Our code is available at this https URL.
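A minimal sketch of scene-graph-based place matching in the spirit of GOTLoc: graphs are reduced to sets of (subject, relation, object) triplets and compared by Jaccard overlap against per-cell graphs derived from map data. The triplet representation and similarity measure are illustrative simplifications, not the paper's pipeline.

```python
def graph_similarity(g1: set, g2: set) -> float:
    """Jaccard overlap of (subject, relation, object) triplets."""
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

# Scene graph parsed from a text description of the robot's surroundings.
query = {("bench", "near", "tree"), ("building", "left_of", "road")}

# Compact per-cell scene graphs precomputed from, e.g., OpenStreetMap data.
map_cells = {
    "cell_17": {("bench", "near", "tree"), ("fountain", "in", "park")},
    "cell_42": {("car", "on", "road")},
}
best = max(map_cells, key=lambda c: graph_similarity(query, map_cells[c]))
print(best)   # cell_17 -> the retrieved location hypothesis
```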
https://arxiv.org/abs/2501.08575