Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
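The "squeeze-and-excitation" idea behind the FA block can be sketched in a few lines: pool token embeddings into per-feature statistics, pass them through a small bottleneck MLP, and re-weight each embedding feature with the resulting sigmoid gates. This is a minimal numpy sketch, not the paper's implementation; the weight matrices `W1`/`W2` and the reduction ratio are hypothetical stand-ins for the block's learned parameters.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_attention(X, W1, W2):
    """Squeeze-and-excitation over embedding features.

    X  : (seq_len, d) token embeddings from a Siamese encoder.
    W1 : (d // r, d) reduction weights, W2 : (d, d // r) expansion weights
         (r is the reduction ratio; both are hypothetical stand-ins
         for learned parameters).
    """
    z = X.mean(axis=0)                           # squeeze: global stats per feature
    s = _sigmoid(W2 @ np.maximum(W1 @ z, 0.0))   # excitation: per-feature gates in (0, 1)
    return X * s                                 # emphasize features that matter

rng = np.random.default_rng(0)
X = rng.standard_normal((16, 8))        # 16 tokens, 8-dim embeddings
W1 = rng.standard_normal((4, 8)) * 0.1  # reduction ratio r = 2
W2 = rng.standard_normal((8, 4)) * 0.1
Y = feature_attention(X, W1, W2)
```

The gates multiply every token's value of a given feature by the same scalar, so the block models dependencies among embedding features rather than among words.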
https://arxiv.org/abs/2404.16776
In this paper, we propose a novel approach to camera and radar sensor fusion for 3D object detection in autonomous vehicle perception systems. Our approach builds on recent advances in deep learning and leverages the strengths of both sensors to improve object detection performance. Specifically, we extract 2D features from camera images using a state-of-the-art deep learning architecture and then apply a novel Cross-Domain Spatial Matching (CDSM) transformation to convert these features into 3D space. We then fuse them with the extracted radar data using a complementary fusion strategy to produce a final 3D object representation. To demonstrate the effectiveness of our approach, we evaluate it on the NuScenes dataset, comparing against both single-sensor baselines and current state-of-the-art fusion methods. Our results show that the proposed approach outperforms single-sensor solutions and competes directly with other top-level fusion methods.
https://arxiv.org/abs/2404.16548
Ensuring the safety alignment of Large Language Models (LLMs) is crucial to generating responses consistent with human values. Despite their ability to recognize and avoid harmful queries, LLMs are vulnerable to "jailbreaking" attacks, where carefully crafted prompts elicit toxic content. One category of jailbreak attacks reformulates the task as an adversarial attack that elicits an affirmative response from the LLM. However, GCG, the typical attack in this category, has a very limited attack success rate. In this study, to better study jailbreak attacks, we introduce the DSN (Don't Say No) attack, which prompts LLMs not only to generate affirmative responses but also, via a novel augmented objective, to suppress refusals. Another challenge in jailbreak attacks lies in evaluation, as it is difficult to assess the harmfulness of an attack directly and accurately. Existing evaluations such as refusal-keyword matching have their own limitations, yielding numerous false positive and false negative instances. To overcome this challenge, we propose an ensemble evaluation pipeline incorporating Natural Language Inference (NLI) contradiction assessment and two external LLM evaluators. Extensive experiments demonstrate the potency of DSN and the effectiveness of the ensemble evaluation compared to baseline methods.
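The ensemble evaluation idea (several imperfect judges voting, rather than relying on refusal-keyword matching alone) can be sketched as a simple majority vote. In the paper's pipeline the judges would be an NLI contradiction check and two external LLM evaluators; the judges below are toy placeholders, and the refusal keywords are illustrative only.

```python
def ensemble_harmfulness(response, judges):
    """Majority vote over independent harmfulness judges.

    judges : list of callables response -> bool (True = attack succeeded).
    """
    votes = [bool(j(response)) for j in judges]
    return sum(votes) > len(votes) / 2

# toy judges: a refusal-keyword matcher plus two stubbed "LLM" verdicts
refusal_keywords = ("i cannot", "i'm sorry", "as an ai")
kw_judge = lambda r: not any(k in r.lower() for k in refusal_keywords)
llm_judge_a = lambda r: "step 1" in r.lower()   # hypothetical stand-in
llm_judge_b = lambda r: len(r) > 20             # hypothetical stand-in
```

A single keyword matcher misfires on responses that refuse without the expected phrasing, or comply while apologizing; aggregating several signals reduces both false positives and false negatives.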
https://arxiv.org/abs/2404.16369
Linking a claim to grounded references is a critical ability to fulfill human demands for authentic and reliable information. Current studies are limited to specific tasks like information retrieval or semantic matching, where the claim-reference relationships are unique and fixed, while referential knowledge linking (RKL) in the real world can be much more diverse and complex. In this paper, we propose universal referential knowledge linking (URL), which aims to resolve diversified referential knowledge linking tasks with one unified model. To this end, we propose an LLM-driven task-instructed representation compression, as well as a multi-view learning approach, in order to effectively adapt the instruction-following and semantic-understanding abilities of LLMs to referential knowledge linking. Furthermore, we construct a new benchmark to evaluate the ability of models on referential knowledge linking tasks across different scenarios. Experiments demonstrate that universal RKL is challenging for existing approaches, while the proposed framework effectively resolves the task across various scenarios and therefore outperforms previous approaches by a large margin.
https://arxiv.org/abs/2404.16248
AutoGluon-Multimodal (AutoMM) is introduced as an open-source AutoML library designed specifically for multimodal learning. Distinguished by its exceptional ease of use, AutoMM enables fine-tuning of foundational models with just three lines of code. Supporting various modalities including image, text, and tabular data, both independently and in combination, the library offers a comprehensive suite of functionalities spanning classification, regression, object detection, semantic matching, and image segmentation. Experiments across diverse datasets and tasks showcase AutoMM's superior performance in basic classification and regression tasks compared to existing AutoML tools, while also demonstrating competitive results in advanced tasks, aligning with specialized toolboxes designed for such purposes.
https://arxiv.org/abs/2404.16233
Quality-Diversity (QD) approaches are a promising direction to develop open-ended processes as they can discover archives of high-quality solutions across diverse niches. While already successful in many applications, QD approaches usually rely on combining only one or two solutions to generate new candidate solutions. As observed in open-ended processes such as technological evolution, wisely combining a large diversity of these solutions could lead to more innovative solutions and potentially boost the productivity of QD search. In this work, we propose to exploit the pattern-matching capabilities of generative models to enable such efficient solution combinations. We introduce In-context QD, a framework of techniques that aim to elicit the in-context capabilities of pre-trained Large Language Models (LLMs) to generate interesting solutions using the QD archive as context. Applied to a series of common QD domains, In-context QD displays promising results compared to both QD baselines and similar strategies developed for single-objective optimization. Additionally, this result holds across multiple model parameter sizes and archive population sizes, as well as across domains with distinct characteristics, from BBO functions to policy search. Finally, we perform an extensive ablation that highlights the key prompt design considerations that encourage the generation of promising solutions for QD.
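"Using the QD archive as context" amounts to serializing archive elites into a few-shot prompt for the LLM. The sketch below shows one minimal way to do that; the archive schema (`descriptor`, `fitness`, `solution` fields), the elite-selection rule, and the prompt format are all simplifying assumptions, not the paper's actual templates.

```python
def build_qd_prompt(archive, instruction, k=3):
    """Format the k fittest archive elites as few-shot context for an LLM.

    archive : list of dicts with 'descriptor', 'fitness', 'solution' keys
              (a simplified stand-in for a QD archive).
    """
    elites = sorted(archive, key=lambda e: e["fitness"], reverse=True)[:k]
    lines = [
        f"descriptor={e['descriptor']} fitness={e['fitness']:.3f} solution={e['solution']}"
        for e in elites
    ]
    return "\n".join(lines) + "\n" + instruction
```

The prompt design choices ablated in the paper (which elites to show, in what order, with what fields) all live in this serialization step.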
https://arxiv.org/abs/2404.15794
Multimodal search has become increasingly important in providing users with a natural and effective way to express their search intentions. Images offer fine-grained details of the desired products, while text allows for easily incorporating search modifications. However, some existing multimodal search systems are unreliable and fail to address simple queries. The problem becomes harder with the large variability of natural language text queries, which may contain ambiguous, implicit, and irrelevant information. Addressing these issues may require systems with enhanced matching capabilities, reasoning abilities, and context-aware query parsing and rewriting. This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset. Additionally, we propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction. This interface routes queries to search systems while conversationally engaging with users and considering previous searches. When coupled with our multimodal search model, it heralds a new era of shopping assistants capable of offering human-like interaction and enhancing the overall search experience.
https://arxiv.org/abs/2404.15790
Bayesian flow networks (BFNs) iteratively refine the parameters of distributions at various noise levels through Bayesian inference, rather than the samples as in diffusion models (DMs). Owing to their differentiable nature, BFNs are promising for modeling both continuous and discrete data, while simultaneously maintaining fast sampling capabilities. This paper aims to understand and enhance BFNs by connecting them with DMs through stochastic differential equations (SDEs). We identify the linear SDEs corresponding to the noise-addition processes in BFNs, demonstrate that BFN's regression losses are aligned with denoising score matching, and validate the BFN sampler as a first-order solver for the corresponding reverse-time SDE. Based on these findings and existing recipes for fast sampling in DMs, we propose specialized solvers for BFNs that markedly surpass the original BFN sampler in terms of sample quality with a limited number of function evaluations (e.g., 10) on both image and text datasets. Notably, our best sampler achieves a 5-20x speedup for free. Our code is available at this https URL.
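Validating a sampler as a "first-order solver for the reverse-time SDE" means it matches an Euler-Maruyama step of dx = [f(t)x - g(t)^2 * score(x, t)] dt + g(t) dW integrated backward in time. A generic such step is sketched below; the toy drift, diffusion, and score functions are illustrative placeholders, not the BFN-derived quantities identified in the paper.

```python
import numpy as np

def reverse_sde_euler_step(x, t, dt, f, g, score, rng):
    """One Euler-Maruyama step of a linear reverse-time SDE,
        dx = [f(t) x - g(t)^2 * score(x, t)] dt + g(t) dW,
    integrated backward in time (dt < 0).
    f, g, score are caller-supplied drift, diffusion, and score functions.
    """
    drift = f(t) * x - g(t) ** 2 * score(x, t)
    noise = g(t) * np.sqrt(abs(dt)) * rng.standard_normal(x.shape)
    return x + drift * dt + noise

# toy VP-style coefficients and a standard-normal score, for illustration only
f = lambda t: -0.5 * t
g = lambda t: np.sqrt(t)
score = lambda x, t: -x
rng = np.random.default_rng(0)
x1 = reverse_sde_euler_step(np.ones(4), t=1.0, dt=-0.1, f=f, g=g, score=score, rng=rng)
```

Faster DM-style solvers replace this first-order update with higher-order or exponential-integrator steps, which is the kind of recipe the paper transfers to BFNs.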
https://arxiv.org/abs/2404.15766
Clinical trial matching is the task of identifying trials for which patients may be potentially eligible. Typically, this task is labor-intensive and requires detailed verification of patient electronic health records (EHRs) against the stringent inclusion and exclusion criteria of clinical trials. This process is manual, time-intensive, and challenging to scale up, resulting in many patients missing out on potential therapeutic options. Recent advancements in Large Language Models (LLMs) have made automating patient-trial matching possible, as shown in multiple concurrent research studies. However, the current approaches are confined to constrained, often synthetic datasets that do not adequately mirror the complexities encountered in real-world medical data. In this study, we present the first, end-to-end large-scale empirical evaluation of clinical trial matching using real-world EHRs. Our study showcases the capability of LLMs to accurately match patients with appropriate clinical trials. We perform experiments with proprietary LLMs, including GPT-4 and GPT-3.5, as well as our custom fine-tuned model called OncoLLM and show that OncoLLM, despite its significantly smaller size, not only outperforms GPT-3.5 but also matches the performance of qualified medical doctors. All experiments were carried out on real-world EHRs that include clinical notes and available clinical trials from a single cancer center in the United States.
https://arxiv.org/abs/2404.15549
Replicating the remarkable athleticism seen in animals has long been a challenge in robotics control. Although Reinforcement Learning (RL) has demonstrated significant progress in dynamic legged locomotion control, the substantial sim-to-real gap often hinders the real-world demonstration of truly dynamic movements. We propose a new framework to mitigate this gap through frequency-domain analysis-based impedance matching between simulated and real robots. Our framework offers a structured guideline for parameter selection and the range for dynamics randomization in simulation, thus facilitating a safe sim-to-real transfer. The learned policy using our framework enabled jumps across distances of 55 cm and heights of 38 cm. The results are, to the best of our knowledge, one of the highest and longest running jumps demonstrated by an RL-based control policy in a real quadruped robot. Note that the achieved jumping height is approximately 85% of that obtained from a state-of-the-art trajectory optimization method, which can be seen as the physical limit for the given robot hardware. In addition, our control policy accomplished stable walking at speeds up to 2 m/s in the forward and backward directions, and 1 m/s in the sideway direction.
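Frequency-domain impedance matching starts from the textbook definition Z(w) = F(w) / V(w), the ratio of force to velocity spectra; comparing simulated and real spectra gives a scalar sim-to-real gap to drive parameter selection. The sketch below uses that standard definition with a log-magnitude error metric; the paper's actual matching procedure and metric may differ.

```python
import numpy as np

def impedance_spectrum(force, velocity, eps=1e-9):
    """Mechanical impedance Z(w) = F(w) / V(w) from time series
    sampled at a fixed rate (textbook definition; eps guards
    against division by near-zero spectral bins)."""
    F = np.fft.rfft(force)
    V = np.fft.rfft(velocity)
    return F / (V + eps)

def impedance_mismatch(force_sim, vel_sim, force_real, vel_real):
    """Scalar sim-to-real gap: mean log-magnitude error between the
    simulated and real impedance spectra (an assumed metric)."""
    Zs = impedance_spectrum(force_sim, vel_sim)
    Zr = impedance_spectrum(force_real, vel_real)
    return float(np.mean(np.abs(np.log(np.abs(Zs) + 1e-9) - np.log(np.abs(Zr) + 1e-9))))
```

Minimizing such a mismatch over simulator parameters, and using its spread to bound dynamics randomization, is the structured guideline the framework provides.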
https://arxiv.org/abs/2404.15096
The PLAID (Performance-optimized Late Interaction Driver) algorithm for ColBERTv2 uses clustered term representations to retrieve and progressively prune documents for final (exact) document scoring. In this paper, we reproduce and fill in missing gaps from the original work. By studying the parameters PLAID introduces, we find that its Pareto frontier is formed of a careful balance among its three parameters; deviations beyond the suggested settings can substantially increase latency without necessarily improving its effectiveness. We then compare PLAID with an important baseline missing from the paper: re-ranking a lexical system. We find that applying ColBERTv2 as a re-ranker atop an initial pool of BM25 results provides better efficiency-effectiveness trade-offs in low-latency settings. However, re-ranking cannot reach peak effectiveness at higher latency settings due to limitations in recall of lexical matching and provides a poor approximation of an exhaustive ColBERTv2 search. We find that recently proposed modifications to re-ranking that pull in the neighbors of top-scoring documents overcome this limitation, providing a Pareto frontier across all operational points for ColBERTv2 when evaluated using a well-annotated dataset. Curious about why re-ranking methods are highly competitive with PLAID, we analyze the token representation clusters PLAID uses for retrieval and find that most clusters are predominantly aligned with a single token and vice versa. Given the competitive trade-offs that re-ranking baselines exhibit, this work highlights the importance of carefully selecting pertinent baselines when evaluating the efficiency of retrieval engines.
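The re-ranking baseline is conceptually simple: take a first-stage candidate pool (e.g. from BM25) and re-score it with ColBERT-style late interaction, where each query token vector takes its maximum similarity over the document's token vectors (MaxSim). A minimal sketch, with dense token matrices standing in for real ColBERTv2 encodings:

```python
import numpy as np

def maxsim(Q, D):
    """ColBERT-style late-interaction score: sum over query tokens of
    the max similarity against any document token.

    Q : (n_query_tokens, dim), D : (n_doc_tokens, dim)."""
    return float(np.max(Q @ D.T, axis=1).sum())

def rerank(query_vecs, pool):
    """Re-rank a first-stage (e.g. BM25) candidate pool by MaxSim.

    pool : list of (doc_id, doc_token_matrix) pairs."""
    return sorted(pool, key=lambda p: maxsim(query_vecs, p[1]), reverse=True)
```

The abstract's finding is that this pipeline, optionally extended with neighbors of top-scoring documents to recover lexical-matching recall misses, traces a competitive efficiency-effectiveness Pareto frontier against PLAID's progressive pruning.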
https://arxiv.org/abs/2404.14989
Creating artistic 3D scenes can be time-consuming and requires specialized knowledge. To address this, recent works such as ARF, use a radiance field-based approach with style constraints to generate 3D scenes that resemble a style image provided by the user. However, these methods lack fine-grained control over the resulting scenes. In this paper, we introduce Controllable Artistic Radiance Fields (CoARF), a novel algorithm for controllable 3D scene stylization. CoARF enables style transfer for specified objects, compositional 3D style transfer and semantic-aware style transfer. We achieve controllability using segmentation masks with different label-dependent loss functions. We also propose a semantic-aware nearest neighbor matching algorithm to improve the style transfer quality. Our extensive experiments demonstrate that CoARF provides user-specified controllability of style transfer and superior style transfer quality with more precise feature matching.
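Semantic-aware nearest neighbor matching restricts each content feature's search to style features carrying the same semantic label, so style statistics transfer between corresponding regions (sky to sky, building to building). A simplified numpy sketch of that label-restricted search follows; the fallback to an unrestricted search for labels absent from the style image is an assumption, not necessarily the paper's behavior.

```python
import numpy as np

def semantic_nearest_neighbors(content, content_labels, style, style_labels):
    """For each content feature, return the index of the nearest style
    feature sharing its semantic label.

    content : (n, d) features with (n,) integer labels from segmentation.
    style   : (m, d) features with (m,) integer labels.
    """
    idx = np.empty(len(content), dtype=int)
    for i, (c, lab) in enumerate(zip(content, content_labels)):
        mask = style_labels == lab
        # fall back to an unrestricted search if the label is missing
        cand = np.where(mask)[0] if mask.any() else np.arange(len(style))
        d = np.linalg.norm(style[cand] - c, axis=1)
        idx[i] = cand[np.argmin(d)]
    return idx
```

Without the label restriction, a content feature can match a geometrically closer but semantically wrong style feature, which is exactly the failure mode the semantic-aware variant avoids.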
https://arxiv.org/abs/2404.14967
Information Retrieval (IR) systems are crucial tools for users to access information, widely applied in scenarios like search engines, question answering, and recommendation systems. Traditional IR methods, based on similarity matching to return ranked lists of documents, have been reliable means of information acquisition, dominating the IR field for years. With the advancement of pre-trained language models, generative information retrieval (GenIR) has emerged as a novel paradigm, gaining increasing attention in recent years. Currently, research in GenIR can be categorized into two aspects: generative document retrieval (GR) and reliable response generation. GR leverages the generative model's parameters for memorizing documents, enabling retrieval by directly generating relevant document identifiers without explicit indexing. Reliable response generation, on the other hand, employs language models to directly generate the information users seek, breaking the limitations of traditional IR in terms of document granularity and relevance matching, offering more flexibility, efficiency, and creativity, thus better meeting practical needs. This paper aims to systematically review the latest research progress in GenIR. We will summarize the advancements in GR regarding model training, document identifier, incremental learning, downstream tasks adaptation, multi-modal GR and generative recommendation, as well as progress in reliable response generation in aspects of internal knowledge memorization, external knowledge augmentation, generating response with citations and personal information assistant. We also review the evaluation, challenges and future prospects in GenIR systems. This review aims to offer a comprehensive reference for researchers in the GenIR field, encouraging further development in this area.
https://arxiv.org/abs/2404.14851
Recently, integrating visual controls into text-to-image (T2I) models, such as the ControlNet method, has received significant attention for its finer control capabilities. While various training-free methods strive to enhance prompt following in T2I models, the issue with visual control is still rarely studied, especially in scenarios where visual controls are misaligned with text prompts. In this paper, we address the challenge of "Prompt Following With Visual Control" and propose a training-free approach named Mask-guided Prompt Following (MGPF). Object masks are introduced to distinguish the aligned and misaligned parts of visual controls and prompts. Meanwhile, a network, dubbed Masked ControlNet, is designed to utilize these object masks for object generation in the misaligned visual control regions. Further, to improve attribute matching, a simple yet efficient loss is designed to align the attention maps of attributes with object regions constrained by ControlNet and the object masks. The efficacy and superiority of MGPF are validated through comprehensive quantitative and qualitative experiments.
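One simple way to "align the attention maps of attributes with object regions" is to penalize the fraction of an attribute's attention mass that falls outside its object mask. The sketch below implements that assumed form; the paper's actual loss may be defined differently.

```python
import numpy as np

def attribute_region_loss(attn, mask, eps=1e-9):
    """Fraction of an attribute's attention mass outside its object mask
    (an assumed attention-alignment loss, 0 = fully inside the region).

    attn : (H, W) non-negative attention map for one attribute token.
    mask : (H, W) binary object mask from segmentation.
    """
    inside = float((attn * mask).sum())
    total = float(attn.sum()) + eps
    return 1.0 - inside / total
```

Driving this quantity toward zero during sampling pushes each attribute's attention into the mask-designated object region, which is what improves attribute matching.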
https://arxiv.org/abs/2404.14768
Recent progress in large-scale pre-training has led to the development of advanced vision-language models (VLMs) with remarkable proficiency in comprehending and generating multimodal content. Despite VLMs' impressive ability to perform complex reasoning, current models often struggle to effectively and precisely capture the compositional information on both the image and text sides. To address this, we propose FineMatch, a new aspect-based fine-grained text and image matching benchmark, focusing on text and image mismatch detection and correction. This benchmark introduces a novel task for boosting and evaluating the VLMs' compositionality for aspect-based fine-grained text and image matching. In this task, models are required to identify mismatched aspect phrases within a caption, determine the aspect's class, and propose corrections for an image-text pair that may contain between 0 and 3 mismatches. To evaluate the models' performance on this new task, we propose a new evaluation metric named ITM-IoU, for which our experiments show a high correlation to human evaluation. In addition, we also provide a comprehensive experimental analysis of existing mainstream VLMs, including fully supervised learning and in-context learning settings. We have found that models trained on FineMatch demonstrate enhanced proficiency in detecting fine-grained text and image mismatches. Moreover, models (e.g., GPT-4V, Gemini Pro Vision) with strong abilities to perform multimodal in-context learning are not as skilled at fine-grained compositional image and text matching analysis. With FineMatch, we are able to build a system for text-to-image generation hallucination detection and correction.
https://arxiv.org/abs/2404.14715
In this paper, we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences, NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One existing possible solution is to use multi-task learning, where narrative language and evaluative information are predicted separately. However, this approach results in reduced performance for individual tasks because of variations between tasks and differences in modality between language information and evaluation information. To address this, we propose a prompt-guided multimodal interaction framework. This framework utilizes a pair of transformers to facilitate the interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task, thus enabling task interactivity. To support further research in this field, we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally, we establish benchmarks for NAE. Extensive experiment results prove that our method outperforms separate learning methods and naive multi-task learning methods. Data and code are released at this https URL.
https://arxiv.org/abs/2404.14471
We address the task of estimating camera parameters from a set of images depicting a scene. Popular feature-based structure-from-motion (SfM) tools solve this task by incremental reconstruction: they repeat triangulation of sparse 3D points and registration of more camera views to the sparse point cloud. We re-interpret incremental structure-from-motion as an iterated application and refinement of a visual relocalizer, that is, of a method that registers new views to the current state of the reconstruction. This perspective allows us to investigate alternative visual relocalizers that are not rooted in local feature matching. We show that scene coordinate regression, a learning-based relocalization approach, allows us to build implicit, neural scene representations from unposed images. Different from other learning-based reconstruction methods, we do not require pose priors nor sequential inputs, and we optimize efficiently over thousands of images. Our method, ACE0 (ACE Zero), estimates camera poses to an accuracy comparable to feature-based SfM, as demonstrated by novel view synthesis. Project page: this https URL
https://arxiv.org/abs/2404.14351
Heterogeneous Face Recognition (HFR) aims to expand the applicability of Face Recognition (FR) systems to challenging scenarios, enabling the matching of face images across different domains, such as matching thermal images to visible spectra. However, the development of HFR systems is challenging because of the significant domain gap between modalities and the lack of availability of large-scale paired multi-channel data. In this work, we leverage a pretrained face recognition model as a teacher network to learn domain-invariant network layers called Domain-Invariant Units (DIU) to reduce the domain gap. The proposed DIU can be trained effectively in a contrastive distillation framework, even with a limited amount of paired training data. This proposed approach has the potential to enhance pretrained models, making them more adaptable to a wider range of variations in data. We extensively evaluate our approach on multiple challenging benchmarks, demonstrating superior performance compared to state-of-the-art methods.
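A contrastive distillation objective typically pulls each student embedding toward its own teacher embedding while pushing it away from other samples in the batch. The InfoNCE-style sketch below is one generic form of such a loss; the paper's exact objective for training the DIU layers is not specified here, so treat every detail (temperature, normalization) as an assumption.

```python
import numpy as np

def contrastive_distillation_loss(student, teacher, tau=0.1):
    """InfoNCE-style contrastive distillation (a generic form).

    student, teacher : (batch, dim) embeddings, assumed L2-normalized;
    matched student/teacher pairs sit on the diagonal.
    """
    logits = student @ teacher.T / tau             # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))     # cross-entropy on matched pairs
```

Because negatives come from within the batch, the loss remains informative even with a limited amount of paired data, which matches the setting described in the abstract.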
https://arxiv.org/abs/2404.14343
Heterogeneous Face Recognition (HFR) focuses on matching faces from different domains, for instance, thermal to visible images, making Face Recognition (FR) systems more versatile for challenging scenarios. However, the domain gap between these domains and the limited large-scale datasets in the target HFR modalities make it challenging to develop robust HFR models from scratch. In our work, we view different modalities as distinct styles and propose a method to modulate feature maps of the target modality to address the domain gap. We present a new Conditional Adaptive Instance Modulation (CAIM) module that seamlessly fits into existing FR networks, turning them into HFR-ready systems. The CAIM block modulates intermediate feature maps, efficiently adapting to the style of the source modality and bridging the domain gap. Our method enables end-to-end training using a small set of paired samples. We extensively evaluate the proposed approach on various challenging HFR benchmarks, showing that it outperforms state-of-the-art methods. The source code and protocols for reproducing the findings will be made publicly available.
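Treating modalities as styles and modulating feature maps follows the adaptive-instance-normalization pattern: instance-normalize each channel, then scale and shift with modulation parameters. In CAIM those parameters would come from a conditioning network; the sketch below takes them as direct inputs, which is a simplification.

```python
import numpy as np

def caim_modulate(feat, gamma, beta, eps=1e-5):
    """Adaptive instance modulation of a (C, H, W) feature map:
    instance-normalize each channel, then scale/shift with gamma/beta.

    gamma, beta : (C,) modulation parameters (in CAIM these would be
    produced by a conditioning network; here they are passed directly).
    """
    mu = feat.mean(axis=(1, 2), keepdims=True)
    sigma = feat.std(axis=(1, 2), keepdims=True)
    normalized = (feat - mu) / (sigma + eps)
    return gamma[:, None, None] * normalized + beta[:, None, None]
```

Because the module only rescales per-channel statistics, it can be slotted between the frozen layers of an existing FR network, which is what makes the approach plug-in and trainable from a small set of paired samples.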
https://arxiv.org/abs/2404.14247
The user base of short video apps has experienced unprecedented growth in recent years, resulting in a significant demand for video content analysis. In particular, text-video retrieval, which aims to find the top matching videos given text descriptions from a vast video corpus, is an essential function, the primary challenge of which is to bridge the modality gap. Nevertheless, most existing approaches treat texts merely as discrete tokens and neglect their syntax structures. Moreover, the abundant spatial and temporal clues in videos are often underutilized due to the lack of interaction with text. To address these issues, we argue that using texts as guidance to focus on relevant temporal frames and spatial regions within videos is beneficial. In this paper, we propose a novel Syntax-Hierarchy-Enhanced text-video retrieval method (SHE-Net) that exploits the inherent semantic and syntax hierarchy of texts to bridge the modality gap from two perspectives. First, to facilitate a more fine-grained integration of visual content, we employ the text syntax hierarchy, which reveals the grammatical structure of text descriptions, to guide the visual representations. Second, to further enhance the multi-modal interaction and alignment, we also utilize the syntax hierarchy to guide the similarity calculation. We evaluated our method on four public text-video retrieval datasets of MSR-VTT, MSVD, DiDeMo, and ActivityNet. The experimental results and ablation studies confirm the advantages of our proposed method.
https://arxiv.org/abs/2404.14066