There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model, GPT-4o, obtains only ~50% on our text and video scores, showing a large gap compared to the human baseline of ~90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random-chance performance. Through this work, we shed light on the fact that temporal reasoning in short videos is a problem yet to be fully solved. The dataset and evaluation code are available at this https URL.
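The "text" and "video" scores above follow the counterfactual-pair protocol popularized by Winoground, where each example provides two videos and two captions that differ only in temporal order. Below is a minimal sketch of that scoring scheme under the assumption that Vinoground uses the same definitions; the `score(video, caption)` callable is a placeholder for any model-specific matching function (e.g., a CLIP similarity or an LMM's preference), not the benchmark's actual API.

```python
# Hedged sketch: Winoground-style text/video/group scores for one
# counterfactual pair (v0, c0) / (v1, c1). `score` is a placeholder.

def text_score(score, v0, c0, v1, c1):
    # Each video must prefer its own caption over the counterfactual one.
    return score(v0, c0) > score(v0, c1) and score(v1, c1) > score(v1, c0)

def video_score(score, v0, c0, v1, c1):
    # Each caption must prefer its own video over the counterfactual one.
    return score(v0, c0) > score(v1, c0) and score(v1, c1) > score(v0, c1)

def group_score(score, v0, c0, v1, c1):
    return text_score(score, v0, c0, v1, c1) and video_score(score, v0, c0, v1, c1)

def benchmark_accuracy(score, pairs):
    """pairs: iterable of (v0, c0, v1, c1) counterfactual examples."""
    n = text = video = group = 0
    for v0, c0, v1, c1 in pairs:
        n += 1
        text += text_score(score, v0, c0, v1, c1)
        video += video_score(score, v0, c0, v1, c1)
        group += group_score(score, v0, c0, v1, c1)
    return text / n, video / n, group / n
```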
https://arxiv.org/abs/2410.02763
Information Retrieval (IR) methods, which aim to identify relevant documents in response to a given query, have gained remarkable attention due to their successful application in various natural language tasks. However, existing approaches typically consider only the textual information within the documents, which overlooks the fact that documents can contain multiple modalities, including texts, images, and tables. Further, they often segment each long document into multiple discrete passages for embedding, preventing them from capturing the overall document context and interactions between paragraphs. We argue that these two limitations lead to suboptimal document representations for retrieval. In this work, to address them, we aim to produce more comprehensive and nuanced document representations by holistically embedding documents interleaved with different modalities. Specifically, we achieve this by leveraging the capability of recent vision-language models that enable the processing and integration of text, images, and tables into a unified format and representation. Moreover, to mitigate the information loss from segmenting documents into passages, instead of representing and retrieving passages individually, we further merge the representations of segmented passages into one single document representation, while we additionally introduce a reranking strategy to decouple and identify the relevant passage within the document if necessary. Then, through extensive experiments on diverse information retrieval scenarios considering both the textual and multimodal queries, we show that our approach substantially outperforms relevant baselines, thanks to the consideration of the multimodal information interleaved within the documents in a unified way.
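A minimal sketch of the two-stage idea described above: fuse the embeddings of a document's (possibly multimodal) passages into a single document vector for retrieval, then rerank passages inside the retrieved document to localize the relevant one. Mean pooling and dot-product scoring are placeholder choices here, not necessarily the paper's exact operators.

```python
import numpy as np

def document_embedding(passage_embs: np.ndarray) -> np.ndarray:
    """passage_embs: (num_passages, dim) vectors from a vision-language encoder."""
    doc = passage_embs.mean(axis=0)                  # holistic document vector
    return doc / (np.linalg.norm(doc) + 1e-8)

def retrieve_then_rerank(query_emb: np.ndarray, docs):
    """docs: list of (doc_id, passage_embs) tuples."""
    # Stage 1: score whole documents with the fused representation.
    doc_scores = [(doc_id, float(query_emb @ document_embedding(p))) for doc_id, p in docs]
    best_id, _ = max(doc_scores, key=lambda t: t[1])
    # Stage 2: decouple the document back into passages and pick the best one.
    best_passages = dict(docs)[best_id]
    best_passage = int(np.argmax(best_passages @ query_emb))
    return best_id, best_passage
```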
https://arxiv.org/abs/2410.02729
Text-to-image (T2I) diffusion models have drawn attention for their ability to generate high-quality images with precise text alignment. However, these models can also be misused to produce inappropriate content. Existing safety measures, which typically rely on text classifiers or ControlNet-like approaches, are often insufficient. Traditional text classifiers rely on large-scale labeled datasets and can be easily bypassed by rephrasing. As diffusion models continue to scale, fine-tuning these safeguards becomes increasingly challenging and lacks flexibility. Recent red-teaming attack research further underscores the need for a new paradigm to prevent the generation of inappropriate content. In this paper, we introduce SteerDiff, a lightweight adaptor module designed to act as an intermediary between user input and the diffusion model, ensuring that generated images adhere to ethical and safety standards with little to no impact on usability. SteerDiff identifies and manipulates inappropriate concepts within the text embedding space to guide the model away from harmful outputs. We conduct extensive experiments across various concept unlearning tasks to evaluate the effectiveness of our approach. Furthermore, we benchmark SteerDiff against multiple red-teaming strategies to assess its robustness. Finally, we explore the potential of SteerDiff for concept forgetting tasks, demonstrating its versatility in text-conditioned image generation.
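One plausible way to "manipulate inappropriate concepts within the text embedding space", sketched below under assumptions: remove the component of each prompt token embedding that points along an unsafe concept direction before the embeddings reach the diffusion model's cross-attention. The concept direction here is a placeholder (e.g., a difference between unsafe and safe anchor-prompt embeddings); SteerDiff's actual adaptor is learned and may differ.

```python
import torch

def steer_away(prompt_emb: torch.Tensor, concept_dir: torch.Tensor,
               strength: float = 1.0) -> torch.Tensor:
    """prompt_emb: (seq_len, dim) text-encoder states; concept_dir: (dim,)."""
    d = concept_dir / concept_dir.norm()
    # Component of every token embedding along the unsafe concept direction.
    coeff = prompt_emb @ d                       # (seq_len,)
    # Only remove components that point toward the concept.
    coeff = coeff.clamp(min=0.0)
    return prompt_emb - strength * coeff.unsqueeze(-1) * d
```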
https://arxiv.org/abs/2410.02710
Unneeded elements in the attention's context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention improves language modeling performance in a variety of model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to standard transformers with ~2X more heads and parameters in their attention modules. Selective attention also allows decreasing the size of the attention's context buffer, leading to meaningful reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters trained on C4 with context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention than without it, at the same validation perplexity.
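The abstract does not spell out the mechanism; the sketch below is one reading of "reducing attention to unneeded elements": a selection signal marks earlier tokens as unneeded, its effect accumulates over time, and the accumulated penalty is subtracted from the attention logits. Which head supplies the selection signal and the exact causal shifting are assumptions here, not the paper's definitive formulation.

```python
import torch
import torch.nn.functional as F

def selective_attention(q, k, v, sel_logits, causal_mask):
    """
    q, k, v:     (T, d) single-head projections
    sel_logits:  (T, T) selection scores; sel_logits[i, j] = how unneeded
                 token j looks to token i (only j < i is meaningful)
    causal_mask: (T, T) boolean, True where attention is allowed
    """
    T, d = q.shape
    logits = (q @ k.t()) / d ** 0.5                    # standard attention logits
    sel = sel_logits.clamp(min=0.0) * causal_mask      # non-negative, causal
    # accum[i, j]: total penalty applied to token j by tokens strictly before i.
    accum = torch.cumsum(sel, dim=0)
    accum = torch.cat([torch.zeros_like(accum[:1]), accum[:-1]], dim=0)
    logits = logits - accum                            # down-weight unneeded tokens
    logits = logits.masked_fill(~causal_mask, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```

Under this reading, the memory savings come from pruning: tokens whose accumulated penalty exceeds a threshold can be dropped from the context buffer (KV cache) entirely.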
https://arxiv.org/abs/2410.02703
Immunogenicity prediction is a central topic in reverse vaccinology for finding candidate vaccines that can trigger protective immune responses. Existing approaches typically rely on highly compressed features and simple model architectures, leading to limited prediction accuracy and poor generalizability. To address these challenges, we introduce ProVaccine, a novel deep learning solution with a dual attention mechanism that integrates pre-trained latent vector representations of protein sequences and structures. We also compile the most comprehensive immunogenicity dataset to date, encompassing over 9,500 antigen sequences, structures, and immunogenicity labels from bacteria, viruses, and tumors. Extensive experiments demonstrate that ProVaccine outperforms existing methods across a wide range of evaluation metrics. Furthermore, we establish a post-hoc validation protocol to assess the practical significance of deep learning models in tackling vaccine design challenges. Our work provides an effective tool for vaccine design and sets valuable benchmarks for future research.
https://arxiv.org/abs/2410.02647
Information retrieval (IR) systems have played a vital role in modern digital life and have cemented their continued usefulness in this new era of generative AI via retrieval-augmented generation. With strong language processing capabilities and remarkable versatility, large language models (LLMs) have become popular choices for zero-shot re-ranking in IR systems. So far, LLM-based re-ranking methods rely on strong generative capabilities, which restricts their use to either specialized or powerful proprietary models. Given these restrictions, we ask: is autoregressive generation necessary and optimal for LLMs to perform re-ranking? We hypothesize that there are abundant signals relevant to re-ranking within LLMs that might not be used to their full potential via generation. To more directly leverage such signals, we propose in-context re-ranking (ICR), a novel method that leverages the change in attention pattern caused by the search query for accurate and efficient re-ranking. To mitigate the intrinsic biases in LLMs, we propose a calibration method using a content-free query. Due to the absence of generation, ICR only requires two ($O(1)$) forward passes to re-rank $N$ documents, making it substantially more efficient than generative re-ranking methods that require at least $O(N)$ forward passes. Our novel design also enables ICR to be applied to any LLM without specialized training while guaranteeing a well-formed ranking. Extensive experiments with two popular open-weight LLMs on standard single-hop and multi-hop information retrieval benchmarks show that ICR outperforms RankGPT while cutting the latency by more than 60% in practice. Through detailed analyses, we show that ICR's performance is especially strong on tasks that require more complex re-ranking signals. Our findings call for further exploration on novel ways of utilizing open-weight LLMs beyond text generation.
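A hedged sketch of the two-forward-pass idea: with all candidate documents and the query packed into one prompt, each document is scored by the attention mass flowing from the query tokens to that document's tokens; a second pass with a content-free query (e.g., "N/A") yields per-document bias scores that are subtracted for calibration. The prompt layout and the aggregation over layers and heads below are assumptions, not ICR's exact design.

```python
import torch

def icr_scores(attn, query_span, doc_spans):
    """
    attn:       (layers, heads, T, T) attention weights from one forward pass
    query_span: (start, end) token indices of the query in the prompt
    doc_spans:  list of (start, end) token indices, one per candidate document
    """
    qs, qe = query_span
    scores = []
    for ds, de in doc_spans:
        # Attention from query tokens to this document's tokens.
        mass = attn[:, :, qs:qe, ds:de].sum(dim=-1)   # (layers, heads, q_len)
        scores.append(mass.mean().item())
    return torch.tensor(scores)

def rerank(attn_query, attn_contentfree, query_span, cf_span, doc_spans):
    raw = icr_scores(attn_query, query_span, doc_spans)
    bias = icr_scores(attn_contentfree, cf_span, doc_spans)
    calibrated = raw - bias
    return torch.argsort(calibrated, descending=True)  # best documents first
```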
https://arxiv.org/abs/2410.02642
Deformable image registration is crucial for aligning medical images in a non-linear fashion across different modalities, allowing for precise spatial correspondence between varying anatomical structures. This paper presents NestedMorph, a novel network utilizing a Nested Attention Fusion approach to improve intra-subject deformable registration between T1-weighted (T1w) MRI and diffusion MRI (dMRI) data. NestedMorph integrates high-resolution spatial details from an encoder with semantic information from a decoder using a multi-scale framework, enhancing both local and global feature extraction. Our model notably outperforms existing methods, including CNN-based approaches like VoxelMorph, MIDIR, and CycleMorph, as well as Transformer-based models such as TransMorph and ViT-V-Net, and traditional techniques like NiftyReg and SyN. Evaluations on the HCP dataset demonstrate that NestedMorph achieves superior performance across key metrics, including SSIM, HD95, and SDlogJ, with the highest SSIM of 0.89, and the lowest HD95 of 2.5 and SDlogJ of 0.22. These results highlight NestedMorph's ability to capture both local and global image features effectively, leading to superior registration performance. The promising outcomes of this study underscore NestedMorph's potential to significantly advance deformable medical image registration, providing a robust framework for future research and clinical applications. The source code and our implementation are available at: this https URL
https://arxiv.org/abs/2410.02550
Large Language Models (LLMs), known for their versatility in textual data, are increasingly being explored for their potential to enhance medical image segmentation, a crucial task for accurate diagnostic imaging. This study explores enhancing Vision Transformers (ViTs) for medical image segmentation by integrating pre-trained LLM transformer blocks. Our approach, which incorporates a frozen LLM transformer block into the encoder of a ViT-based model, leads to substantial improvements in segmentation performance across various medical imaging modalities. We propose a Hybrid Attention Mechanism that combines global and local feature learning with a Multi-Scale Fusion Block for aggregating features across different scales. The enhanced model shows significant performance gains, including an average Dice score increase from 0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index. These results demonstrate the effectiveness of LLM-based transformers in refining medical image segmentation, highlighting their potential to significantly boost model accuracy and robustness. The source code and our implementation are available at: this https URL
https://arxiv.org/abs/2410.02458
With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although pursuing novelty, diversity, or uncertainty attracts increasing attention, the redundant effort brought by exploration without proper guidance poses a practical issue for the community. This paper introduces a systematic approach, termed LEMAE, which channels informative task-relevant guidance from a knowledgeable Large Language Model (LLM) for Efficient Multi-Agent Exploration. Specifically, we ground linguistic knowledge from the LLM into symbolic key states, which are critical for task fulfillment, in a discriminative manner at low LLM inference costs. To unleash the power of key states, we design Subspace-based Hindsight Intrinsic Reward (SHIR) to guide agents toward key states by increasing reward density. Additionally, we build the Key State Memory Tree (KSMT) to track transitions between key states in a specific task for organized exploration. Benefiting from diminished redundant exploration, LEMAE outperforms existing SOTA approaches on challenging benchmarks (e.g., SMAC and MPE) by a large margin, achieving a 10x acceleration in certain scenarios.
https://arxiv.org/abs/2410.02511
Customized Image Generation, generating customized images with user-specified concepts, has attracted significant attention due to its creativity and novelty. With impressive progress achieved in subject customization, some pioneering works further explored the customization of action and interaction beyond entity (i.e., human, animal, and object) appearance. However, these approaches only focus on basic actions and interactions between two entities, and their effectiveness is limited by the scarcity of ''exactly the same'' reference images. To extend customized image generation to more complex scenes for general real-world applications, we propose a new task: event-customized image generation. Given a single reference image, we define the ''event'' as all specific actions, poses, relations, or interactions between different entities in the scene. This task aims at accurately capturing the complex event and generating customized images with various target entities. To solve this task, we propose a novel training-free event customization method: FreeEvent. Specifically, FreeEvent introduces two extra paths alongside the general diffusion denoising process: 1) Entity switching path: it applies cross-attention guidance and regulation for target entity generation. 2) Event transferring path: it injects the spatial feature and self-attention maps from the reference image into the target image for event generation. To further facilitate this new task, we collected two evaluation benchmarks: SWiG-Event and Real-Event. Extensive experiments and ablations have demonstrated the effectiveness of FreeEvent.
https://arxiv.org/abs/2410.02483
As diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI), the study of their memorization of training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn via memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for trustworthy application of GenAI. Existing works revealed that conditional DPMs are more prone to training data memorization than unconditional DPMs, and the data extraction methods motivated by this finding are mostly designed for conditional DPMs. However, these understandings are primarily empirical, and extracting training data from unconditional models has been found to be extremely challenging. In this work, we provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence. Our theoretical analysis indicates that extracting data from unconditional models can also be effective by constructing a proper surrogate condition. Based on this result, we propose a novel data extraction method named \textbf{Surrogate condItional Data Extraction (SIDE)} that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs. Empirical results demonstrate that our SIDE can extract training data in challenging scenarios where previous methods fail, and it is, on average, over 50\% more effective across different scales of the CelebA dataset.
https://arxiv.org/abs/2410.02467
This paper introduces a new hybrid descriptor for 3D point matching and point cloud registration, combining local geometrical properties and learning-based feature propagation to describe each point's neighborhood structure. The proposed architecture first extracts prior geometrical information by computing each point's planarity, anisotropy, and omnivariance using Principal Component Analysis (PCA). This prior information is complemented by a descriptor based on normal vectors estimated from a triangle-based neighborhood construction. The final geometrical descriptor is propagated between the points using local graph convolutions and attention mechanisms. The new feature extractor is evaluated on ModelNet40, the Stanford Bunny dataset, KITTI, and MVP (Multi-View Partial)-RG for point cloud registration and shows interesting results, particularly on noisy and low-overlap point clouds.
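For reference, planarity, anisotropy, and omnivariance are standard eigenvalue-based local shape features computed from a PCA of each point's neighborhood; the sketch below uses the usual (Weinmann-style) definitions, which the paper presumably follows or closely approximates.

```python
import numpy as np

def local_shape_features(neighborhood: np.ndarray):
    """neighborhood: (k, 3) coordinates of a point's k nearest neighbors."""
    centered = neighborhood - neighborhood.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(neighborhood)
    lam = np.sort(np.linalg.eigvalsh(cov))[::-1]          # λ1 >= λ2 >= λ3 >= 0
    lam = np.clip(lam, 1e-12, None)
    planarity    = (lam[1] - lam[2]) / lam[0]
    anisotropy   = (lam[0] - lam[2]) / lam[0]
    omnivariance = float(np.cbrt(lam[0] * lam[1] * lam[2]))
    return planarity, anisotropy, omnivariance
```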
https://arxiv.org/abs/2410.02420
The Diffusion Model has not only garnered noteworthy achievements in the realm of image generation but has also demonstrated its potential as an effective pretraining method utilizing unlabeled data. Drawing from the extensive potential unveiled by the Diffusion Model in both semantic correspondence and open vocabulary segmentation, our work initiates an investigation into employing the Latent Diffusion Model for Few-shot Semantic Segmentation. Recently, inspired by the in-context learning ability of large language models, Few-shot Semantic Segmentation has evolved into In-context Segmentation tasks, morphing into a crucial element in assessing generalist segmentation models. In this context, we concentrate on Few-shot Semantic Segmentation, establishing a solid foundation for the future development of a Diffusion-based generalist model for segmentation. Our initial focus lies in understanding how to facilitate interaction between the query image and the support image, resulting in the proposal of a KV fusion method within the self-attention framework. Subsequently, we delve deeper into optimizing the infusion of information from the support mask and simultaneously re-evaluating how to provide reasonable supervision from the query mask. Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework and effectively utilizing the pre-training prior. Experimental results demonstrate that our method significantly outperforms the previous SOTA models in multiple settings.
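A minimal sketch of what a "KV fusion" step inside self-attention could look like, assuming it amounts to letting the query image's tokens attend over a concatenation of their own keys/values and the support image's keys/values so that support information flows into the query features. Projection weights and any use of the support mask are omitted here and should be treated as assumptions rather than the paper's exact design.

```python
import torch
import torch.nn.functional as F

def kv_fused_attention(q_query, kv_query, kv_support):
    """
    q_query:    (Tq, d) query-image token queries
    kv_query:   tuple of (Tq, d) keys and values from the query image
    kv_support: tuple of (Ts, d) keys and values from the support image
    """
    k = torch.cat([kv_query[0], kv_support[0]], dim=0)    # (Tq + Ts, d)
    v = torch.cat([kv_query[1], kv_support[1]], dim=0)
    d = q_query.shape[-1]
    attn = F.softmax(q_query @ k.t() / d ** 0.5, dim=-1)  # (Tq, Tq + Ts)
    return attn @ v                                        # fused query features
```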
https://arxiv.org/abs/2410.02369
Mamba, a special case of the State Space Model, is gaining popularity as an alternative to template-based deep learning approaches in medical image analysis. While transformers are powerful architectures, they have drawbacks, including quadratic computational complexity and an inability to address long-range dependencies efficiently. This limitation affects the analysis of large and complex datasets in medical imaging, where there are many spatial and temporal relationships. In contrast, Mamba offers benefits that make it well-suited for medical image analysis. It has linear time complexity, which is a significant improvement over transformers. Mamba processes longer sequences without attention mechanisms, enabling faster inference and requiring less memory. Mamba also demonstrates strong performance in merging multimodal data, improving diagnosis accuracy and patient outcomes. The organization of this paper allows readers to appreciate the capabilities of Mamba in medical imaging step by step. We begin by defining core concepts of SSMs and models, including S4, S5, and S6, followed by an exploration of Mamba architectures such as pure Mamba, U-Net variants, and hybrid models with convolutional neural networks, transformers, and Graph Neural Networks. We also cover Mamba optimizations, techniques and adaptations, scanning, datasets, applications, experimental results, and conclude with its challenges and future directions in medical imaging. This review aims to demonstrate the transformative potential of Mamba in overcoming existing barriers within medical imaging while paving the way for innovative advancements in the field. A comprehensive list of Mamba architectures applied in the medical field, reviewed in this work, is available at Github.
https://arxiv.org/abs/2410.02362
A standard way to evaluate the abilities of an LLM involves presenting a multiple-choice question and selecting the option with the highest logit as the model's predicted answer. However, such a format for evaluating LLMs has limitations, since even if the model knows the correct answer, it may struggle to select the corresponding letter simply due to difficulties in following this rigid format. To address this, we introduce new scores that better capture and reveal the model's underlying knowledge: the Query-Key Score (QK-score), derived from the interaction between query and key representations in attention heads, and the Attention Score, based on attention weights. These scores are extracted from specific \textit{select-and-copy} heads, which show consistent performance across popular Multi-Choice Question Answering (MCQA) datasets. Based on these scores, our method improves knowledge extraction, yielding up to 16\% gain for LLaMA2-7B and up to 10\% for larger models on popular MCQA benchmarks. At the same time, the accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60\%, achieving nearly perfect accuracy, thereby demonstrating the method's effectiveness in mitigating MCQA format limitations. To support our claims, we conduct experiments on models ranging from 7 billion to 70 billion parameters in both zero- and few-shot setups.
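A hedged sketch of the scoring idea: for a chosen "select-and-copy" head, take the query vector at the final prompt position and the key vectors at the option-letter positions; the option with the largest query-key dot product (QK-score), or alternatively the largest attention weight (Attention Score), is taken as the prediction. How the head is selected and which positions are used are assumptions here.

```python
import torch

def qk_score_answer(q_last: torch.Tensor, keys: torch.Tensor,
                    option_positions: list[int]) -> int:
    """
    q_last:  (d,) query vector of the chosen head at the last prompt token
    keys:    (T, d) key vectors of the same head over the prompt
    option_positions: token indices of the option letters (A, B, C, D)
    """
    scores = torch.stack([q_last @ keys[p] for p in option_positions])
    return int(torch.argmax(scores))          # index into the options
```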
https://arxiv.org/abs/2410.02343
The increasing demand for transparent and reliable models, particularly in high-stakes decision-making areas such as medical image analysis, has led to the emergence of eXplainable Artificial Intelligence (XAI). Post-hoc XAI techniques, which aim to explain black-box models after training, have been controversial in recent works concerning their fidelity to the models' predictions. In contrast, Self-eXplainable AI (S-XAI) offers a compelling alternative by incorporating explainability directly into the training process of deep learning models. This approach allows models to generate inherent explanations that are closely aligned with their internal decision-making processes. Such enhanced transparency significantly supports the trustworthiness, robustness, and accountability of AI systems in real-world medical applications. To facilitate the development of S-XAI methods for medical image analysis, this survey presents a comprehensive review across various image modalities and clinical applications. It covers more than 200 papers from three key perspectives: 1) input explainability through the integration of explainable feature engineering and knowledge graphs, 2) model explainability via attention-based learning, concept-based learning, and prototype-based learning, and 3) output explainability by providing counterfactual explanation and textual explanation. Additionally, this paper outlines the desired characteristics of explainability and existing evaluation methods for assessing explanation quality. Finally, it discusses the major challenges and future research directions in developing S-XAI for medical image analysis.
https://arxiv.org/abs/2410.02331
Integrating artificial intelligence into modern society is profoundly transformative, significantly enhancing productivity by streamlining various daily tasks. AI-driven recognition systems provide notable advantages in the food sector, including improved nutrient tracking, tackling food waste, and boosting food production and consumption efficiency. Accurate food classification is a crucial initial step in utilizing advanced AI models, as the effectiveness of this process directly influences the success of subsequent operations; therefore, achieving high accuracy at a reasonable speed is essential. Despite existing research efforts, a gap persists in improving performance while ensuring rapid processing times, prompting researchers to pursue cost-effective and precise models. This study addresses this gap by employing the state-of-the-art EfficientNetB7 architecture, enhanced through transfer learning, data augmentation, and the CBAM attention module. This methodology results in a robust model that surpasses previous studies in accuracy while maintaining rapid processing suitable for real-world applications. The Food11 dataset from Kaggle was utilized, comprising 16643 imbalanced images across 11 diverse classes with significant intra-category diversities and inter-category similarities. Furthermore, the proposed methodology, bolstered by various deep learning techniques, consistently achieves an impressive average accuracy of 96.40%. Notably, it can classify over 60 images within one second during inference on unseen data, demonstrating its ability to deliver high accuracy promptly. This underscores its potential for practical applications in accurate food classification and enhancing efficiency in subsequent processes.
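A minimal sketch of the described setup: an ImageNet-pretrained EfficientNet-B7 backbone, a CBAM attention block on its feature map, and an 11-way head for Food11. The framework choice (torchvision), which layers are frozen, and all hyperparameters are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel then spatial attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention from global average- and max-pooled descriptors.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention from channel-wise average and max maps.
        s = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class Food11Classifier(nn.Module):
    def __init__(self, num_classes: int = 11, freeze_backbone: bool = True):
        super().__init__()
        backbone = models.efficientnet_b7(weights="IMAGENET1K_V1")
        self.features = backbone.features            # convolutional feature extractor
        if freeze_backbone:
            for p in self.features.parameters():
                p.requires_grad = False
        self.cbam = CBAM(2560)                        # B7's final feature channels
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Dropout(0.3), nn.Linear(2560, num_classes))

    def forward(self, x):
        return self.head(self.cbam(self.features(x)))
```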
https://arxiv.org/abs/2410.02304
Modeling temporal characteristics plays a significant role in the representation learning of audio waveform. We propose Contrastive Long-form Language-Audio Pretraining (\textbf{CoLLAP}) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words), while enabling contrastive learning across modalities and temporal dynamics. Leveraging recent Music-LLMs to generate long-form music captions for full-length songs, augmented with musical temporal structures, we collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds. We propose a novel contrastive learning architecture that fuses language representations with structured audio representations by segmenting each song into clips and extracting their embeddings. With an attention mechanism, we capture multimodal temporal correlations, allowing the model to automatically weigh and enhance the final fusion score for improved contrastive alignment. Finally, we develop two variants of the CoLLAP model with different types of backbone language models. Through comprehensive experiments on multiple long-form music-text retrieval datasets, we demonstrate consistent performance improvement in retrieval accuracy compared with baselines. We also show the pretrained CoLLAP models can be transferred to various music information retrieval tasks, with heterogeneous long-form multimodal contexts.
https://arxiv.org/abs/2410.02271
Despite their remarkable capabilities, Large Language Models (LLMs) are found to be surprisingly sensitive to minor variations in prompts, often generating significantly divergent outputs in response to changes such as spelling errors, altered wording, or a different prompt template. However, while assessing the quality of an LLM, the focus often tends to be solely on its performance on downstream tasks, while very little to no attention is paid to prompt sensitivity. To fill this gap, we propose POSIX - a novel PrOmpt Sensitivity IndeX as a reliable measure of prompt sensitivity, thereby offering a more comprehensive evaluation of LLM performance. The key idea behind POSIX is to capture the relative change in log-likelihood of a given response upon replacing the corresponding prompt with a different intent-preserving prompt. We provide thorough empirical evidence demonstrating the efficacy of POSIX in capturing prompt sensitivity and subsequently use it to measure and thereby compare the prompt sensitivity of various open-source LLMs. We find that merely increasing the parameter count or instruction tuning does not necessarily reduce prompt sensitivity, whereas adding even a single few-shot exemplar almost always leads to a significant decrease in prompt sensitivity. We also find that alterations to the prompt template lead to the highest sensitivity in MCQ-type tasks, whereas paraphrasing results in the highest sensitivity in open-ended generation tasks. The code for reproducing our results is open-sourced at this https URL.
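One way to instantiate "the relative change in log-likelihood of a response upon swapping in an intent-preserving prompt" is sketched below: average, over ordered pairs of prompt variants, the length-normalized change in log-likelihood of a response when it is scored under another variant's prompt instead of its own. POSIX's exact normalization may differ; `loglik` is a placeholder for the model's log P(response | prompt).

```python
import itertools
import numpy as np

def prompt_sensitivity(loglik, prompts, responses):
    """loglik(prompt, response) -> float; prompts, responses: parallel lists."""
    n = len(prompts)
    deltas = []
    for i, j in itertools.permutations(range(n), 2):
        li = loglik(prompts[i], responses[j])   # response j under a swapped prompt
        lj = loglik(prompts[j], responses[j])   # response j under its own prompt
        deltas.append(abs(li - lj) / max(len(responses[j]), 1))
    return float(np.mean(deltas))               # higher = more prompt-sensitive
```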
https://arxiv.org/abs/2410.02185
Arabic handwritten text recognition (HTR) is challenging, especially for historical texts, due to diverse writing styles and the intrinsic features of Arabic script. Additionally, Arabic handwriting datasets are smaller compared to English ones, making it difficult to train generalizable Arabic HTR models. To address these challenges, we propose HATFormer, a transformer-based encoder-decoder architecture that builds on a state-of-the-art English HTR model. By leveraging the transformer's attention mechanism, HATFormer captures spatial contextual information to address the intrinsic challenges of Arabic script through differentiating cursive characters, decomposing visual representations, and identifying diacritics. Our customization to historical handwritten Arabic includes an image processor for effective ViT information preprocessing, a text tokenizer for compact Arabic text representation, and a training pipeline that accounts for a limited amount of historic Arabic handwriting data. HATFormer achieves a character error rate (CER) of 8.6% on the largest public historical handwritten Arabic dataset, with a 51% improvement over the best baseline in the literature. HATFormer also attains a comparable CER of 4.2% on the largest private non-historical dataset. Our work demonstrates the feasibility of adapting an English HTR method to a low-resource language with complex, language-specific challenges, contributing to advancements in document digitization, information retrieval, and cultural preservation.
https://arxiv.org/abs/2410.02179