Recent large language models (LLMs) have demonstrated exceptional performance on general-purpose text embedding tasks. While dense embeddings have dominated related research, we introduce the first Lexicon-based EmbeddiNgS (LENS) leveraging LLMs that achieve competitive performance on these tasks. To address the inherent tokenization redundancy and the unidirectional-attention limitation of traditional causal LLMs, LENS consolidates the vocabulary space through token embedding clustering and investigates bidirectional attention together with various pooling strategies. Specifically, LENS simplifies lexicon matching by assigning each dimension to a specific token cluster, in which semantically similar tokens are grouped together, and unlocks the full potential of LLMs through bidirectional attention. Extensive experiments demonstrate that LENS outperforms dense embeddings on the Massive Text Embedding Benchmark (MTEB), delivering compact feature representations that match the sizes of their dense counterparts. Notably, combining LENS with dense embeddings achieves state-of-the-art performance on the retrieval subset of MTEB (i.e., BEIR).
https://arxiv.org/abs/2501.09749
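A minimal sketch of the token-clustering step described in the LENS abstract above, assuming access to a causal LLM's input embedding matrix; the backbone (`gpt2`), cluster count, and max-pooling choice are illustrative assumptions, not the paper's exact configuration.

```python
# Cluster token embeddings so each lexicon dimension maps to one cluster of
# semantically similar tokens, consolidating the redundant vocabulary space.
import torch
from sklearn.cluster import KMeans
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in backbone
token_embeddings = model.get_input_embeddings().weight.detach().numpy()

n_clusters = 4000  # hypothetical target dimensionality
kmeans = KMeans(n_clusters=n_clusters, n_init=1, random_state=0).fit(token_embeddings)
cluster_of = torch.as_tensor(kmeans.labels_, dtype=torch.long)

def consolidate(vocab_logits: torch.Tensor) -> torch.Tensor:
    """Max-pool per-token logits into a per-cluster lexicon embedding."""
    out = torch.full((n_clusters,), float("-inf"))
    return out.scatter_reduce(0, cluster_of, vocab_logits, reduce="amax")
```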
With the increased use of the internet and social networks for online discussions, the spread of toxic and inappropriate content on social networking sites has also increased. Several studies have been conducted in different languages; however, little work has applied deep learning techniques to inappropriate content identification in South Asian languages. In Urdu, spellings are not unique, and people write different common spellings for the same word, while code-mixing with other languages, such as English, makes the text even more challenging; limited research is available on processing such a language with state-of-the-art algorithms. Adding an attention layer to a deep learning model can help handle long-term dependencies and increase its efficiency. To explore the effect of the attention layer, this study proposes an attention-based bidirectional GRU hybrid model for identifying inappropriate content in Urdu Unicode text. Four baseline deep learning models (LSTM, Bi-LSTM, GRU, and TCN) are used to compare the performance of the proposed model. The results of these models were compared based on evaluation metrics, dataset size, and the impact of the word embedding layer, using pre-trained Urdu word2vec embeddings in our case. Our proposed model, BiGRU-A, outperformed all other baseline models, yielding 84% accuracy without the pre-trained word2vec layer. From our experiments, we established that the attention layer improves the model's efficiency and that pre-trained word2vec embeddings do not work well with an inappropriate content dataset.
https://arxiv.org/abs/2501.09722
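A minimal sketch of an attention-based bidirectional GRU classifier of the kind described above; the layer sizes and the simple additive attention scorer are assumptions, not the authors' exact BiGRU-A configuration.

```python
import torch
import torch.nn as nn

class BiGRUAttention(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=64, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # trained from scratch
        self.gru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)              # per-timestep attention score
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                   # (batch, seq_len)
        h, _ = self.gru(self.embed(token_ids))      # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)
        context = (weights * h).sum(dim=1)          # weighted sum captures long-range cues
        return self.fc(context)

logits = BiGRUAttention(vocab_size=30000)(torch.randint(0, 30000, (4, 50)))
```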
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues: it learns directly from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, the different data-construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data is on-policy with respect to the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws of existing algorithms that employ DPO to address hallucination issues. To alleviate these problems, we propose the On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B compared to the previous SOTA algorithm trained with 16k samples: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark.
https://arxiv.org/abs/2501.09695
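For reference, a minimal sketch of the standard DPO objective that OPA-DPO builds on; the on-policy data construction and expert-revision steps are the paper's contribution and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Inputs are sequence-level log-probs under the policy and the frozen reference."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the margin between the preferred (less hallucinated) and rejected responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.1]),
                torch.tensor([-12.9]), torch.tensor([-14.8]))
```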
This tutorial provides an in-depth guide on inference-time guidance and alignment methods for optimizing downstream reward functions in diffusion models. While diffusion models are renowned for their generative modeling capabilities, practical applications in fields such as biology often require sample generation that maximizes specific metrics (e.g., stability, affinity in proteins, closeness to target structures). In these scenarios, diffusion models can be adapted not only to generate realistic samples but also to explicitly maximize desired measures at inference time without fine-tuning. This tutorial explores the foundational aspects of such inference-time algorithms. We review these methods from a unified perspective, demonstrating that current techniques -- such as Sequential Monte Carlo (SMC)-based guidance, value-based sampling, and classifier guidance -- aim to approximate soft optimal denoising processes (a.k.a. policies in RL) that combine pre-trained denoising processes with value functions serving as look-ahead functions that predict from intermediate states to terminal rewards. Within this framework, we present several novel algorithms not yet covered in the literature. Furthermore, we discuss (1) fine-tuning methods combined with inference-time techniques, (2) inference-time algorithms based on search algorithms such as Monte Carlo tree search, which have received limited attention in current research, and (3) connections between inference-time algorithms in language models and diffusion models. The code of this tutorial on protein design is available at this https URL
https://arxiv.org/abs/2501.09685
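A minimal sketch of the SMC-style inference-time guidance the tutorial above unifies: particles are denoised by a pre-trained model and resampled in proportion to exp(value/temperature), where the value function is a look-ahead predictor of terminal reward from an intermediate state. `denoise_step` and `value_fn` are placeholders for a real diffusion model and reward predictor.

```python
import torch

def smc_guidance(x_T, denoise_step, value_fn, num_steps, temperature=1.0):
    particles = x_T                                   # (num_particles, ...)
    for t in reversed(range(num_steps)):
        particles = denoise_step(particles, t)        # one reverse-diffusion step
        logw = value_fn(particles, t) / temperature   # look-ahead value as log-weight
        idx = torch.multinomial(torch.softmax(logw, dim=0),
                                particles.shape[0], replacement=True)
        particles = particles[idx]                    # resample toward high reward
    return particles
```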
Face recognition technology has dramatically transformed the landscape of security, surveillance, and authentication systems, offering a user-friendly and non-invasive biometric solution. However, despite its significant advantages, face recognition systems face increasing threats from physical and digital spoofing attacks. Current research typically treats face recognition and attack detection as distinct classification challenges. This approach necessitates the implementation of separate models for each task, leading to considerable computational complexity, particularly on devices with limited resources. Such inefficiencies can stifle scalability and hinder performance. In response to these challenges, this paper introduces an innovative unified model designed for face recognition and the detection of physical and digital attacks. By leveraging an advanced Swin Transformer backbone and incorporating HiLo attention in a convolutional neural network framework, we address unified face recognition and spoof attack detection more effectively. Moreover, we introduce augmentation techniques that replicate the traits of physical and digital spoofing cues, significantly enhancing our model's robustness. Through comprehensive experimental evaluation across various datasets, we showcase the effectiveness of our model in unified face recognition and spoof detection. Additionally, we confirm its resilience against unseen physical and digital spoofing attacks, underscoring its potential for real-world applications.
https://arxiv.org/abs/2501.09635
The success of VLMs often relies on dynamic high-resolution schemes that adaptively augment the input image into multiple crops so that image details can be retained. However, such approaches produce a large number of redundant visual tokens, significantly reducing the efficiency of VLMs. To improve VLM efficiency without introducing extra training costs, many works propose to reduce the visual tokens by filtering out uninformative ones or aggregating their information. Some approaches reduce the visual tokens according to the self-attention of VLMs, which is biased and leads to inaccurate responses. Token-reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are non-salient in the image. In this work, we first conduct experiments showing that the original text embeddings are aligned with the visual tokens, without bias toward the tailed visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically leverages the effectiveness of visual saliency and text-to-image similarity in the pre-LLM layers to select the informative visual tokens. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration performance, especially when the reduction rate is sufficiently large.
https://arxiv.org/abs/2501.09532
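A minimal sketch of selecting informative visual tokens by mixing visual saliency with text-to-image similarity, as the abstract above describes; the fixed mixing weight `alpha` and the norm-based saliency proxy are assumptions, whereas the paper mixes the two cues self-adaptively.

```python
import torch
import torch.nn.functional as F

def select_tokens(visual_tokens, text_embeds, keep_ratio=0.25, alpha=0.5):
    # visual_tokens: (n_vis, d); text_embeds: (n_txt, d), assumed in a shared space.
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    text_score = (v @ t.T).max(dim=-1).values        # relevance to the question tokens
    saliency = visual_tokens.norm(dim=-1)            # crude visual-saliency proxy
    score = alpha * saliency / saliency.max() + (1 - alpha) * text_score
    keep = torch.topk(score, k=max(1, int(keep_ratio * len(score)))).indices
    return visual_tokens[keep.sort().values]         # preserve original token order

kept = select_tokens(torch.randn(576, 1024), torch.randn(12, 1024))
```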
Online medical consultation (OMC) restricts doctors to gathering patient information solely through inquiries, making the already complex sequential decision-making process of diagnosis even more challenging. Recently, the rapid advancement of large language models has demonstrated a significant potential to transform OMC. However, most studies have primarily focused on improving diagnostic accuracy under conditions of relatively sufficient information, while paying limited attention to the "inquiry" phase of the consultation process. This lack of focus has left the relationship between "inquiry" and "diagnosis" insufficiently explored. In this paper, we first extract real patient interaction strategies from authentic doctor-patient conversations and use these strategies to guide the training of a patient simulator that closely mirrors real-world behavior. By inputting medical records into our patient simulator to simulate patient responses, we conduct extensive experiments to explore the relationship between "inquiry" and "diagnosis" in the consultation process. Experimental results demonstrate that inquiry and diagnosis adhere to Liebig's law: poor inquiry quality limits the effectiveness of diagnosis, regardless of diagnostic capability, and vice versa. Furthermore, the experiments reveal significant differences in the inquiry performance of various models. To investigate this phenomenon, we categorize the inquiry process into four types: (1) chief complaint inquiry; (2) specification of known symptoms; (3) inquiry about accompanying symptoms; and (4) gathering family or medical history. We analyze the distribution of inquiries across the four types for different models to explore the reasons behind their significant performance differences. We plan to open-source the weights and related code of our patient simulator at this https URL.
https://arxiv.org/abs/2501.09484
We propose a novel architecture for graph-based dependency parsing that explicitly constructs vectors from which both arcs and labels are scored. Our method addresses key limitations of the standard two-pipeline approach by unifying arc scoring and labeling into a single network, reducing the scalability issues caused by the information bottleneck and the lack of parameter sharing. Additionally, our architecture overcomes limited arc interactions by using transformer layers to efficiently simulate higher-order dependencies. Experiments on PTB and UD show that our model outperforms state-of-the-art parsers in both accuracy and efficiency.
https://arxiv.org/abs/2501.09451
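A minimal sketch of the unified idea in the abstract above: arcs and labels are scored from the same explicitly constructed head/dependent vectors instead of two separate pipelines. All dimensions and the elementwise-interaction scorer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnifiedArcLabelScorer(nn.Module):
    def __init__(self, d_model=256, d_arc=128, n_labels=40):
        super().__init__()
        self.head = nn.Linear(d_model, d_arc)   # shared projections feed both tasks
        self.dep = nn.Linear(d_model, d_arc)
        self.label_out = nn.Linear(d_arc, n_labels)

    def forward(self, h):                        # h: (seq_len, d_model) token encodings
        heads, deps = self.head(h), self.dep(h)
        pair = deps.unsqueeze(1) * heads.unsqueeze(0)  # (seq, seq, d_arc) interactions
        arc_scores = pair.sum(-1)                # score of head j for dependent i
        label_scores = self.label_out(pair)      # labels scored from the same vectors
        return arc_scores, label_scores

arcs, labels = UnifiedArcLabelScorer()(torch.randn(10, 256))
```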
The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed some of these issues, a comprehensive solution remains elusive. In this paper, we introduce \textbf{CaPa}, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, resulting in cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.
https://arxiv.org/abs/2501.09433
As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike 2D plain images that are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs, therefore, requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI super-resolution needs to take into account geometric distortion resulting from ERP. However, without considering such geometric distortion of ERP images, previous deep-learning-based methods only utilize a limited range of pixels and may easily miss self-similar textures for reconstruction. In this paper, we introduce a novel Geometric Distortion Guided Transformer for Omnidirectional image Super-Resolution (GDGT-OSR). Specifically, a distortion modulated rectangle-window self-attention mechanism, integrated with deformable self-attention, is proposed to better perceive the distortion and thus involve more self-similar textures. Distortion modulation is achieved through a newly devised distortion guidance generator that produces guidance by exploiting the variability of distortion across latitudes. Furthermore, we propose a dynamic feature aggregation scheme to adaptively fuse the features from different self-attention modules. We present extensive experimental results on public datasets and show that the new GDGT-OSR outperforms methods in existing literature.
https://arxiv.org/abs/2406.10869
Understanding the reliability of large language models (LLMs) has recently garnered significant attention. Given LLMs' propensity to hallucinate, as well as their high sensitivity to prompt design, it is already challenging to predict the performance of an individual LLM. However, the problem becomes more complex for compound LLM systems such as cascades, where in addition to each model's standalone performance, we must understand how the error rates of different models interact. In this paper, we present a probabilistic model for the joint performance distribution of a sequence of LLMs, which enables a framework for rationally tuning the confidence thresholds of an LLM cascade using continuous optimization. Compared to selecting confidence thresholds using grid search, our parametric Markov-copula model significantly improves runtime scaling with respect to the length of the cascade and the desired resolution of the cost-error curve, turning both from intractable to low-order polynomial. In addition, the optimal thresholds computed using our continuous optimization-based algorithm increasingly outperform those found via grid search as cascade length grows, improving the area under the cost-error curve by 1.9% on average for cascades consisting of at least three models. Overall, our Markov-copula model provides a rational basis for tuning LLM cascade performance and points to the potential of probabilistic methods in analyzing LLM systems.
https://arxiv.org/abs/2501.09345
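A minimal sketch of tuning cascade confidence thresholds by continuous optimization, as the abstract above proposes; `expected_cost_error` is a toy differentiable surrogate standing in for the paper's fitted Markov-copula model of the joint performance distribution.

```python
import torch

def expected_cost_error(thresholds, lam=0.5):
    # Placeholder surrogate: higher thresholds defer more queries to later, costlier
    # models but lower the error term; a fitted Markov-copula model would supply the
    # true expected cost and error here.
    defer = thresholds                                 # deferral probability per stage
    cost = 1.0 + torch.cumprod(defer, dim=0).sum()     # expected number of models invoked
    error = (1 - defer).prod()                         # toy error term, falls with deferral
    return lam * cost + (1 - lam) * error

raw = torch.zeros(3, requires_grad=True)               # one unconstrained param per stage
opt = torch.optim.Adam([raw], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = expected_cost_error(torch.sigmoid(raw))     # keep thresholds in (0, 1)
    loss.backward()
    opt.step()
thresholds = torch.sigmoid(raw).detach()
```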
Video synthetic aperture radar (ViSAR) has attracted substantial attention in the moving target detection (MTD) field due to its ability to continuously monitor changes in the target area. In ViSAR, the shadows of moving targets do not shift or defocus, which is widely used as a feature for MTD. However, the shadows are difficult to distinguish from low-scattering regions in the background, causing more missed detections and false alarms. It is therefore worth investigating how to enhance the distinction between shadows and background. In this study, we propose the Shadow Enhancement and Background Suppression for ViSAR (SE-BSFV) algorithm. SE-BSFV is based on low-rank representation (LRR) theory and adopts an online subspace learning technique to enhance shadows and suppress background in ViSAR images. First, we register the ViSAR images with a registration algorithm and model the ViSAR data with a Gaussian mixture distribution (GMD). Second, the knowledge learned from previous frames is leveraged to estimate the GMD parameters of the current frame, and the expectation-maximization (EM) algorithm is used to estimate the subspace parameters; the foreground matrix of the current frame can then be obtained. Finally, the alternating direction method of multipliers (ADMM) is used to eliminate strong scattering objects in the foreground matrix to obtain the final results. The experimental results indicate that SE-BSFV significantly enhances the saliency of shadows and greatly improves detection performance while remaining efficient compared with several other advanced pre-processing algorithms.
https://arxiv.org/abs/2501.09341
We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as different bird species or dog breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities in extracting localized, informative features. However, saliency maps like Grad-CAM can hardly point out the traits: they often locate the whole object with a blurred, coarse heatmap, not the traits. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to address this. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., the traits. As such, the true class's multi-head attention maps reveal traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch, obtained by simply modifying the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, in sharp contrast to other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate Prompt-CAM's superior interpretation capability.
https://arxiv.org/abs/2501.09333
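A minimal sketch of the Prompt-CAM idea from the abstract above: prepend one learnable prompt per class to a frozen ViT, classify from the prompt outputs, and read traits off the true-class prompt's attention over image patches. The ViT interface is simplified to a callable over token sequences (an `nn.Identity` stands in below); the scoring head is an assumption.

```python
import torch
import torch.nn as nn

class PromptCAM(nn.Module):
    def __init__(self, vit, n_classes, d_model=768):
        super().__init__()
        self.vit = vit                                      # frozen pre-trained ViT blocks
        self.class_prompts = nn.Parameter(torch.randn(n_classes, d_model) * 0.02)
        self.score = nn.Linear(d_model, 1)                  # per-prompt logit head

    def forward(self, patch_tokens):                        # (batch, n_patches, d_model)
        b = patch_tokens.shape[0]
        prompts = self.class_prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, patch_tokens], dim=1)       # [class prompts; patches]
        out = self.vit(x)
        # One logit per class prompt; the argmax prompt's attention maps localize traits.
        return self.score(out[:, : prompts.shape[1]]).squeeze(-1)

logits = PromptCAM(nn.Identity(), n_classes=5)(torch.randn(2, 196, 768))
```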
Transformer-based encoder-decoder models have achieved remarkable success in image-to-image transfer tasks, particularly in image restoration. However, their high computational complexity (manifested in elevated FLOPs and parameter counts) limits their application in real-world scenarios. Existing knowledge distillation methods in image restoration typically employ lightweight student models that directly mimic the intermediate features and reconstruction results of the teacher, overlooking the implicit attention relationships between them. To address this, we propose a Soft Knowledge Distillation (SKD) strategy that incorporates a Multi-dimensional Cross-net Attention (MCA) mechanism for compressing image restoration models. This mechanism facilitates interaction between the student and teacher across both channel and spatial dimensions, enabling the student to implicitly learn the attention matrices. Additionally, we employ a Gaussian kernel function to measure the distance between student and teacher features in kernel space, ensuring stable and efficient feature learning. To further enhance the quality of reconstructed images, we replace the commonly used L1 or KL-divergence loss with a contrastive learning loss at the image level. Experiments on three tasks (image deraining, deblurring, and denoising) demonstrate that our SKD strategy significantly reduces computational complexity while maintaining strong image restoration capabilities.
https://arxiv.org/abs/2501.09321
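A minimal sketch of the Gaussian-kernel feature distance mentioned above, which compares student and teacher features in kernel space rather than by direct L1/L2 mimicry; the bandwidth `sigma` is an assumed hyperparameter.

```python
import torch

def gaussian_kernel_distance(student_feat, teacher_feat, sigma=1.0):
    s = student_feat.flatten(1)                       # (batch, dim)
    t = teacher_feat.flatten(1)
    sq_dist = (s - t).pow(2).sum(dim=1)
    # k(s, t) = exp(-||s - t||^2 / (2 sigma^2)); distance = 1 - k, bounded in [0, 1)
    return (1.0 - torch.exp(-sq_dist / (2 * sigma ** 2))).mean()

loss = gaussian_kernel_distance(torch.randn(8, 64, 32, 32), torch.randn(8, 64, 32, 32))
```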
Automated audio captioning is a task that generates textual descriptions for audio content, and recent studies have explored using visual information to enhance captioning quality. However, current methods often fail to effectively fuse audio and visual data, missing important semantic cues from each modality. To address this, we introduce LAVCap, a large language model (LLM)-based audio-visual captioning framework that effectively integrates visual information with audio to improve audio captioning performance. LAVCap employs an optimal transport-based alignment loss to bridge the modality gap between audio and visual features, enabling more effective semantic extraction. Additionally, we propose an optimal transport attention module that enhances audio-visual fusion using an optimal transport assignment map. Experimental results demonstrate that, combined with the optimal training strategy, each component of our framework is effective. LAVCap outperforms existing state-of-the-art methods on the AudioCaps dataset, without relying on large datasets or post-processing. Code is available at this https URL.
https://arxiv.org/abs/2501.09291
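A minimal sketch of an optimal-transport alignment loss between audio and visual token features via log-domain Sinkhorn iterations, in the spirit of the abstract above; the cosine cost, entropic regularizer `eps`, iteration count, and uniform marginals are all assumptions.

```python
import math
import torch
import torch.nn.functional as F

def sinkhorn_ot_loss(audio, visual, eps=0.05, iters=50):
    # audio: (n_a, d), visual: (n_v, d); cost = 1 - cosine similarity
    a = F.normalize(audio, dim=-1)
    v = F.normalize(visual, dim=-1)
    cost = 1 - a @ v.T
    log_K = -cost / eps
    log_u = torch.zeros(a.shape[0])
    log_v = torch.zeros(v.shape[0])
    for _ in range(iters):   # alternating normalization in log space, uniform marginals
        log_u = -math.log(a.shape[0]) - torch.logsumexp(log_K + log_v, dim=1)
        log_v = -math.log(v.shape[0]) - torch.logsumexp(log_K.T + log_u, dim=1)
    plan = torch.exp(log_u[:, None] + log_K + log_v[None, :])  # transport assignment map
    return (plan * cost).sum()     # transport cost as the modality-gap penalty

loss = sinkhorn_ot_loss(torch.randn(32, 256), torch.randn(32, 256))
```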
Vision Transformers (ViTs) are increasingly being adopted in various sensitive vision applications, such as medical diagnosis and facial recognition. To improve the interpretability of such models, many approaches attempt to forward-align them with carefully annotated abstract, human-understandable semantic entities, i.e., concepts. Concepts provide global rationales for model predictions and can be quickly understood and intervened on by domain experts. Most current research focuses on designing model-agnostic, plug-and-play generic concept-based explainability modules that do not incorporate the inner workings of foundation models (e.g., inductive biases, scale invariance, etc.) during training. To alleviate this issue for ViTs, in this paper we propose a novel Concept Representation Alignment Module (CRAM) that learns both scale- and position-aware representations from multi-scale feature pyramids and patch representations, respectively. CRAM further aligns these representations with concept annotations through an attention matrix. The proposed CRAM module improves the predictive performance of ViT architectures and also provides accurate and robust concept explanations, as demonstrated on five datasets, including three widely used benchmarks (CUB, Pascal APY, Concept-MNIST) and two real-world datasets (AWA2, KITS).
https://arxiv.org/abs/2501.09221
Short text classification has gained significant attention in the information age due to its prevalence and real-world applications. Recent advancements in graph learning combined with contrastive learning have shown promising results in addressing the challenges of semantic sparsity and limited labeled data in short text classification. However, existing models have certain limitations. They rely on explicit data augmentation techniques to generate contrastive views, resulting in semantic corruption and noise. Additionally, these models only focus on learning the intrinsic consistency between the generated views, neglecting valuable discriminative information from other potential views. To address these issues, we propose a Simple graph contrastive learning framework for Short Text Classification (SimSTC). Our approach involves performing graph learning on multiple text-related component graphs to obtain multi-view text embeddings. Subsequently, we directly apply contrastive learning on these embeddings. Notably, our method eliminates the need for data augmentation operations to generate contrastive views while still leveraging the benefits of multi-view contrastive learning. Despite its simplicity, our model achieves outstanding performance, surpassing large language models on various datasets.
https://arxiv.org/abs/2501.09219
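A minimal sketch of multi-view contrastive learning over text embeddings obtained from different component graphs, with no data augmentation: the two views of the same text form a positive pair under a standard InfoNCE loss. The two-view setup and temperature are assumptions; the paper's graph-learning stage that produces the views is not shown.

```python
import torch
import torch.nn.functional as F

def multiview_infonce(view_a, view_b, tau=0.1):
    # view_a, view_b: (batch, d) embeddings of the same texts from two component graphs
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.T / tau                       # (batch, batch) similarity matrix
    targets = torch.arange(a.shape[0])           # matching indices are the positives
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = multiview_infonce(torch.randn(16, 128), torch.randn(16, 128))
```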
Prostate cancer (PCa) is the most prevalent cancer among men in the United States, accounting for nearly 300,000 cases (29% of all diagnoses) and 35,000 deaths in 2024. Traditional screening methods such as prostate-specific antigen (PSA) testing and magnetic resonance imaging (MRI) have been pivotal in diagnosis, but have faced limitations in specificity and generalizability. In this paper, we explore the potential of enhancing PCa lesion segmentation using a novel MRI modality called synthetic correlated diffusion imaging (CDI$^s$). We employ several state-of-the-art deep learning models, including U-Net, SegResNet, Swin UNETR, Attention U-Net, and LightM-UNet, to segment PCa lesions from a 200-patient CDI$^s$ cohort. We find that SegResNet achieved superior segmentation performance with a Dice-Sorensen coefficient (DSC) of $76.68 \pm 0.8$. Notably, the Attention U-Net, while slightly less accurate (DSC $74.82 \pm 2.0$), offered a favorable balance between accuracy and computational efficiency. Our findings demonstrate the potential of deep learning models to improve PCa lesion segmentation using CDI$^s$ and to enhance PCa management and clinical support.
https://arxiv.org/abs/2501.09185
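For reference, a minimal sketch of the Dice-Sorensen coefficient used to evaluate the lesion segmentations above, computed on binary masks with a small smoothing term (an assumed implementation detail).

```python
import torch

def dice_coefficient(pred_mask, true_mask, smooth=1e-6):
    pred = pred_mask.float().flatten()
    true = true_mask.float().flatten()
    intersection = (pred * true).sum()
    # DSC = 2|A ∩ B| / (|A| + |B|); smoothing avoids division by zero on empty masks
    return (2 * intersection + smooth) / (pred.sum() + true.sum() + smooth)

dsc = dice_coefficient(torch.randint(0, 2, (1, 256, 256)),
                       torch.randint(0, 2, (1, 256, 256)))
```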
This work introduces a novel Retention Layer mechanism for Transformer-based architectures, addressing their inherent lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic templates, Generative Pretrained Transformers rely solely on fixed pretrained weights and ephemeral context windows, limiting their adaptability. The proposed Retention Layer incorporates a persistent memory module capable of real-time data population, dynamic recall, and guided output generation. This enhancement allows models to store, update, and reuse observed patterns across sessions, enabling incremental learning and bridging the gap between static pretraining and dynamic, context-sensitive adaptation. The Retention Layer design parallels social learning processes, encompassing attention, retention, reproduction, and motivation stages. Technically, it integrates a memory attention mechanism and episodic buffers to manage memory scalability, mitigate overfitting, and ensure efficient recall. Applications span adaptive personal assistants, real-time fraud detection, autonomous robotics, content moderation, and healthcare diagnostics. In each domain, the retention mechanism enables systems to learn incrementally, personalize outputs, and respond effectively to evolving real-world challenges. By emulating key aspects of human learning, this retention-enhanced architecture fosters a more fluid and responsive AI paradigm, paving the way for dynamic, session-aware models that extend the capabilities of traditional Transformers into domains requiring continual adaptation.
https://arxiv.org/abs/2501.09166
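A minimal sketch of a persistent memory consulted by attention, in the spirit of the Retention Layer above: keys and values survive across calls (sessions) and new patterns can be written in at inference time. The FIFO capacity handling and scaled dot-product read are simplifying assumptions.

```python
import torch
import torch.nn as nn

class RetentionMemory(nn.Module):
    def __init__(self, d_model=256, capacity=1024):
        super().__init__()
        self.capacity = capacity
        self.register_buffer("keys", torch.empty(0, d_model))
        self.register_buffer("values", torch.empty(0, d_model))

    @torch.no_grad()
    def write(self, k, v):                      # store observed patterns persistently
        self.keys = torch.cat([self.keys, k])[-self.capacity:]
        self.values = torch.cat([self.values, v])[-self.capacity:]

    def read(self, query):                      # (batch, d_model)
        if self.keys.shape[0] == 0:
            return torch.zeros_like(query)
        attn = torch.softmax(query @ self.keys.T / query.shape[-1] ** 0.5, dim=-1)
        return attn @ self.values               # memory-conditioned recall signal

mem = RetentionMemory()
mem.write(torch.randn(10, 256), torch.randn(10, 256))
out = mem.read(torch.randn(4, 256))
```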
The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
https://arxiv.org/abs/2501.09019
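A minimal sketch of the FIFO denoising queue that Ouroboros-Diffusion builds on: frames sit in a queue at progressively increasing noise levels; each step denoises every frame by one level, dequeues a clean frame at the head, and enqueues Gaussian noise at the tail. The `denoise_one_level` callable is a placeholder for a real video diffusion model; the paper's tail latent sampling, SACFA, and self-recurrent guidance are not shown.

```python
import torch
from collections import deque

def fifo_generate(denoise_one_level, frame_shape, queue_len=16, total_frames=64):
    queue = deque(torch.randn(frame_shape) for _ in range(queue_len))
    outputs = []
    for _ in range(total_frames):
        # Frame at position i carries noise level i; denoise each by one level.
        queue = deque(denoise_one_level(frame, level)
                      for level, frame in enumerate(queue))
        outputs.append(queue.popleft())          # head frame is now fully denoised
        queue.append(torch.randn(frame_shape))   # fresh Gaussian noise at the tail
    return torch.stack(outputs)

video = fifo_generate(lambda f, t: f * 0.95, (4, 32, 32))  # stand-in denoiser
```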