Identification of suspects based on partial and smudged fingerprints, commonly referred to as fingermarks or latent fingerprints, presents a significant challenge in the field of fingerprint recognition. Although fixed-length embeddings have shown effectiveness in recognising rolled and slap fingerprints, the methods for matching latent fingerprints have primarily centred around local minutiae-based embeddings, failing to fully exploit global representations for matching purposes. Consequently, enhancing latent fingerprints becomes critical to ensuring robust identification for forensic investigations. Current approaches often prioritise restoring ridge patterns, overlooking the fine-grained details crucial for accurate fingerprint recognition. To address this, we propose a novel approach that uses generative adversarial networks (GANs) to redefine Latent Fingerprint Enhancement (LFE) through a structured approach to fingerprint generation. By directly optimising the minutiae information during the generation process, the model produces enhanced latent fingerprints that exhibit exceptional fidelity to ground-truth instances. This leads to a significant improvement in identification performance. Our framework integrates minutiae locations and orientation fields, ensuring the preservation of both local and structural fingerprint features. Extensive evaluations conducted on two publicly available datasets demonstrate that our method outperforms existing state-of-the-art techniques, highlighting its potential to significantly enhance latent fingerprint recognition accuracy in forensic applications.
Identifying suspects from partial and smudged fingerprints, commonly referred to as fingermarks or latent fingerprints, poses a major challenge in fingerprint recognition. Although fixed-length embeddings are effective for recognising rolled and slap fingerprints, latent-fingerprint matching has mainly relied on local minutiae-based embeddings and has not fully exploited global representations. Enhancing latent fingerprints is therefore critical for robust identification in forensic investigations. Existing methods typically prioritise restoring ridge patterns while overlooking the fine-grained details required for accurate fingerprint recognition. To address this, we propose a new approach that uses generative adversarial networks (GANs) to redefine Latent Fingerprint Enhancement (LFE) through a structured approach to fingerprint generation. By directly optimising minutiae information during generation, the model produces enhanced latent fingerprints that are highly faithful to the ground truth, leading to a significant improvement in identification performance. Extensive evaluations on two publicly available datasets show that our method outperforms existing state-of-the-art techniques, highlighting its potential to substantially improve latent fingerprint recognition accuracy in forensic applications.
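Since the abstract centres on directly optimising minutiae information during generation, a minimal PyTorch sketch of how such a minutiae-aware GAN objective could be assembled is shown below; the loss weights and the frozen `minutiae_head`/`orientation_head` extractors are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def generator_loss(enhanced, target, fake_logits,
                   minutiae_head, orientation_head,
                   w_adv=1.0, w_min=10.0, w_ori=5.0):
    """Hypothetical composite objective: an adversarial term plus penalties that
    tie the enhanced print's minutiae map and orientation field to ground truth."""
    adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))          # fool the discriminator
    min_loss = F.l1_loss(minutiae_head(enhanced), minutiae_head(target))
    ori_loss = F.mse_loss(orientation_head(enhanced), orientation_head(target))
    return w_adv * adv + w_min * min_loss + w_ori * ori_loss
```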
https://arxiv.org/abs/2409.11802
Speech enhancement aims to improve speech quality and intelligibility in noisy environments. Recent advancements have concentrated on deep neural networks, particularly employing the Two-Stage (TS) architecture to enhance feature extraction. However, the complexity and size of these models remain significant, which limits their applicability in resource-constrained scenarios. Designing models suitable for edge devices presents its own set of challenges. Narrow lightweight models often encounter performance bottlenecks due to uneven loss landscapes. Additionally, advanced operators such as Transformers or Mamba may lack the practical adaptability and efficiency that convolutional neural networks (CNNs) offer in real-world deployments. To address these challenges, we propose Dense-TSNet, an innovative ultra-lightweight speech enhancement network. Our approach employs a novel Dense Two-Stage (Dense-TS) architecture, which, compared to the classic Two-Stage architecture, ensures more robust refinement of the objective function in the later training stages. This leads to improved final performance, addressing the early convergence limitations of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which enhances feature extraction by incorporating global, channel, and local perspectives through convolutional neural networks (CNNs). Furthermore, we discuss how the choice of loss function impacts perceptual quality. Dense-TSNet demonstrates promising performance with a compact model size of around 14K parameters, making it particularly well-suited for deployment in resource-constrained environments.
To improve speech quality and intelligibility in noisy environments, we propose Dense-TSNet, an innovative ultra-lightweight speech enhancement network. Compared with the classic Two-Stage architecture, our novel Dense Two-Stage (Dense-TS) architecture ensures more robust refinement of the objective function in the later training stages, improving final performance and addressing the early-convergence limitation of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which strengthens feature extraction using convolutional neural networks (CNNs). Furthermore, we discuss how the choice of loss function affects perceptual quality. With a compact size of around 14K parameters, Dense-TSNet delivers promising performance and is particularly well suited to deployment in resource-constrained environments.
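A rough sketch of the global/channel/local idea behind the Multi-View Gaze Block, built only from plain CNN operators; the branch designs and channel sizes are assumptions rather than the published architecture.

```python
import torch.nn as nn

class MultiViewGazeBlock(nn.Module):
    """Illustrative fusion of three 'views' of a feature map: a global gating
    branch, a cross-channel branch, and a local depthwise branch."""
    def __init__(self, ch):
        super().__init__()
        self.global_gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                         nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.channel_mix = nn.Conv2d(ch, ch, 1)                       # channel view
        self.local_dw = nn.Conv2d(ch, ch, 3, padding=1, groups=ch)    # local view
        self.fuse = nn.Conv2d(ch, ch, 1)

    def forward(self, x):
        g = x * self.global_gate(x)       # global context as a gating signal
        c = self.channel_mix(x)           # per-pixel channel interaction
        l = self.local_dw(x)              # local spatial detail
        return x + self.fuse(g + c + l)   # residual fusion of the three views
```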
https://arxiv.org/abs/2409.11725
Hybrid recommender systems, combining item IDs and textual descriptions, offer potential for improved accuracy. However, previous work has largely focused on smaller datasets and model architectures. This paper introduces Flare (Fusing Language models and collaborative Architectures for Recommender Enhancement), a novel hybrid recommender that integrates a language model (mT5) with a collaborative filtering model (Bert4Rec) using a Perceiver network. This architecture allows Flare to effectively combine collaborative and content information for enhanced recommendations. We conduct a two-stage evaluation, first assessing Flare's performance against established baselines on smaller datasets, where it demonstrates competitive accuracy. Subsequently, we evaluate Flare on a larger, more realistic dataset with a significantly larger item vocabulary, introducing new baselines for this setting. Finally, we showcase Flare's inherent ability to support critiquing, enabling users to provide feedback and refine recommendations. We further leverage critiquing as an evaluation method to assess the model's language understanding and its transferability to the recommendation task.
Hybrid recommender systems that combine item IDs with textual descriptions have the potential to improve accuracy. However, previous work has largely focused on smaller datasets and model architectures. This paper introduces Flare (Fusing Language models and collaborative Architectures for Recommender Enhancement), a novel hybrid recommender that integrates a language model (mT5) with a collaborative filtering model (Bert4Rec) through a Perceiver network. This architecture lets Flare effectively combine collaborative and content information for improved recommendations. We perform a two-stage evaluation: first assessing Flare against established baselines on smaller datasets, where it shows competitive accuracy, and then evaluating it on a larger, more realistic dataset with a significantly larger item vocabulary, for which we introduce new baselines. Finally, we showcase Flare's inherent ability to support critiquing, enabling users to provide feedback and refine recommendations, and we further use critiquing as an evaluation method to assess the model's language understanding and its transferability to the recommendation task.
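A minimal sketch of the fusion step, assuming a Perceiver-style module in which a small set of learned latents cross-attends to the concatenated mT5 text embeddings and Bert4Rec item embeddings; dimensions and layer counts are placeholders, not Flare's configuration.

```python
import torch
import torch.nn as nn

class PerceiverFusion(nn.Module):
    """Learned latents attend to the joint sequence of text and item-ID tokens."""
    def __init__(self, d_model=256, n_latents=16, n_heads=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, text_emb, id_emb):
        # text_emb: (B, T_text, d) from mT5; id_emb: (B, T_items, d) from Bert4Rec
        tokens = torch.cat([text_emb, id_emb], dim=1)
        queries = self.latents.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(queries, tokens, tokens)
        return fused + self.ff(fused)          # (B, n_latents, d) fused summary
```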
https://arxiv.org/abs/2409.11699
The integration of tools into LLM-based agents overcame the difficulties of standalone LLMs and the limited capabilities of traditional agents. However, the conjunction of these technologies and the enhancements proposed in several state-of-the-art works followed a non-unified software architecture, resulting in a lack of modularity. Indeed, they focused mainly on functionalities and overlooked the definition of the component boundaries within the agent. This caused terminological and architectural ambiguities among researchers, which we address in this paper by proposing a unified framework that establishes a clear foundation for LLM-based agent development from both functional and software-architectural perspectives. Our framework, LLM-Agent-UMF (LLM-based Agent Unified Modeling Framework), clearly distinguishes between the different components of an agent, setting LLMs and tools apart from a newly introduced element, the core-agent, which plays the role of the agent's central coordinator and comprises five modules: planning, memory, profile, action, and security, the last of which is often neglected in previous works. Differences in the internal structure of core-agents led us to classify them into a taxonomy of passive and active types. Based on this, we proposed different multi-core agent architectures combining the unique characteristics of various individual agents. For evaluation purposes, we applied this framework to a selection of state-of-the-art agents, thereby demonstrating its alignment with their functionalities and clarifying the overlooked architectural aspects. Moreover, we thoroughly assessed four of our proposed architectures by integrating distinctive agents into hybrid active/passive core-agent systems. This analysis provided clear insights into potential improvements and highlighted the challenges involved in combining specific agents.
The integration of tools into LLM-based agents overcomes the limitations of standalone LLMs and traditional agents. However, the combination of these technologies and the improvements proposed in several state-of-the-art works followed non-unified software architectures, resulting in a lack of modularity. Indeed, they focused mainly on functionality and overlooked the definition of component boundaries within the agent, which caused terminological and architectural disagreements among researchers. We address this by proposing a unified framework that establishes a clear foundation for developing LLM-based agents from both functional and software-architectural perspectives. Our framework, LLM-Agent-UMF (LLM-based Agent Unified Modeling Framework), clearly distinguishes the different components of an agent, setting LLMs and tools apart from a newly introduced element, the core-agent, which acts as the agent's central coordinator and comprises five modules: planning, memory, profile, action, and security, the last of which is often neglected in previous work. Because core-agents differ in internal structure, we classify them into passive and active types. On this basis, we propose different multi-core agent architectures that combine the unique characteristics of individual agents. For evaluation, we applied the framework to a selection of state-of-the-art agents, demonstrating its alignment with their functionality and clarifying the overlooked architectural aspects. We also thoroughly assessed four of the proposed architectures by integrating distinctive agents into hybrid active/passive core-agent systems. This analysis provided clear insights into potential improvements and highlighted the challenges involved in combining specific agents.
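A toy skeleton of the core-agent concept, with the five modules reduced to stubs and the active/passive distinction noted in comments; the interfaces are illustrative only and do not follow the framework's formal specification.

```python
from dataclasses import dataclass, field

@dataclass
class CoreAgent:
    """Central coordinator sketched with planning, memory, profile, action,
    and security modules (all stubs)."""
    profile: dict = field(default_factory=dict)   # profile module: role/persona settings
    memory: list = field(default_factory=list)    # memory module: interaction history

    def plan(self, goal: str) -> list:            # planning module stub
        return [f"single step for: {goal}"]

    def is_safe(self, step: str) -> bool:         # security module stub
        return "rm -rf" not in step

    def act(self, step: str, llm) -> str:         # action module stub (LLM / tool call)
        return llm(step)

    def run(self, goal: str, llm) -> list:
        # An *active* core-agent drives the plan-check-act loop itself;
        # a *passive* one would only expose these methods to an external caller.
        outputs = []
        for step in self.plan(goal):
            if self.is_safe(step):
                result = self.act(step, llm)
                self.memory.append((step, result))
                outputs.append(result)
        return outputs
```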
https://arxiv.org/abs/2409.11393
Ultrasound imaging, despite its widespread use in medicine, often suffers from various sources of noise and artifacts that impact the signal-to-noise ratio and overall image quality. Enhancing ultrasound images requires a delicate balance between contrast, resolution, and speckle preservation. This paper introduces a novel approach that integrates adaptive beamforming with denoising diffusion-based variance imaging to address this challenge. By applying Eigenspace-Based Minimum Variance (EBMV) beamforming and employing a denoising diffusion model fine-tuned on ultrasound data, our method computes the variance across multiple diffusion-denoised samples to produce high-quality despeckled images. This approach leverages both the inherent multiplicative noise of ultrasound and the stochastic nature of diffusion models. Experimental results on a publicly available dataset demonstrate the effectiveness of our method in achieving superior image reconstructions from single plane-wave acquisitions. The code is available at: this https URL.
Ultrasound imaging, despite its wide use in medicine, often suffers from various sources of noise and artifacts that degrade the signal-to-noise ratio and overall image quality. Enhancing ultrasound images requires a delicate balance between contrast, resolution, and speckle preservation. This paper introduces a novel approach that combines adaptive beamforming with denoising diffusion-based variance imaging to address this challenge. By applying Eigenspace-Based Minimum Variance (EBMV) beamforming and a denoising diffusion model fine-tuned on ultrasound data, our method computes the variance across multiple diffusion-denoised samples to produce high-quality despeckled images. This approach exploits both the inherent multiplicative noise of ultrasound and the stochastic nature of diffusion models. Experimental results on a publicly available dataset demonstrate superior image reconstructions from single plane-wave acquisitions. The code is available at: this https URL.
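The variance-imaging step can be sketched in a few lines, assuming the fine-tuned diffusion denoiser is available as a callable; this illustrates the idea rather than reproducing the released code.

```python
import torch

@torch.no_grad()
def variance_despeckle(beamformed_img, denoiser, n_samples=16):
    """Run the stochastic denoiser several times on one EBMV-beamformed frame
    and use the per-pixel variance across reconstructions as the output image."""
    samples = torch.stack([denoiser(beamformed_img) for _ in range(n_samples)])
    var = samples.var(dim=0)                 # per-pixel variance across runs
    return var / (var.max() + 1e-8)          # normalised despeckled image
```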
https://arxiv.org/abs/2409.11380
Video diffusion models have shown great potential in generating high-quality videos, making them an increasingly popular focus. However, their inherent iterative nature leads to substantial computational and time costs. While efforts have been made to accelerate video diffusion by reducing inference steps (through techniques like consistency distillation) and by GAN training, these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design, which eliminates the need for decoding the video latents and improves the final performance. Our model is capable of producing high-quality videos in merely one step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency distillation based method, AnimateLCM (FVD 184.79), and approaches the 25-step performance of advanced Stable Video Diffusion (FVD 156.94).
Video diffusion models have shown great potential for generating high-quality videos and are attracting increasing attention, but their inherently iterative nature incurs substantial computational and time costs. Although video diffusion has been accelerated by reducing inference steps (through techniques such as consistency distillation) and by GAN training, these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that combines consistency distillation with GAN training to address these challenges. We also propose a novel video discriminator design that removes the need to decode the video latents and improves final performance. Our model can produce high-quality videos in just one step, with the flexibility of multi-step refinement for further gains. Quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods: our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency-distillation-based AnimateLCM (FVD 184.79) and approaches the 25-step performance of the advanced Stable Video Diffusion (FVD 156.94).
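A small sketch of a discriminator that scores video latents directly, so no VAE decoding is needed during adversarial training; the 3D-CNN layers and the non-saturating losses are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentVideoDiscriminator(nn.Module):
    """Scores latent video tensors of shape (B, C, T, H, W) without decoding."""
    def __init__(self, in_ch=4, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, width, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(width, 2 * width, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(2 * width, 1))

    def forward(self, z):
        return self.net(z)

def gan_step(disc, real_latents, fake_latents):
    # Non-saturating GAN losses computed entirely in latent space.
    d_loss = F.softplus(-disc(real_latents)).mean() + F.softplus(disc(fake_latents)).mean()
    g_loss = F.softplus(-disc(fake_latents)).mean()
    return d_loss, g_loss
```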
https://arxiv.org/abs/2409.11367
Weakly supervised video anomaly detection (WS-VAD) is a crucial area in computer vision for developing intelligent surveillance systems. The proposed system uses three feature streams: RGB video, optical flow, and audio signals, where each stream extracts complementary spatial and temporal features using an enhanced attention module to improve detection accuracy and robustness. In the first stream, we employ an attention-based, multi-stage feature enhancement approach to improve spatial and temporal features from the RGB video: the first stage consists of a ViT-based CLIP module, with top-k features concatenated in parallel with rich spatiotemporal features from I3D and Temporal Contextual Aggregation (TCA); the second stage captures temporal dependencies using the Uncertainty-Regulated Dual Memory Units (UR-DMU) model, which learns representations of normal and abnormal data simultaneously; and the third stage selects the most relevant spatiotemporal features. The second stream extracts attention-enhanced spatiotemporal features from the optical-flow modality by combining deep learning with an attention module. The audio stream captures auditory cues using an attention module integrated with the VGGish model, aiming to detect anomalies based on sound patterns. These streams enrich the model by incorporating motion and audio signals that are often indicative of abnormal events undetectable through visual analysis alone. The concatenation-based multimodal fusion leverages the strengths of each modality, resulting in a comprehensive feature set that significantly improves anomaly detection accuracy and robustness across three datasets. Extensive experiments and strong performance on the three benchmark datasets demonstrate the effectiveness of the proposed system over existing state-of-the-art systems.
Weakly supervised video anomaly detection (WS-VAD) is a key area of computer vision for building intelligent surveillance systems. The system uses three feature streams, RGB video, optical flow, and audio, each extracting complementary spatial and temporal features through an enhanced attention module to improve detection accuracy and robustness. In the first stream, an attention-based multi-stage feature enhancement approach improves the spatial and temporal features from the RGB video: the first stage consists of a ViT-based CLIP module whose top-k features are concatenated in parallel with rich spatiotemporal features from I3D and Temporal Contextual Aggregation (TCA); the second stage captures temporal dependencies with the Uncertainty-Regulated Dual Memory Units (UR-DMU) model; and the third stage selects the most relevant spatiotemporal features. The second stream extracts attention-enhanced spatiotemporal features from the optical-flow modality by combining deep learning with an attention module, and the audio stream captures auditory cues with an attention module integrated into the VGGish model to detect anomalies from sound patterns. These streams enrich the model with motion and audio signals that often indicate abnormal events undetectable by visual analysis alone. The concatenation-based multimodal fusion leverages the strengths of each modality, yielding a comprehensive feature set that significantly improves anomaly detection accuracy and robustness across three datasets. Extensive experiments and strong results on the three benchmark datasets demonstrate the effectiveness of the proposed system over existing state-of-the-art systems.
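A minimal sketch of the final concatenation-based fusion and snippet-level anomaly scoring; the per-stream feature sizes and scorer depth are placeholders, not the paper's exact head.

```python
import torch
import torch.nn as nn

class MultimodalAnomalyHead(nn.Module):
    """Concatenate RGB, flow, and audio stream features and regress a score."""
    def __init__(self, d_rgb=512, d_flow=512, d_audio=128):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(d_rgb + d_flow + d_audio, 256), nn.ReLU(),
            nn.Dropout(0.3), nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, f_rgb, f_flow, f_audio):
        fused = torch.cat([f_rgb, f_flow, f_audio], dim=-1)   # (B, T, d_total)
        return self.scorer(fused).squeeze(-1)                 # (B, T) anomaly scores
```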
https://arxiv.org/abs/2409.11223
Digitizing 3D static scenes and 4D dynamic events from multi-view images has long been a challenge in computer vision and graphics. Recently, 3D Gaussian Splatting (3DGS) has emerged as a practical and scalable reconstruction method, gaining popularity due to its impressive reconstruction quality, real-time rendering capabilities, and compatibility with widely used visualization tools. However, the method requires a substantial number of input views to achieve high-quality scene reconstruction, introducing a significant practical bottleneck. This challenge is especially severe in capturing dynamic scenes, where deploying an extensive camera array can be prohibitively costly. In this work, we identify the lack of spatial autocorrelation of splat features as one of the factors contributing to the suboptimal performance of the 3DGS technique in sparse reconstruction settings. To address the issue, we propose an optimization strategy that effectively regularizes splat features by modeling them as the outputs of a corresponding implicit neural field. This results in a consistent enhancement of reconstruction quality across various scenarios. Our approach effectively handles static and dynamic cases, as demonstrated by extensive testing across different setups and scene complexities.
Digitizing 3D static scenes and 4D dynamic events from multi-view images has long been a challenge in computer vision and graphics. Recently, 3D Gaussian Splatting (3DGS) has emerged as a practical and scalable reconstruction method, popular for its impressive reconstruction quality, real-time rendering, and compatibility with widely used visualization tools. However, it requires a large number of input views to achieve high-quality reconstruction, which is a significant practical bottleneck, especially for dynamic scenes where deploying an extensive camera array can be prohibitively costly. In this work, we identify the lack of spatial autocorrelation of splat features as one factor behind the suboptimal performance of 3DGS in sparse reconstruction settings. To address this, we propose an optimization strategy that regularizes splat features by modeling them as the outputs of a corresponding implicit neural field, which consistently improves reconstruction quality across scenarios. Our approach handles both static and dynamic cases, as demonstrated by extensive testing across different setups and scene complexities.
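A compact sketch of the regularisation idea, assuming each splat's features are encouraged to match the output of a small coordinate MLP so that nearby splats stay spatially coherent; the widths and the loss weight are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class SplatFeatureField(nn.Module):
    """Implicit neural field that predicts splat features from splat positions."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, feat_dim))

    def regularization(self, splat_xyz, splat_features, weight=0.1):
        predicted = self.mlp(splat_xyz)                        # field value at each splat
        return weight * F.mse_loss(splat_features, predicted)  # pull splats toward the field
```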
https://arxiv.org/abs/2409.11211
Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.
Traditional speech enhancement methods often oversimplify restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, producing breathing and gasping artifacts that reduce the intelligibility of the reconstructed speech. These models are also computationally demanding, and many are limited to wide-band outputs, restricting their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel latent-diffusion generative model designed to remove multiple distortions and restore recordings to studio quality at a 48kHz sampling rate. We benchmark Hi-ResLDM against state-of-the-art methods built on GAN and Conditional Flow Matching (CFM) components and demonstrate superior regeneration of high-frequency-band detail. Hi-ResLDM not only excels on non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.
https://arxiv.org/abs/2409.11145
Retinal fundus photography is significant in diagnosing and monitoring retinal diseases. However, systemic imperfections and operator/patient-related factors can hinder the acquisition of high-quality retinal images. Previous efforts in retinal image enhancement primarily relied on GANs, which are limited by the trade-off between training stability and output diversity. In contrast, the Schrödinger Bridge (SB) offers a more stable solution by utilizing Optimal Transport (OT) theory to model a stochastic differential equation (SDE) between two arbitrary distributions. This allows SB to effectively transform low-quality retinal images into their high-quality counterparts. In this work, we leverage the SB framework to propose an image-to-image translation pipeline for retinal image enhancement. Additionally, previous methods often fail to capture fine structural details, such as blood vessels. To address this, we enhance our pipeline by introducing Dynamic Snake Convolution, whose tortuous receptive field can better preserve tubular structures. We name the resulting retinal fundus image enhancement framework the Context-aware Unpaired Neural Schrödinger Bridge (CUNSB-RFIE). To the best of our knowledge, this is the first endeavor to use the SB approach for retinal image enhancement. Experimental results on a large-scale dataset demonstrate the advantage of the proposed method compared to several state-of-the-art supervised and unsupervised methods in terms of image quality and performance on downstream tasks. The code is available at \url{this https URL}.
Retinal fundus photography plays an important role in diagnosing and monitoring retinal diseases. However, systemic imperfections and operator- or patient-related factors can hinder the acquisition of high-quality retinal images. Previous retinal image enhancement efforts relied mainly on GANs, which face a trade-off between training stability and output diversity. In contrast, the Schrödinger Bridge (SB) offers a more stable solution by using Optimal Transport (OT) theory to model a stochastic differential equation (SDE) between two arbitrary distributions, allowing SB to effectively transform low-quality retinal images into high-quality counterparts. In this work, we leverage the SB framework to build an image-to-image translation pipeline for retinal image enhancement. Because previous methods often fail to capture fine structural details such as blood vessels, we further enhance the pipeline with Dynamic Snake Convolution, whose tortuous receptive field better preserves tubular structures. We name the resulting retinal fundus image enhancement framework the Context-aware Unpaired Neural Schrödinger Bridge (CUNSB-RFIE). To the best of our knowledge, this is the first attempt to use the SB approach for retinal image enhancement. Experimental results on a large-scale dataset demonstrate the advantage of the proposed method over several state-of-the-art supervised and unsupervised methods in terms of image quality and downstream-task performance. The code is available at \url{this https URL}.
https://arxiv.org/abs/2409.10966
Generative speech enhancement has recently shown promising advancements in improving speech quality in noisy environments. Multiple diffusion-based frameworks exist, each employing distinct training objectives and learning techniques. This paper aims at explaining the differences between these frameworks by focusing our investigation on score-based generative models and the Schrödinger bridge. We conduct a series of comprehensive experiments to compare their performance and highlight differing training behaviors. Furthermore, we propose a novel perceptual loss function tailored for the Schrödinger bridge framework, demonstrating enhanced performance and improved perceptual quality of the enhanced speech signals. All experimental code and pre-trained models are publicly available to facilitate further research and development in this area.
Generative speech enhancement has recently shown promising progress in improving speech quality in noisy environments. Multiple diffusion-based frameworks exist, each with its own training objectives and learning techniques. This paper explains the differences between these frameworks by focusing on score-based generative models and the Schrödinger bridge. We conduct a series of comprehensive experiments to compare their performance and highlight their differing training behaviors. In addition, we propose a novel perceptual loss function tailored to the Schrödinger bridge framework, demonstrating improved performance and perceptual quality of the enhanced speech signals. All experimental code and pre-trained models are publicly available to facilitate further research and development.
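A hedged sketch of a perceptual term of the kind proposed here: enhanced and clean speech are compared in the feature space of a frozen pretrained encoder; the extractor, the layers used, and the weighting are assumptions, not the published loss.

```python
import torch
import torch.nn.functional as F

def perceptual_loss(enhanced_wav, clean_wav, feature_extractor, layer_weights=(1.0, 1.0)):
    """Distance between enhanced and clean speech in a frozen encoder's feature space.
    `feature_extractor` is assumed to return a list of layer-wise feature tensors."""
    with torch.no_grad():
        target_feats = feature_extractor(clean_wav)
    est_feats = feature_extractor(enhanced_wav)
    return sum(w * F.l1_loss(e, t)
               for w, e, t in zip(layer_weights, est_feats, target_feats))
```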
https://arxiv.org/abs/2409.10753
Recent advancements in automatic code generation using large language models (LLMs) have brought us closer to fully automated secure software development. However, existing approaches often rely on a single agent for code generation, which struggles to produce secure, vulnerability-free code. Traditional program synthesis with LLMs has primarily focused on functional correctness, often neglecting critical dynamic security implications that happen during runtime. To address these challenges, we propose AutoSafeCoder, a multi-agent framework that leverages LLM-driven agents for code generation, vulnerability analysis, and security enhancement through continuous collaboration. The framework consists of three agents: a Coding Agent responsible for code generation, a Static Analyzer Agent identifying vulnerabilities, and a Fuzzing Agent performing dynamic testing using a mutation-based fuzzing approach to detect runtime errors. Our contribution focuses on ensuring the safety of multi-agent code generation by integrating dynamic and static testing in an iterative process during LLM-based code generation, thereby improving security. Experiments using the SecurityEval dataset demonstrate a 13% reduction in code vulnerabilities compared to baseline LLMs, with no compromise in functionality.
Recent advances in automatic code generation with large language models (LLMs) have brought us closer to fully automated secure software development. However, existing approaches often rely on a single agent for code generation, which struggles to produce secure, vulnerability-free code, and traditional program synthesis has focused mainly on functional correctness, neglecting critical dynamic security issues that arise at runtime. To address these challenges, we propose AutoSafeCoder, a multi-agent framework that uses LLM-driven agents for code generation, vulnerability analysis, and security enhancement through continuous collaboration. The framework consists of three agents: a Coding Agent responsible for generating code, a Static Analyzer Agent that identifies vulnerabilities, and a Fuzzing Agent that performs dynamic testing with a mutation-based fuzzing approach. Our contribution lies in integrating dynamic and static testing in an iterative process during LLM-based code generation, thereby improving security and ensuring the safety of multi-agent code generation. Experiments on the SecurityEval dataset show a 13% reduction in code vulnerabilities compared with baseline LLMs, with no loss of functionality.
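The three-agent collaboration can be sketched as an iterative repair loop; the agent interfaces below are hypothetical callables, not AutoSafeCoder's actual APIs.

```python
def secure_codegen_loop(task, coding_agent, static_analyzer, fuzzer, max_rounds=3):
    """Generate code, feed static findings and fuzzing crashes back to the
    coding agent, and repeat until both checks pass or the budget is spent."""
    code = coding_agent(task)
    for _ in range(max_rounds):
        findings = static_analyzer(code)          # e.g. CWE-style warnings
        crashes = fuzzer(code)                    # mutation-based runtime errors
        if not findings and not crashes:
            break                                 # code passes both checks
        feedback = {"static": findings, "dynamic": crashes}
        code = coding_agent(task, feedback=feedback)   # repair round
    return code
```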
https://arxiv.org/abs/2409.10737
For more efficient generalization to unseen domains (classes), most Few-shot Segmentation (FSS) methods directly exploit pre-trained encoders and only fine-tune the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic, inevitably activating objects that are irrelevant to the target class. In contrast, humans can effortlessly focus on specific objects in the line of sight. This paper mimics the visual perception pattern of human beings and proposes a novel and powerful prompt-driven scheme, called ``Prompt and Transfer" (PAT), which constructs a dynamic class-aware prompting paradigm to tune the encoder for focusing on the interested object (target class) in the current task. Three key points are elaborated to enhance the prompting: 1) Cross-modal linguistic information is introduced to initialize prompts for each task. 2) Semantic Prompt Transfer (SPT) precisely transfers the class-specific semantics within the images to the prompts. 3) A Part Mask Generator (PMG) works in conjunction with SPT to adaptively generate different but complementary part prompts for different individuals. Surprisingly, PAT achieves competitive performance on 4 different tasks including standard FSS, Cross-domain FSS (e.g., CV, medical, and remote sensing domains), Weak-label FSS, and Zero-shot Segmentation, setting new state-of-the-art results on 11 benchmarks.
To generalize more efficiently to unseen domains (classes), most Few-shot Segmentation (FSS) methods directly exploit pre-trained encoders and fine-tune only the decoder, especially in the current era of large models. However, such fixed feature encoders tend to be class-agnostic and inevitably activate objects irrelevant to the target class, whereas humans can effortlessly focus on specific objects in their line of sight. This paper mimics the human visual perception pattern and proposes a novel, powerful prompt-driven scheme called "Prompt and Transfer" (PAT), which builds a dynamic class-aware prompting paradigm to tune the encoder so that it focuses on the object of interest (the target class) in the current task. Three key points strengthen the prompting: 1) cross-modal linguistic information is introduced to initialize the prompts for each task; 2) Semantic Prompt Transfer (SPT) precisely transfers class-specific semantics from the images to the prompts; and 3) a Part Mask Generator (PMG) works with SPT to adaptively generate different yet complementary part prompts for different individuals. Surprisingly, PAT achieves competitive performance on four tasks, including standard FSS, cross-domain FSS (e.g., CV, medical, and remote-sensing domains), weak-label FSS, and zero-shot segmentation, setting new state-of-the-art results on 11 benchmarks.
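A rough stand-in for the Semantic Prompt Transfer step, written as plain cross-attention in which prompt tokens attend only to support-image features belonging to the target class; this simplification is an assumption, not PAT's exact formulation.

```python
import torch.nn as nn

class SemanticPromptTransfer(nn.Module):
    """Prompts absorb class-specific semantics from mask-selected support features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, prompts, support_feats, support_mask):
        # prompts: (B, P, D); support_feats: (B, N, D); support_mask: (B, N) in {0, 1}
        ignore = support_mask < 0.5                  # True = background token, excluded
        out, _ = self.attn(prompts, support_feats, support_feats,
                           key_padding_mask=ignore)
        return prompts + out                         # class-aware prompts
```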
https://arxiv.org/abs/2409.10389
In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNN or LSTM, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches face limitations in fully modeling complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the current advanced model McNet by introducing an improved version of Mamba, a state-space model, and further propose MCMamba. MCMamba has been completely reengineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well in modeling spectral information.
In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional approaches such as CNNs or LSTMs attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features, but they struggle to fully model complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the advanced model McNet by introducing an improved version of Mamba, a state-space model, and further propose MCMamba. MCMamba is completely re-engineered to integrate full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive way of modeling spatial and spectral information. Experimental results show that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. We also find that Mamba performs exceptionally well at modeling spectral information.
https://arxiv.org/abs/2409.10376
Speech enhancement models should meet very low latency requirements, typically below 5 ms, for hearing assistive devices. While various low-latency techniques have been proposed, comparing these methods in a controlled setup using DNNs remains largely unexplored. Previous papers have variations in task, training data, scripts, and evaluation settings, which make fair comparison impossible. Moreover, all methods are tested on small, simulated datasets, making it difficult to fairly assess their performance in real-world conditions, which could impact the reliability of scientific findings. To address these issues, we comprehensively investigate various low-latency techniques using consistent training on large-scale data and evaluate with more relevant metrics on real-world data. Specifically, we explore the effectiveness of asymmetric windows, learnable windows, adaptive time domain filterbanks, and the future-frame prediction technique. Additionally, we examine whether increasing the model size can compensate for the reduced window size, as well as the novel Mamba architecture in low-latency environments.
Speech enhancement models for hearing assistive devices must meet very low latency requirements, typically below 5 ms. Although many low-latency techniques have been proposed, a controlled comparison of these methods using DNNs is still lacking: previous papers differ in task, training data, scripts, and evaluation settings, making fair comparison impossible, and all methods are tested on small, simulated datasets, which makes it difficult to assess their performance fairly under real-world conditions and may affect the reliability of the scientific findings. To address these issues, we comprehensively investigate various low-latency techniques with consistent training on large-scale data and evaluate them with more relevant metrics on real-world data. Specifically, we explore the effectiveness of asymmetric windows, learnable windows, adaptive time-domain filterbanks, and the future-frame prediction technique. We also examine whether increasing the model size can compensate for the reduced window size, as well as the novel Mamba architecture in low-latency settings.
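Among the techniques listed, the future-frame prediction trick is the easiest to sketch: the model is trained to emit the clean frame one hop ahead of its newest input, shaving that hop off the algorithmic latency. The tensor layout (batch, time, features) and the model interface below are assumptions.

```python
import torch.nn.functional as F

def future_frame_loss(model, noisy_frames, clean_frames, lookahead=1):
    """Train a causal enhancer to predict the clean frame `lookahead` hops ahead
    of its most recent input frame."""
    estimate = model(noisy_frames[:, :-lookahead])   # inputs end at frame t
    target = clean_frames[:, lookahead:]             # aligned targets at frame t + lookahead
    return F.l1_loss(estimate, target)
```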
https://arxiv.org/abs/2409.10358
We propose an end-to-end attribute compression method for dense point clouds. The proposed method combines a frequency sampling module, an adaptive scale feature extraction module with geometry assistance, and a global hyperprior entropy model. The frequency sampling module uses a Hamming window and the Fast Fourier Transform to extract high-frequency components of the point cloud. The difference between the original point cloud and the sampled point cloud is divided into multiple sub-point clouds. These sub-point clouds are then partitioned using an octree, providing a structured input for feature extraction. The feature extraction module integrates adaptive convolutional layers and uses offset-attention to capture both local and global features. Then, a geometry-assisted attribute feature refinement module is used to refine the extracted attribute features. Finally, a global hyperprior model is introduced for entropy encoding. This model propagates hyperprior parameters from the deepest (base) layer to the other layers, further enhancing the encoding efficiency. At the decoder, a mirrored network is used to progressively restore features and reconstruct the color attribute through transposed convolutional layers. The proposed method encodes base layer information at a low bitrate and progressively adds enhancement layer information to improve reconstruction accuracy. Compared to the latest G-PCC test model (TMC13v23) under the MPEG common test conditions (CTCs), the proposed method achieved an average Bjontegaard delta bitrate reduction of 24.58% for the Y component (21.23% for YUV combined) on the MPEG Category Solid dataset and 22.48% for the Y component (17.19% for YUV combined) on the MPEG Category Dense dataset. This is the first instance of a learning-based codec outperforming the G-PCC standard on these datasets under the MPEG CTCs.
We propose an end-to-end attribute compression method for dense point clouds that combines a frequency sampling module, an adaptive-scale feature extraction module with geometry assistance, and a global hyperprior entropy model. The frequency sampling module uses a Hamming window and the Fast Fourier Transform to extract the high-frequency components of the point cloud. The difference between the original and the sampled point cloud is divided into multiple sub-point clouds, which are then partitioned with an octree to provide structured input for feature extraction. The feature extraction module integrates adaptive convolutional layers and uses offset-attention to capture both local and global features, after which a geometry-assisted attribute feature refinement module refines the extracted attribute features. Finally, a global hyperprior model is introduced for entropy coding; it propagates hyperprior parameters from the deepest (base) layer to the other layers, further improving coding efficiency. At the decoder, a mirrored network progressively restores features and reconstructs the color attributes through transposed convolutional layers. The method encodes base-layer information at a low bitrate and progressively adds enhancement-layer information to improve reconstruction accuracy. Compared with the latest G-PCC test model (TMC13v23) under the MPEG common test conditions (CTCs), it achieves an average Bjontegaard delta bitrate reduction of 24.58% for the Y component (21.23% for YUV combined) on the MPEG Category Solid dataset and 22.48% for the Y component (17.19% for YUV combined) on the MPEG Category Dense dataset. This is the first time a learning-based codec has outperformed the G-PCC standard on these datasets under the MPEG CTCs.
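A toy version of the frequency-sampling idea on a single attribute channel, assuming the points are already ordered along some fixed traversal of the cloud; the cutoff and the selection rule are illustrative, not the paper's module.

```python
import numpy as np

def high_frequency_sample(attributes, keep_ratio=0.25):
    """Hamming-window the ordered attribute signal, FFT it, suppress low
    frequencies, and keep the points carrying the most high-frequency energy."""
    n = len(attributes)
    spectrum = np.fft.fft(attributes * np.hamming(n))
    cutoff = n // 4                                   # assumed low/high split
    spectrum[:cutoff] = 0                             # zero out low frequencies
    spectrum[-cutoff:] = 0                            # (both spectrum ends are low-frequency)
    high_freq = np.real(np.fft.ifft(spectrum))
    idx = np.argsort(-np.abs(high_freq))[: int(keep_ratio * n)]
    return np.sort(idx)                               # indices of sampled points
```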
https://arxiv.org/abs/2409.10293
In the era of big data, ensuring the quality of datasets has become increasingly crucial across various domains. We propose a comprehensive framework designed to automatically assess and rectify data quality issues in any given dataset, regardless of its specific content, focusing on both textual and numerical data. Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence. At the heart of our approach lies a rigorous demand for both explainability and interpretability, ensuring that the rationale behind the identification and correction of data anomalies is transparent and understandable. To achieve this, we adopt a hybrid approach that integrates statistical methods with machine learning algorithms. Indeed, by leveraging statistical techniques alongside machine learning, we strike a balance between accuracy and explainability, enabling users to trust and comprehend the assessment process. Acknowledging the challenges associated with automating the data quality assessment process, particularly in terms of time efficiency and accuracy, we adopt a pragmatic strategy, employing resource-intensive algorithms only when necessary, while favoring simpler, more efficient solutions whenever possible. Through a practical analysis conducted on a publicly provided dataset, we illustrate the challenges that arise when trying to enhance data quality while keeping explainability. We demonstrate the effectiveness of our approach in detecting and rectifying missing values, duplicates and typographical errors as well as the challenges remaining to be addressed to achieve similar accuracy on statistical outliers and logic errors under the constraints set in our work.
In the era of big data, ensuring dataset quality has become increasingly crucial across domains. We propose a comprehensive framework that automatically assesses and rectifies data quality issues in any given dataset, regardless of its specific content, focusing on both textual and numerical data. Our primary objective is to address three fundamental types of defects: absence, redundancy, and incoherence. At the heart of our approach is a strict demand for explainability and interpretability, ensuring that the rationale behind identifying and correcting data anomalies is transparent and understandable. To achieve this, we adopt a hybrid approach that combines statistical methods with machine learning algorithms, striking a balance between accuracy and explainability so that users can trust and understand the assessment process. Acknowledging the challenges of automating data quality assessment, particularly regarding time efficiency and accuracy, we adopt a pragmatic strategy, using resource-intensive algorithms only when necessary and favouring simpler, more efficient solutions wherever possible. Through a practical analysis of a publicly provided dataset, we illustrate the challenges of improving data quality while preserving explainability, and we demonstrate the effectiveness of our approach in detecting and rectifying missing values, duplicates, and typographical errors, as well as the challenges that remain for achieving similar accuracy on statistical outliers and logic errors under the constraints of our work.
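A minimal illustration of the three defect families on a tabular dataset; the thresholds are illustrative, and the full framework combines such statistics with machine-learning models.

```python
import pandas as pd

def basic_quality_report(df: pd.DataFrame) -> dict:
    """Report absence (missing cells), redundancy (duplicate rows), and a crude
    incoherence signal (numeric outliers flagged by z-score)."""
    report = {
        "absence": df.isna().sum().to_dict(),          # missing cells per column
        "redundancy": int(df.duplicated().sum()),      # exact duplicate rows
    }
    numeric = df.select_dtypes("number")
    z = (numeric - numeric.mean()) / numeric.std(ddof=0)
    report["incoherence"] = (z.abs() > 3).sum().to_dict()   # outlier count per column
    return report
```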
https://arxiv.org/abs/2409.10139
Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to methods based on Transformer architecture. This work introduces Fast Mamba for Vision (Famba-V), a cross-layer token fusion technique to enhance the training efficiency of Vim models. The key idea of Famba-V is to identify and fuse similar tokens across different Vim layers based on a suite of cross-layer strategies instead of simply applying token fusion uniformly across all the layers as existing works propose. We evaluate the performance of Famba-V on CIFAR-100. Our results show that Famba-V is able to enhance the training efficiency of Vim models by reducing both training time and peak memory usage during training. Moreover, the proposed cross-layer strategies allow Famba-V to deliver superior accuracy-efficiency trade-offs. Together, these results demonstrate Famba-V as a promising efficiency enhancement technique for Vim models.
Mamba and Vision Mamba (Vim) models have shown potential as an alternative to Transformer-based methods. This work introduces Fast Mamba for Vision (Famba-V), a cross-layer token fusion technique for improving the training efficiency of Vim models. The key idea of Famba-V is to identify and fuse similar tokens across different Vim layers according to a suite of cross-layer strategies, rather than applying token fusion uniformly across all layers as existing works propose. We evaluate Famba-V on CIFAR-100, and the results show that it improves the training efficiency of Vim models by reducing both training time and peak memory usage during training. Moreover, the proposed cross-layer strategies allow Famba-V to deliver superior accuracy-efficiency trade-offs. Together, these results demonstrate that Famba-V is a promising efficiency-enhancement technique for Vim models.
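A toy token-fusion step in the spirit of Famba-V, greedily merging the most similar token pairs by cosine similarity; the pairing rule and which Vim layers it is applied to are simplified assumptions, not the paper's cross-layer strategies.

```python
import torch
import torch.nn.functional as F

def fuse_similar_tokens(tokens: torch.Tensor, n_fuse: int) -> torch.Tensor:
    """Merge the n_fuse most similar token pairs of an (N, D) sequence into their means."""
    tokens = tokens.clone()
    for _ in range(n_fuse):
        sim = F.cosine_similarity(tokens.unsqueeze(1), tokens.unsqueeze(0), dim=-1)
        sim.fill_diagonal_(-1.0)                       # ignore self-matches
        i, j = divmod(int(sim.argmax()), tokens.size(0))
        merged = (tokens[i] + tokens[j]) / 2           # fuse the closest pair
        tokens = tokens[[k for k in range(tokens.size(0)) if k != j]]
        tokens[i if i < j else i - 1] = merged
    return tokens                                      # (N - n_fuse, D)
```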
https://arxiv.org/abs/2409.09808
With the rapid development of marine engineering projects such as marine resource extraction and oceanic surveys, underwater visual imaging and analysis has become a critical technology. Unfortunately, due to the inevitable non-linear attenuation of light in underwater environments, underwater images and videos often suffer from low contrast, blurriness, and color degradation, which significantly complicate the subsequent research. Existing underwater image enhancement methods often treat the haze and color cast as a unified degradation process and disregard their independence and interdependence, which limits the performance improvement. Here, we propose a Vision Transformer (ViT)-based network (referred to as WaterFormer) to improve the underwater image quality. WaterFormer contains three major components: a dehazing block (DehazeFormer Block) to capture the self-correlated haze features and extract deep-level features, a Color Restoration Block (CRB) to capture self-correlated color cast features, and a Channel Fusion Block (CFB) to capture fusion features within the network. To ensure authenticity, a soft reconstruction layer based on the underwater imaging physics model is included. To improve the quality of the enhanced images, we introduce the Chromatic Consistency Loss and Sobel Color Loss to train the network. Comprehensive experimental results demonstrate that WaterFormer outperforms other state-of-the-art methods in enhancing underwater images.
With the rapid development of marine engineering projects such as marine resource extraction and oceanic surveys, underwater visual imaging and analysis has become a critical technology. However, because light attenuates non-linearly underwater, underwater images and videos often suffer from low contrast, blur, and color degradation, which complicates subsequent research. Existing underwater image enhancement methods usually treat haze and color cast as a single unified degradation process, disregarding their independence and interdependence, which limits performance gains. This paper proposes a Vision Transformer (ViT)-based network, WaterFormer, to improve underwater image quality. WaterFormer contains three major components: a dehazing block (DehazeFormer Block) that captures self-correlated haze features and extracts deep-level features, a Color Restoration Block (CRB) that captures self-correlated color-cast features, and a Channel Fusion Block (CFB) that captures fusion features within the network. To ensure authenticity, a soft reconstruction layer based on the underwater imaging physics model is included. To improve the quality of the enhanced images, we introduce a Chromatic Consistency Loss and a Sobel Color Loss to train the network. Comprehensive experiments show that WaterFormer outperforms other state-of-the-art methods in enhancing underwater images.
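The physics-guided soft reconstruction can be sketched with the simplified underwater imaging model I = J·t + B·(1 − t): given a predicted transmission map and backscattered light, the clear image is recovered by inverting the model. The parameterisation below is an assumption, not WaterFormer's exact layer.

```python
import torch

def soft_reconstruction(raw_img, transmission, backscatter, eps=1e-3):
    """Invert I = J * t + B * (1 - t) to recover the clear image J from the
    network's predicted transmission t and backscatter B."""
    t = transmission.clamp(min=eps)                    # avoid division by zero
    clear = (raw_img - backscatter * (1.0 - t)) / t
    return clear.clamp(0.0, 1.0)
```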
https://arxiv.org/abs/2409.09779
Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement due to their ability to model complex speech data distributions. While these models generalize well to unseen acoustic environments, they may not achieve the same level of fidelity as the discriminative models specifically trained to enhance particular acoustic conditions. In this paper, we propose Ex-Diff, a novel score-based diffusion model that integrates the latent representations produced by a discriminative model to improve speech and vocal enhancement, which combines the strengths of both generative and discriminative models. Experimental results on the widely used MUSDB dataset show relative improvements of 3.7% in SI-SDR and 10.0% in SI-SIR compared to the baseline diffusion model for speech and vocal enhancement tasks, respectively. Additionally, case studies are provided to further illustrate and analyze the complementary nature of generative and discriminative models in this context.
Diffusion-based generative models have recently achieved remarkable results in speech and vocal enhancement thanks to their ability to model complex speech data distributions. Although these models generalize well to unseen acoustic environments, they may not reach the fidelity of discriminative models trained specifically for particular acoustic conditions. In this paper, we propose Ex-Diff, a novel score-based diffusion model that integrates the latent representations produced by a discriminative model to improve speech and vocal enhancement, combining the strengths of generative and discriminative models. Experiments on the widely used MUSDB dataset show relative improvements of 3.7% in SI-SDR and 10.0% in SI-SIR over the baseline diffusion model for speech and vocal enhancement tasks, respectively. Case studies further illustrate and analyze the complementary nature of generative and discriminative models in this context.
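A minimal sketch of the conditioning idea, in which the score network additionally receives the frozen discriminative model's latent representation concatenated along the channel axis; the backbone is a placeholder and the diffusion-time embedding is omitted for brevity.

```python
import torch
import torch.nn as nn

class ConditionedScoreNet(nn.Module):
    """Score network fed the noisy state, the mixture, and discriminative latents."""
    def __init__(self, spec_ch=2, latent_ch=16, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * spec_ch + latent_ch, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.SiLU(),
            nn.Conv2d(width, spec_ch, 3, padding=1))

    def forward(self, noisy_state, mixture, disc_latent):
        x = torch.cat([noisy_state, mixture, disc_latent], dim=1)
        return self.net(x)        # estimated score for the current diffusion step
```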
https://arxiv.org/abs/2409.09642