This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, which was held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. The challenge addresses a major open problem in the field of image and video processing, namely, Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into an image track and a video track. The image track uses the AIGIQA-20K database, which contains 20,000 AI-Generated Images (AIGIs) produced by 15 popular generative models, and had a total of 318 registered participants. A total of 1,646 submissions were received in the development phase and 221 in the test phase; finally, 16 participating teams submitted their models and fact sheets. The video track uses the T2VQA-DB database, which contains 10,000 AI-Generated Videos (AIGVs) produced by 9 popular Text-to-Video (T2V) models, and had a total of 196 registered participants. A total of 991 submissions were received in the development phase and 185 in the test phase; finally, 12 participating teams submitted their models and fact sheets. Several methods achieved better results than the baseline methods, and the winning methods in both tracks demonstrated superior prediction performance on AIGC.
https://arxiv.org/abs/2404.16687
This paper presents the multi-speaker multi-lingual few-shot voice cloning system developed by the THU-HCSI team for the LIMMITS'24 Challenge. To achieve high speaker similarity and naturalness in both mono-lingual and cross-lingual scenarios, we build the system upon YourTTS and add several enhancements. To further improve speaker similarity and speech quality, we introduce a speaker-aware text encoder and a flow-based decoder with Transformer blocks. In addition, we denoise the few-shot data, mix them with the pre-training data, and adopt a speaker-balanced sampling strategy to guarantee effective fine-tuning for the target speakers. The official evaluations in track 1 show that our system achieves the best speaker similarity MOS of 4.25 and a considerable naturalness MOS of 3.97.
https://arxiv.org/abs/2404.16619
Managing the semantic quality of the categorization in large textual datasets, such as Wikipedia, presents significant challenges in terms of complexity and cost. In this paper, we propose leveraging transformer models to distill semantic information from texts in the Wikipedia dataset and its associated categories into a latent space. We then explore different approaches based on these encodings to assess and enhance the semantic identity of the categories. Our graphical approach is powered by Convex Hull, while we utilize Hierarchical Navigable Small Worlds (HNSWs) for the hierarchical approach. As a solution to the information loss caused by the dimensionality reduction, we formulate the following mathematical solution: an exponential decay function driven by the Euclidean distances between the high-dimensional encodings of the textual categories. This function represents a filter built around a contextual category and retrieves items with a certain Reconsideration Probability (RP). Retrieving high-RP items serves as a tool for database administrators to improve data groupings by providing recommendations and identifying outliers within a contextual framework.
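The exponential-decay RP filter described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the decay rate `lam`, the retrieval threshold, and the toy encodings are assumed values.

```python
import numpy as np

def reconsideration_probability(center, encodings, lam=1.0):
    """RP(x) = exp(-lam * ||x - center||) for each encoding x,
    where `center` is the encoding of the contextual category."""
    dists = np.linalg.norm(encodings - center, axis=1)
    return np.exp(-lam * dists)

# Retrieve items whose RP around a contextual category exceeds a threshold.
center = np.zeros(4)                                  # contextual category encoding
items = np.array([[0.1, 0, 0, 0],                     # near the category
                  [3.0, 0, 0, 0],                     # far away (likely outlier)
                  [0, 0.2, 0, 0]])                    # near the category
rp = reconsideration_probability(center, items, lam=1.0)
high_rp = np.where(rp > 0.5)[0]                       # indices recommended for review
```

Items far from the contextual category decay toward zero probability, so only nearby encodings survive the filter.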
https://arxiv.org/abs/2404.16442
Unsupervised Domain Adaptation (UDA) refers to the method that utilizes annotated source domain data and unlabeled target domain data to train a model capable of generalizing to the target domain data. Domain discrepancy leads to a significant decrease in the performance of general network models trained on the source domain data when applied to the target domain. We introduce a straightforward approach to mitigate the domain discrepancy, which necessitates no additional parameter calculations and seamlessly integrates with self-training-based UDA methods. Through the transfer of the target domain style to the source domain in the latent feature space, the model is trained to prioritize the target domain style during the decision-making process. We tackle the problem at both the image-level and shallow feature map level by transferring the style information from the target domain to the source domain data. As a result, we obtain a model that exhibits superior performance on the target domain. Our method yields remarkable enhancements in the state-of-the-art performance for synthetic-to-real UDA tasks. For example, our proposed method attains a noteworthy UDA performance of 76.93 mIoU on the GTA->Cityscapes dataset, representing a notable improvement of +1.03 percentage points over the previous state-of-the-art results.
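The latent-space style transfer above can be illustrated with an AdaIN-style statistics swap. This is a hedged sketch under the assumption that style is carried by channel-wise mean and standard deviation of shallow feature maps; the paper's exact operation may differ.

```python
import numpy as np

def transfer_style(source_feat, target_feat, eps=1e-5):
    """Re-style a (C, H, W) source feature map with target-domain statistics:
    normalize away the source's channel-wise mean/std, then apply the target's."""
    s_mu = source_feat.mean(axis=(1, 2), keepdims=True)
    s_std = source_feat.std(axis=(1, 2), keepdims=True) + eps
    t_mu = target_feat.mean(axis=(1, 2), keepdims=True)
    t_std = target_feat.std(axis=(1, 2), keepdims=True) + eps
    return (source_feat - s_mu) / s_std * t_std + t_mu

rng = np.random.default_rng(0)
src = rng.normal(loc=0.0, scale=1.0, size=(3, 8, 8))   # source-domain features
tgt = rng.normal(loc=2.0, scale=0.5, size=(3, 8, 8))   # target-domain features
styled = transfer_style(src, tgt)                       # source content, target style
```

Because no parameters are introduced, such a swap integrates with self-training pipelines without extra parameter calculations, matching the claim in the abstract.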
https://arxiv.org/abs/2404.16301
Prompt leakage in large language models (LLMs) poses a significant security and privacy threat, particularly in retrieval-augmented generation (RAG) systems. However, leakage in multi-turn LLM interactions along with mitigation strategies has not been studied in a standardized manner. This paper investigates LLM vulnerabilities against prompt leakage across 4 diverse domains and 10 closed- and open-source LLMs. Our unique multi-turn threat model leverages the LLM's sycophancy effect and our analysis dissects task instruction and knowledge leakage in the LLM response. In a multi-turn setting, our threat model elevates the average attack success rate (ASR) to 86.2%, including a 99% leakage with GPT-4 and claude-1.3. We find that some black-box LLMs like Gemini show variable susceptibility to leakage across domains - they are more likely to leak contextual knowledge in the news domain compared to the medical domain. Our experiments measure specific effects of 6 black-box defense strategies, including a query-rewriter in the RAG scenario. Our proposed multi-tier combination of defenses still has an ASR of 5.3% for black-box LLMs, indicating room for enhancement and future direction for LLM security research.
https://arxiv.org/abs/2404.16251
The advent of personalized content generation by LLMs presents a novel challenge: how to efficiently adapt text to meet individual preferences without the unsustainable demand of creating a unique model for each user. This study introduces an innovative online method that employs neural bandit algorithms to dynamically optimize soft instruction embeddings based on user feedback, enhancing the personalization of open-ended text generation by white-box LLMs. Through rigorous experimentation on various tasks, we demonstrate significant performance improvements over baseline strategies. NeuralTS, in particular, leads to substantial enhancements in personalized news headline generation, achieving up to a 62.9% improvement in terms of best ROUGE scores and up to 2.76% increase in LLM-agent evaluation against the baseline.
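The online loop can be sketched with Thompson sampling over candidate instruction embeddings. This uses a linear reward model as a stand-in for the neural bandit (NeuralTS in the paper); the dimensions, the candidate set, and the simulated user feedback are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
arms = rng.normal(size=(10, d))        # candidate soft-instruction embeddings
true_theta = rng.normal(size=d)        # hidden user preference (simulated feedback)

A = np.eye(d)                          # posterior precision of the reward model
b = np.zeros(d)
for t in range(200):
    # Sample a plausible preference vector from the current posterior.
    theta_sample = rng.multivariate_normal(np.linalg.solve(A, b), np.linalg.inv(A))
    choice = int(np.argmax(arms @ theta_sample))      # pick the best-looking arm
    # Observe user feedback; here a noisy linear reward stands in for it.
    reward = arms[choice] @ true_theta + 0.1 * rng.normal()
    A += np.outer(arms[choice], arms[choice])         # Bayesian linear update
    b += reward * arms[choice]

theta_hat = np.linalg.solve(A, b)      # posterior mean of the user preference
```

Each round refines the posterior from a single piece of feedback, which is what makes per-user adaptation feasible without training a per-user model.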
https://arxiv.org/abs/2404.16115
Modular deep learning is the state-of-the-art solution for lifting the curse of multilinguality, preventing the impact of negative interference and enabling cross-lingual performance in Multilingual Pre-trained Language Models. However, a trade-off of this approach is the reduction in positive transfer learning from closely related languages. In response, we introduce a novel method called language arithmetic, which enables training-free post-processing to address this limitation. Inspired by the task arithmetic framework, we apply learning via addition to the language adapters, transitioning the framework from a multi-task to a multilingual setup. The effectiveness of the proposed solution is demonstrated on three downstream tasks in a MAD-X-based set of cross-lingual schemes, acting as a post-processing procedure. Language arithmetic consistently improves the baselines with significant gains in the most challenging cases of zero-shot and low-resource applications. Our code and models are available at this https URL .
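"Learning via addition" on language adapters can be sketched as a training-free, element-wise combination of adapter parameters, in the spirit of task arithmetic. The mixing weight `alpha` and the toy adapters are assumptions for illustration, not values from the paper.

```python
import numpy as np

def language_arithmetic(adapter_a, adapter_b, alpha=1.0):
    """Combine two language adapters' parameter dicts by weighted addition,
    as a post-processing step (no gradient updates involved)."""
    return {name: adapter_a[name] + alpha * adapter_b[name]
            for name in adapter_a}

# e.g. boost a target-language adapter with a closely related language's adapter
adapter_es = {"down.weight": np.ones((2, 2))}          # hypothetical Spanish adapter
adapter_pt = {"down.weight": np.full((2, 2), 0.5)}     # hypothetical Portuguese adapter
combined = language_arithmetic(adapter_es, adapter_pt, alpha=0.4)
```

Since the combination is pure post-processing, it slots into MAD-X-style cross-lingual setups without retraining, which is the appeal in zero-shot and low-resource cases.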
https://arxiv.org/abs/2404.15737
Cooperative Adaptive Cruise Control (CACC) represents a quintessential control strategy for orchestrating vehicular platoon movement within Connected and Automated Vehicle (CAV) systems, significantly enhancing traffic efficiency and reducing energy consumption. In recent years, the data-driven methods, such as reinforcement learning (RL), have been employed to address this task due to their significant advantages in terms of efficiency and flexibility. However, the delay issue, which often arises in real-world CACC systems, is rarely taken into account by current RL-based approaches. To tackle this problem, we propose a Delay-Aware Multi-Agent Reinforcement Learning (DAMARL) framework aimed at achieving safe and stable control for CACC. We model the entire decision-making process using a Multi-Agent Delay-Aware Markov Decision Process (MADA-MDP) and develop a centralized training with decentralized execution (CTDE) MARL framework for distributed control of CACC platoons. An attention mechanism-integrated policy network is introduced to enhance the performance of CAV communication and decision-making. Additionally, a velocity optimization model-based action filter is incorporated to further ensure the stability of the platoon. Experimental results across various delay conditions and platoon sizes demonstrate that our approach consistently outperforms baseline methods in terms of platoon safety, stability and overall performance.
https://arxiv.org/abs/2404.15696
Large Language Models (LLMs) have emerged as powerful candidates to inform clinical decision-making processes. While these models play an increasingly prominent role in shaping the digital landscape, two growing concerns emerge in healthcare applications: 1) to what extent do LLMs exhibit social bias based on patients' protected attributes (like race), and 2) how do design choices (like architecture design and prompting strategies) influence the observed biases? To answer these questions rigorously, we evaluated eight popular LLMs across three question-answering (QA) datasets using clinical vignettes (patient descriptions) standardized for bias evaluations. We employ red-teaming strategies to analyze how demographics affect LLM outputs, comparing both general-purpose and clinically-trained models. Our extensive experiments reveal various disparities (some significant) across protected groups. We also observe several counter-intuitive patterns, such as larger models not necessarily being less biased and models fine-tuned on medical data not necessarily being better than general-purpose models. Furthermore, our study demonstrates the impact of prompt design on bias patterns and shows that specific phrasing can influence bias patterns and that reflection-type approaches (like Chain of Thought) can reduce biased outcomes effectively. Consistent with prior studies, we call for additional evaluation, scrutiny, and enhancement of LLMs used in clinical decision support applications.
https://arxiv.org/abs/2404.15149
It is challenging to improve automatic speech recognition (ASR) performance in noisy conditions with a single-channel speech enhancement (SE) front-end. This is generally attributed to the processing distortions caused by the nonlinear processing of single-channel SE front-ends. However, the causes of such degraded ASR performance have not been fully investigated. How to design single-channel SE front-ends in a way that significantly improves ASR performance remains an open research question. In this study, we investigate a signal-level numerical metric that can explain the cause of degradation in ASR performance. To this end, we propose a novel analysis scheme based on the orthogonal projection-based decomposition of SE errors. This scheme manually modifies the ratio of the decomposed interference, noise, and artifact errors, and it enables us to directly evaluate the impact of each error type on ASR performance. Our analysis reveals the particularly detrimental effect of artifact errors on ASR performance compared to the other types of errors. This provides us with a more principled definition of processing distortions that cause the ASR performance degradation. Then, we study two practical approaches for reducing the impact of artifact errors. First, we prove that the simple observation adding (OA) post-processing (i.e., interpolating the enhanced and observed signals) can monotonically improve the signal-to-artifact ratio. Second, we propose a novel training objective, called artifact-boosted signal-to-distortion ratio (AB-SDR), which forces the model to estimate the enhanced signals with fewer artifact errors. Through experiments, we confirm that both the OA and AB-SDR approaches are effective in decreasing artifact errors caused by single-channel SE front-ends, allowing them to significantly improve ASR performance.
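The error decomposition and the observation-adding (OA) post-processing can be sketched as follows. This is a deliberately reduced illustration: real evaluations project onto subspaces spanned by delayed copies of the references (as in BSS-Eval-style toolkits), whereas this single-vector version only shows the idea, and the interpolation weight `alpha` is an assumed value.

```python
import numpy as np

def project(x, ref):
    """Orthogonal projection of x onto the span of a single reference signal."""
    return (x @ ref) / (ref @ ref) * ref

def decompose(enhanced, speech, noise):
    """Split the enhanced signal into target, noise, and artifact components."""
    e_target = project(enhanced, speech)
    e_noise = project(enhanced - e_target, noise)
    e_artifact = enhanced - e_target - e_noise   # what neither reference explains
    return e_target, e_noise, e_artifact

def observation_adding(enhanced, observed, alpha=0.3):
    """OA post-processing: interpolate the enhanced and observed signals."""
    return (1 - alpha) * enhanced + alpha * observed

speech = np.array([1.0, 0.0, 0.0, 0.0])
noise = np.array([0.0, 1.0, 0.0, 0.0])
enhanced = np.array([2.0, 0.5, 0.1, 0.0])
tgt, nse, art = decompose(enhanced, speech, noise)
```

Mixing the unprocessed observation back in can only re-introduce components explainable by the references, which is the intuition behind OA monotonically improving the signal-to-artifact ratio.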
https://arxiv.org/abs/2404.14860
Code translation tools are developed for automatic source-to-source translation. Although learning-based transpilers have shown impressive improvements over rule-based counterparts, owing to their task-specific pre-training on extensive monolingual corpora, their current performance still remains unsatisfactory for practical deployment, and the associated training resources are also prohibitively expensive. LLMs pre-trained on huge amounts of human-written code/text have shown remarkable performance in many code intelligence tasks due to their powerful generality, even without task-specific training. Thus, LLMs can potentially circumvent the above limitations, but they have not been exhaustively explored yet. This paper investigates diverse LLMs and learning-based transpilers for automated code translation tasks, finding that: although certain LLMs have outperformed current transpilers, they still have some accuracy issues, where most of the failures are induced by a lack of comprehension of source programs (38.51%), missing clear instructions on I/O types in translation (14.94%), and ignoring discrepancies between source and target programs (41.38%). Enlightened by the above findings, we propose UniTrans, a Unified code Translation framework, applicable to various LLMs, for unleashing their power in this field. Specifically, UniTrans first crafts a series of test cases for target programs with the assistance of source programs. Next, it harnesses the above auto-generated test cases to augment the code translation and then evaluates their correctness via execution. Afterward, UniTrans further (iteratively) repairs incorrectly translated programs prompted by test case execution results. Extensive experiments are conducted on six translation datasets between Python, Java, and C++. Three recent LLMs of diverse sizes are tested with UniTrans, and all achieve substantial improvements.
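The translate-test-repair loop can be written schematically as below. `llm` and `run_tests` are placeholder callables, and the prompt strings and toy stand-ins are illustrative assumptions; the paper's actual prompts and execution harness are more elaborate.

```python
def unitrans(source_program, llm, run_tests, max_repairs=3):
    # 1. Craft test cases for the target program with help from the source.
    tests = llm("craft tests for:\n" + source_program)
    # 2. Translate, conditioning the prompt on the auto-generated tests.
    translation = llm("translate (must pass the tests below):\n"
                      + tests + "\n" + source_program)
    # 3. Execute the tests and iteratively repair failing translations.
    for _ in range(max_repairs):
        failures = run_tests(translation, tests)
        if not failures:
            break
        translation = llm("repair using failures:\n" + failures + "\n" + translation)
    return translation

# Toy stand-ins: the first attempt fails once, then a repair round fixes it.
def toy_llm(prompt):
    return "fixed" if prompt.startswith("repair") else "draft"

def toy_runner(translation, tests):
    return "" if translation == "fixed" else "AssertionError"

result = unitrans("def add(a, b): return a + b", toy_llm, toy_runner)
```

The loop terminates either when the tests pass or after a fixed repair budget, so a stubborn mistranslation cannot stall the pipeline.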
https://arxiv.org/abs/2404.14646
One key aspect differentiating data-driven single- and multi-channel speech enhancement and dereverberation methods is that both the problem formulation and the complexity of the solutions are considerably more challenging in the latter case. Additionally, with limited computational resources, it is cumbersome to train models that require the management of larger datasets or those with more complex designs. In this scenario, the unverified hypothesis that single-channel methods can be adapted to multi-channel scenarios simply by processing each channel independently has significant implications: it improves compatibility between sound-scene capture and system input-output formats, while also allowing modern research to focus on other challenging aspects, such as full-bandwidth audio enhancement, competitive noise suppression, and unsupervised learning. This study verifies this hypothesis by comparing the enhancement achieved by a basic single-channel speech enhancement and dereverberation model with that of two multi-channel models tailored to separate clean speech from noisy 3D mixes. A direction-of-arrival estimation model was used to objectively evaluate the capacity to preserve spatial information by comparing the output signals with ground-truth coordinate values. Consequently, a trade-off emerges: the more straightforward single-channel solution preserves spatial information better, but at the cost of lower gains in intelligibility scores.
https://arxiv.org/abs/2404.14564
Learning-based underwater image enhancement (UIE) methods have made great progress. However, the lack of large-scale, high-quality paired training samples has become the main bottleneck hindering the development of UIE. The inter-frame information in underwater videos can accelerate or optimize the UIE process. Thus, we constructed the first large-scale high-resolution underwater video enhancement benchmark (UVEB) to promote the development of underwater image enhancement. UVEB contains 1,308 pairs of video sequences and more than 453,000 high-resolution frame pairs, 38% of which are Ultra-High-Definition (UHD) 4K. UVEB comes from multiple countries, containing various scenes and video degradation types to adapt to diverse and complex underwater environments. We also propose the first supervised underwater video enhancement method, UVE-Net. UVE-Net converts the current frame's information into convolutional kernels and passes them to adjacent frames for efficient inter-frame information exchange. By fully utilizing the redundant degraded information of underwater videos, UVE-Net achieves better video enhancement. Experiments demonstrate the effective network design and good performance of UVE-Net.
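Passing one frame's information to a neighbour as a convolution kernel can be illustrated as below. In UVE-Net the frame-to-kernel mapping is learned; here a fixed average-pooling stand-in is assumed, the kernel is hardcoded to 3x3, and frames are assumed positive-valued so the normalization is well defined.

```python
import numpy as np

def frame_to_kernel(frame, k=3):
    """Pool a (H, W) frame down to a k x k kernel (H, W divisible by k)."""
    h, w = frame.shape
    kernel = frame.reshape(k, h // k, k, w // k).mean(axis=(1, 3))
    return kernel / kernel.sum()          # normalize so filtering preserves scale

def cross_frame_filter(kernel, neighbour):
    """Filter an adjacent frame with the current frame's kernel (3x3, edge pad)."""
    out = np.zeros_like(neighbour)
    pad = np.pad(neighbour, 1, mode="edge")
    k = kernel.shape[0]
    for i in range(neighbour.shape[0]):
        for j in range(neighbour.shape[1]):
            out[i, j] = (pad[i:i + k, j:j + k] * kernel).sum()
    return out

kernel = frame_to_kernel(np.ones((6, 6)) * 2.0)   # current frame -> kernel
out = cross_frame_filter(kernel, np.ones((6, 6)))  # applied to the adjacent frame
```

The kernel is a compact summary of the current frame, so exchanging kernels is far cheaper than exchanging full feature maps between frames.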
https://arxiv.org/abs/2404.14542
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an `observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
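SnapKV's per-head selection rule can be sketched as follows: aggregate how strongly the observation window at the end of the prompt attends to each prefix position, max-pool so clustered positions survive together, and keep the top-k prefix positions plus the window itself. The pooling width, window size, and toy attention matrix are illustrative assumptions.

```python
import numpy as np

def snapkv_select(attn, window, k, pool=3):
    """attn: (seq_len, seq_len) attention weights of one head.
    Returns the indices of KV positions kept in the compressed cache."""
    seq_len = attn.shape[1]
    prefix_len = seq_len - window
    # Votes: total attention the observation window pays to each prefix position.
    votes = attn[-window:, :prefix_len].sum(axis=0)
    # 1-D max pooling so neighbouring (clustered) positions are kept together.
    pad = pool // 2
    padded = np.pad(votes, pad, mode="edge")
    pooled = np.array([padded[i:i + pool].max() for i in range(prefix_len)])
    keep = np.sort(np.argsort(pooled)[-k:])
    # Keep the selected prefix positions plus the whole observation window.
    return np.concatenate([keep, np.arange(prefix_len, seq_len)])

attn = np.zeros((8, 8))
attn[-2:, 2] = 1.0                     # window attends heavily to position 2
attn[-2:, 5] = 0.5
kept = snapkv_select(attn, window=2, k=2)
```

Only the kept positions' key/value tensors are retained, which is where the memory and decoding-speed savings come from.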
https://arxiv.org/abs/2404.14469
This paper reviews the NTIRE 2024 low light image enhancement challenge, highlighting the proposed solutions and results. The aim of this challenge is to discover an effective network design or solution capable of generating brighter, clearer, and visually appealing results when dealing with a variety of conditions, including ultra-high resolution (4K and beyond), non-uniform illumination, backlighting, extreme darkness, and night scenes. A notable total of 428 participants registered for the challenge, with 22 teams ultimately making valid submissions. This paper meticulously evaluates the state-of-the-art advancements in enhancing low-light images, reflecting the significant progress and creativity in this field.
https://arxiv.org/abs/2404.14248
Extremely low-light text images are common in natural scenes, making scene text detection and recognition challenging. One solution is to enhance these images using low-light image enhancement methods before text extraction. However, previous methods often do not try to particularly address the significance of low-level features, which are crucial for optimal performance on downstream scene text tasks. Further research is also hindered by the lack of extremely low-light text datasets. To address these limitations, we propose a novel encoder-decoder framework with an edge-aware attention module to focus on scene text regions during enhancement. Our proposed method uses novel text detection and edge reconstruction losses to emphasize low-level scene text features, leading to successful text extraction. Additionally, we present a Supervised Deep Curve Estimation (Supervised-DCE) model to synthesize extremely low-light images based on publicly available scene text datasets such as ICDAR15 (IC15). We also labeled texts in the extremely low-light See In the Dark (SID) and ordinary LOw-Light (LOL) datasets to allow for objective assessment of extremely low-light image enhancement through scene text tasks. Extensive experiments show that our model outperforms state-of-the-art methods in terms of both image quality and scene text metrics on the widely-used LOL, SID, and synthetic IC15 datasets. Code and dataset will be released publicly at this https URL.
https://arxiv.org/abs/2404.14135
In real-world scenarios, captured images often suffer from blurring, noise, and other forms of image degradation, and due to sensor limitations, people usually can only obtain low dynamic range images. To achieve high-quality images, researchers have attempted various image restoration and enhancement operations on photographs, including denoising, deblurring, and high dynamic range imaging. However, merely performing a single type of image enhancement still cannot yield satisfactory images. To address this challenge, we propose the Composite Refinement Network (CRNet), which uses multiple exposure images. By fully integrating information-rich multiple-exposure inputs, CRNet can perform unified image restoration and enhancement. To improve the quality of image details, CRNet explicitly separates and strengthens high- and low-frequency information through pooling layers, using specially designed Multi-Branch Blocks for effective fusion of these frequencies. To increase the receptive field and fully integrate input features, CRNet employs the High-Frequency Enhancement Module, which includes large kernel convolutions and an inverted bottleneck ConvFFN. Our model secured third place in the first track of the Bracketing Image Restoration and Enhancement Challenge, surpassing previous SOTA models in both testing metrics and visual quality.
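The pooling-based frequency separation can be sketched in a few lines. This is a minimal stand-in: the real Multi-Branch Blocks are learned modules, and the pooling window is an assumed parameter (with H and W divisible by it).

```python
import numpy as np

def split_frequencies(img, pool=2):
    """img: (H, W) array; returns (low, high) with low + high == img."""
    h, w = img.shape
    # Average pooling followed by nearest-neighbour upsampling -> low frequencies.
    low = img.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))
    low = low.repeat(pool, axis=0).repeat(pool, axis=1)
    high = img - low                     # residual carries edges and fine detail
    return low, high

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 4))
low, high = split_frequencies(img)
```

Because the split is an exact decomposition, the two branches can be processed separately and fused back without losing information.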
https://arxiv.org/abs/2404.14132
We revisit Tree-Ring Watermarking, a recent diffusion model watermarking method that demonstrates great robustness to various attacks. We conduct an in-depth study on it and reveal that the distribution shift unintentionally introduced by the watermarking process, apart from watermark pattern matching, contributes to its exceptional robustness. Our investigation further exposes inherent flaws in its original design, particularly in its ability to identify multiple distinct keys, where distribution shift offers no assistance. Based on these findings and analysis, we present RingID for enhanced multi-key identification. It consists of a novel multi-channel heterogeneous watermarking approach designed to seamlessly amalgamate distinctive advantages from diverse watermarks. Coupled with a series of suggested enhancements, RingID exhibits substantial advancements in multi-key identification.
https://arxiv.org/abs/2404.14055
Existing brain tumor segmentation methods usually utilize multiple Magnetic Resonance Imaging (MRI) modalities in brain tumor images for segmentation, which can achieve better segmentation performance. However, in clinical applications, some modalities are missing due to resource constraints, leading to severe degradation in the performance of methods applying complete modality segmentation. In this paper, we propose a Multimodal feature distillation with Convolutional Neural Network (CNN)-Transformer hybrid network (MCTSeg) for accurate brain tumor segmentation with missing modalities. We first design a Multimodal Feature Distillation (MFD) module to distill feature-level multimodal knowledge into different unimodality to extract complete modality information. We further develop a Unimodal Feature Enhancement (UFE) module to model the relationship between global and local information semantically. Finally, we build a Cross-Modal Fusion (CMF) module to explicitly align the global correlations among different modalities even when some modalities are missing. Complementary features within and across different modalities are refined via the CNN-Transformer hybrid architectures in both the UFE and CMF modules, where local and global dependencies are both captured. Our ablation study demonstrates the importance of the proposed modules with CNN-Transformer networks and the convolutional blocks in Transformer for improving the performance of brain tumor segmentation with missing modalities. Extensive experiments on the BraTS2018 and BraTS2020 datasets show that the proposed MCTSeg framework outperforms the state-of-the-art methods in missing modalities cases. Our code is available at: this https URL.
https://arxiv.org/abs/2404.14019
Deep learning has shown great power in the field of fault detection. However, for incipient faults with tiny amplitudes, the detection performance of current deep learning networks (DLNs) is not satisfactory. Even if prior information about the faults is utilized, DLNs cannot successfully detect faults 3, 9, and 15 in the Tennessee Eastman process (TEP). These faults are notoriously difficult to detect, and the field of fault detection lacks effective technologies for them. In this work, we propose the Autoencoder-assisted Feature Ensemble Net (AE-FENet): a deep feature ensemble framework that uses an unsupervised autoencoder to conduct the feature transformation. Compared with the principal component analysis (PCA) technique adopted in the original Feature Ensemble Net (FENet), the autoencoder can mine more exact features of incipient faults, which results in the better detection performance of AE-FENet. With the same kinds of basic detectors, AE-FENet achieves a state-of-the-art average accuracy of over 96% on faults 3, 9, and 15 in TEP, which represents a significant enhancement in performance compared to other methods. Plenty of experiments have been conducted to extend our framework, proving that DLNs can be utilized efficiently within this architecture.
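The feature transformation step can be sketched with a toy linear autoencoder trained by gradient descent: its bottleneck activations replace PCA scores as the features handed to the ensemble's basic detectors. The data, layer sizes, and learning rate here are illustrative stand-ins, not TEP settings.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))            # stand-in process measurements
W_enc = rng.normal(scale=0.1, size=(8, 3))   # encoder: 8 -> 3 bottleneck
W_dec = rng.normal(scale=0.1, size=(3, 8))   # decoder: 3 -> 8 reconstruction

mse_init = ((X @ W_enc @ W_dec - X) ** 2).mean()
lr = 0.01
for _ in range(500):
    Z = X @ W_enc                        # bottleneck features
    err = Z @ W_dec - X                  # reconstruction error
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
mse_final = ((X @ W_enc @ W_dec - X) ** 2).mean()

features = X @ W_enc                     # transformed features for the detectors
```

Unlike PCA, which is restricted to the top linear variance directions, a trained (and in general nonlinear) autoencoder can allocate its bottleneck to whatever structure best reconstructs the data, which is the motivation for swapping it in.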
https://arxiv.org/abs/2404.13941