Increasing demands on medical imaging departments are taking a toll on the radiologist's ability to deliver timely and accurate reports. Recent technological advances in artificial intelligence have demonstrated great potential for automatic radiology report generation (ARRG), sparking an explosion of research. This survey paper conducts a methodological review of contemporary ARRG approaches by way of (i) assessing datasets based on characteristics, such as availability, size, and adoption rate, (ii) examining deep learning training methods, such as contrastive learning and reinforcement learning, (iii) exploring state-of-the-art model architectures, including variations of CNN and transformer models, (iv) outlining techniques integrating clinical knowledge through multimodal inputs and knowledge graphs, and (v) scrutinising current model evaluation techniques, including commonly applied NLP metrics and qualitative clinical reviews. Furthermore, the quantitative results of the reviewed models are analysed, where the top performing models are examined to seek further insights. Finally, potential new directions are highlighted, with the adoption of additional datasets from other radiological modalities and improved evaluation methods predicted as important areas of future development.
https://arxiv.org/abs/2405.10842
Automated driving fundamentally requires knowledge about the surrounding geometry of the scene. Modern approaches use only captured images to predict occupancy maps that represent this geometry. Training these approaches requires accurate data, which may be acquired with the help of LiDAR scanners. We show that the techniques used for current benchmarks and training datasets to convert LiDAR scans into occupancy grid maps yield very low-quality maps, and subsequently present a novel approach using evidence theory that yields more accurate reconstructions. We demonstrate that these are superior by a large margin, both qualitatively and quantitatively, and that we additionally obtain meaningful uncertainty estimates. When converting the occupancy maps back to depth estimates and comparing them with the raw LiDAR measurements, our method yields an MAE improvement of 30% to 52% on nuScenes and 53% on Waymo over other occupancy ground-truth data. Finally, we use the improved occupancy maps to train a state-of-the-art occupancy prediction method and demonstrate that it improves the MAE by 25% on nuScenes.
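The abstract does not spell out the evidential formulation, but a minimal sketch of the general technique, fusing per-cell belief masses over {free, occupied} with Dempster's rule so that the residual mass on the full frame acts as the uncertainty estimate, might look as follows (the three-channel mass layout is an assumption for illustration, not the paper's method):

```python
import numpy as np

def fuse_evidence(m1, m2):
    """Dempster's rule of combination for per-cell belief masses.

    Each mass array has shape (..., 3): (m_free, m_occupied, m_unknown),
    where m_unknown is the mass on the full frame {free, occupied}.
    """
    f1, o1, u1 = m1[..., 0], m1[..., 1], m1[..., 2]
    f2, o2, u2 = m2[..., 0], m2[..., 1], m2[..., 2]

    conflict = f1 * o2 + o1 * f2                 # mass on contradictory evidence
    norm = np.clip(1.0 - conflict, 1e-9, None)   # renormalisation constant

    free = (f1 * f2 + f1 * u2 + u1 * f2) / norm
    occ = (o1 * o2 + o1 * u2 + u1 * o2) / norm
    unknown = (u1 * u2) / norm
    return np.stack([free, occ, unknown], axis=-1)

# One cell seen as likely free by scan 1 and weakly occupied by scan 2:
m1 = np.array([0.7, 0.0, 0.3])
m2 = np.array([0.0, 0.2, 0.8])
print(fuse_evidence(m1, m2))  # free dominates; leftover unknown mass = uncertainty
```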
https://arxiv.org/abs/2405.10575
In cardiac Magnetic Resonance Imaging (MRI) analysis, simultaneous myocardial segmentation and T2 quantification are crucial for assessing myocardial pathologies. Existing methods often address these tasks separately, limiting their synergistic potential. To address this, we propose SQNet, a dual-task network integrating Transformer and Convolutional Neural Network (CNN) components. SQNet features a T2-refine fusion decoder for quantitative analysis, leveraging global features from the Transformer, and a segmentation decoder with multiple local region supervision for enhanced accuracy. A tight coupling module aligns and fuses CNN and Transformer branch features, enabling SQNet to focus on myocardium regions. Evaluation on healthy controls (HC) and acute myocardial infarction (AMI) patients demonstrates superior segmentation Dice scores (89.3/89.2) compared to state-of-the-art methods (87.7/87.9). T2 quantification yields strong linear correlations (Pearson coefficients: 0.84/0.93) with label values for HC/AMI, indicating accurate mapping. Radiologist evaluations confirm SQNet's superior image quality scores (4.60/4.58 for segmentation, 4.32/4.42 for T2 quantification) over state-of-the-art methods (4.50/4.44 for segmentation, 3.59/4.37 for T2 quantification). SQNet thus offers accurate simultaneous segmentation and quantification, enhancing the diagnosis of cardiac diseases such as AMI.
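As a concrete reading of the reported evaluation, a minimal sketch of the two metrics named in the abstract, Dice overlap for segmentation and Pearson correlation for T2 quantification, could look like this (function names and the myocardium-masked correlation are illustrative assumptions, not the authors' code):

```python
import numpy as np
from scipy.stats import pearsonr

def dice_score(pred_mask, gt_mask):
    """Dice coefficient between binary myocardium masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 2.0 * inter / denom if denom > 0 else 1.0

def t2_correlation(pred_t2, label_t2, myo_mask):
    """Pearson correlation between predicted and label T2 values,
    restricted to pixels inside the myocardium mask."""
    m = myo_mask.astype(bool)
    r, _ = pearsonr(pred_t2[m], label_t2[m])
    return r
```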
https://arxiv.org/abs/2405.10570
The evolution of Explainable Artificial Intelligence (XAI) has emphasised the significance of meeting diverse user needs. The approaches to identifying and addressing these needs must also advance, recognising that explanation experiences are subjective, user-centred processes that interact with users towards a better understanding of AI decision-making. This paper delves into the interrelations in multi-faceted XAI and examines how different types of explanations collaboratively meet users' XAI needs. We introduce the Intent Fulfilment Framework (IFF) for creating explanation experiences. The novelty of this paper lies in recognising the importance of "follow-up" on explanations for obtaining clarity, verification and/or substitution. Moreover, the Explanation Experience Dialogue Model integrates the IFF and "Explanation Followups" to provide users with a conversational interface for exploring their explanation needs, thereby creating explanation experiences. Quantitative and qualitative findings from our comparative user study demonstrate the impact of the IFF in improving user engagement, the utility of the AI system and the overall user experience. Overall, we reinforce the principle that "one explanation does not fit all" to create explanation experiences that guide the complex interaction through conversation.
https://arxiv.org/abs/2405.10446
This paper addresses the problem of diversity-aware sign language production, where, given an image (or sequence) of a signer, we want to produce another image with the same pose but different attributes (e.g., gender, skin color). To this end, we extend the variational inference paradigm to include information about the pose and the conditioning of the attributes. This formulation improves the quality of the synthesised images. The generator framework is presented as a UNet architecture to ensure spatial preservation of the input pose, and we include the visual features from variational inference to maintain control over appearance and style. We generate each body part with a separate decoder; this architecture allows the generator to deliver better overall results. Experiments on the SMILE II dataset show that the proposed model performs quantitatively better than state-of-the-art baselines regarding diversity, per-pixel image quality, and pose estimation. Qualitatively, it faithfully reproduces non-manual features for signers.
https://arxiv.org/abs/2405.10423
Integrating an RGB camera into a ToF imaging system has become a significant technique for perceiving the real world. The RGB guided ToF imaging system is crucial to several applications, including face anti-spoofing, saliency detection, and trajectory prediction. Depending on the distance of the working range, the implementation schemes of the RGB guided ToF imaging systems are different. Specifically, ToF sensors with a uniform field of illumination, which can output dense depth but have low resolution, are typically used for close-range measurements. In contrast, LiDARs, which emit laser pulses and can only capture sparse depth, are usually employed for long-range detection. In the two cases, depth quality improvement for RGB guided ToF imaging corresponds to two sub-tasks: guided depth super-resolution and guided depth completion. In light of the recent significant boost to the field provided by deep learning, this paper comprehensively reviews the works related to RGB guided ToF imaging, including network structures, learning strategies, evaluation metrics, benchmark datasets, and objective functions. Besides, we present quantitative comparisons of state-of-the-art methods on widely used benchmark datasets. Finally, we discuss future trends and the challenges in real applications for further research.
https://arxiv.org/abs/2405.10357
Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.
https://arxiv.org/abs/2405.10316
Over the past century, the Turkish language has undergone substantial changes, primarily driven by governmental interventions. In this work, our goal is to investigate the evolution of the Turkish language since the establishment of Türkiye in 1923. Thus, we first introduce Turkronicles, a diachronic corpus for Turkish derived from the Official Gazette of Türkiye. Turkronicles contains 45,375 documents detailing governmental actions, making it a pivotal resource for analyzing the linguistic evolution influenced by state policies. In addition, we expand an existing diachronic Turkish corpus consisting of the records of the Grand National Assembly of Türkiye by covering additional years. Next, combining these two diachronic corpora, we seek answers to two main research questions: how has the Turkish vocabulary changed since the 1920s, and how have the writing conventions changed? Our analysis reveals that the vocabularies of two different time periods diverge more as the time between them increases, and newly coined Turkish words take the place of their old counterparts. We also observe changes in writing conventions. In particular, the use of the circumflex noticeably decreases, and words ending with the letters "-b" and "-d" are successively replaced with "-p" and "-t", respectively. Overall, this study quantitatively highlights the dramatic changes in various aspects of Turkish from a diachronic perspective.
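To make the spelling-convention analysis concrete, here is a hedged sketch of how word-final "-b"/"-d" versus "-p"/"-t" frequencies could be tallied per year over such a diachronic corpus (the tokenisation and data layout are assumptions, not the paper's pipeline):

```python
from collections import Counter

def final_letter_trend(docs_by_year, pairs=(("b", "p"), ("d", "t"))):
    """Count word-final voiced vs. voiceless consonants per year.

    docs_by_year: dict mapping year -> list of tokenised documents
    (each a list of words). Returns {year: {"b": n, "p": n, "d": n, "t": n}}.
    """
    trend = {}
    for year, docs in sorted(docs_by_year.items()):
        counts = Counter()
        for doc in docs:
            for word in doc:
                w = word.lower()
                for voiced, voiceless in pairs:
                    if w.endswith(voiced):
                        counts[voiced] += 1
                    elif w.endswith(voiceless):
                        counts[voiceless] += 1
        trend[year] = dict(counts)
    return trend

# e.g. "kitab" -> "kitap": the b-to-p ratio should fall across the decades.
```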
https://arxiv.org/abs/2405.10133
In this paper, we propose an integrated framework for multi-granular explanation of video summarization. This framework integrates methods for producing explanations both at the fragment level (indicating which video fragments influenced the most the decisions of the summarizer) and the more fine-grained visual object level (highlighting which visual objects were the most influential for the summarizer). To build this framework, we extend our previous work on this field, by investigating the use of a model-agnostic, perturbation-based approach for fragment-level explanation of the video summarization results, and introducing a new method that combines the results of video panoptic segmentation with an adaptation of a perturbation-based explanation approach to produce object-level explanations. The performance of the developed framework is evaluated using a state-of-the-art summarization method and two datasets for benchmarking video summarization. The findings of the conducted quantitative and qualitative evaluations demonstrate the ability of our framework to spot the most and least influential fragments and visual objects of the video for the summarizer, and to provide a comprehensive set of visual-based explanations about the output of the summarization process.
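The abstract names a model-agnostic, perturbation-based approach for fragment-level explanation; a minimal sketch, under the assumption of a hypothetical summarizer callable returning a scalar summary score, might look like this (the masking scheme and attribution rule are illustrative choices):

```python
import numpy as np

def fragment_influence(summarizer, video_fragments, n_rounds=100, p_mask=0.3, seed=0):
    """Model-agnostic, perturbation-based fragment-level explanation.

    Randomly masks subsets of fragments, re-scores the summarizer each time,
    and attributes the score drop to the masked fragments. Masked fragments
    are passed as None (the summarizer is assumed to handle this).
    """
    rng = np.random.default_rng(seed)
    n = len(video_fragments)
    baseline = summarizer(video_fragments)       # scalar score of the original summary
    influence = np.zeros(n)
    hits = np.zeros(n)
    for _ in range(n_rounds):
        mask = rng.random(n) < p_mask
        if not mask.any():
            continue
        perturbed = [None if m else f for f, m in zip(video_fragments, mask)]
        drop = baseline - summarizer(perturbed)
        influence[mask] += drop                  # share the drop among masked fragments
        hits[mask] += 1
    return influence / np.maximum(hits, 1)       # average influence per fragment
```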
https://arxiv.org/abs/2405.10082
In recent years, considerable research on LiDAR semantic segmentation has been conducted, introducing several new state-of-the-art models. However, most research focuses on single-scan point clouds, omitting time-sequential information and thereby limiting performance, especially in long-distance outdoor scenarios. Moreover, varying point density and occlusions constitute significant challenges in single-scan approaches. In this paper we propose a LiDAR point cloud preprocessing and postprocessing method. This multi-stage approach, in conjunction with state-of-the-art models in a multi-scan setting, aims to solve those challenges. We demonstrate the benefits of our method through quantitative evaluation against the given models in single-scan settings. In particular, we achieve significant mIoU improvements of over 5 percentage points in the medium range and over 10 percentage points in the far range. This is essential for 3D semantic scene understanding at long distances as well as for applications where offline processing is permissible.
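The exact preprocessing is not given in the abstract; a common multi-scan accumulation pattern, transforming past scans into the reference frame via known sensor poses and appending a per-point time-offset channel, is sketched below as one plausible reading:

```python
import numpy as np

def aggregate_scans(scans, poses, ref_idx=-1):
    """Merge a time sequence of LiDAR scans into the reference scan's frame.

    scans: list of (N_i, 3) point arrays, each in its own sensor frame.
    poses: list of (4, 4) sensor-to-world transforms, one per scan.
    Returns one densified (sum N_i, 4) cloud with a time-offset channel.
    """
    ref_from_world = np.linalg.inv(poses[ref_idx])
    merged = []
    for t, (pts, pose) in enumerate(zip(scans, poses)):
        hom = np.hstack([pts, np.ones((len(pts), 1))])     # homogeneous coords
        in_ref = (ref_from_world @ pose @ hom.T).T[:, :3]  # into reference frame
        dt = np.full((len(pts), 1), t - (len(scans) - 1))  # 0 for newest scan
        merged.append(np.hstack([in_ref, dt]))
    return np.vstack(merged)
```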
https://arxiv.org/abs/2405.10046
In this paper, we review the NTIRE 2024 challenge on Restore Any Image Model (RAIM) in the Wild. The RAIM challenge constructed a benchmark for image restoration in the wild, including real-world images with and without reference ground truth across various scenarios from real applications. The participants were required to restore real-captured images suffering from complex and unknown degradations, where both generative perceptual quality and fidelity are desired in the restoration result. The challenge consisted of two tasks. Task one employed real referenced data pairs, for which quantitative evaluation is available. Task two used unpaired images, and a comprehensive user study was conducted. The challenge attracted more than 200 registrations, of which 39 submitted results, amounting to more than 400 submissions. Top-ranked methods improved the state-of-the-art restoration performance and obtained unanimous recognition from all 18 judges. The proposed datasets are available at this https URL and the homepage of this challenge is at this https URL.
https://arxiv.org/abs/2405.09923
Advances in deep generative models have greatly accelerated video processing tasks such as video enhancement and synthesis. Learning spatio-temporal video models requires capturing the temporal dynamics of a scene in addition to the visual appearance of individual frames. Illumination consistency, which reflects the variation of illumination across dynamic video sequences, plays a vital role in video processing. Unfortunately, to date, no well-accepted quantitative metric has been proposed for evaluating video illumination consistency. In this paper, we propose an illumination histogram consistency (IHC) metric to quantitatively and automatically evaluate the illumination consistency of video sequences. IHC measures the illumination variation of any video sequence based on the illumination histogram discrepancies across all the frames in the sequence. Specifically, given a video sequence, we first estimate the illumination map of each individual frame using the Retinex model. Then, using the illumination maps, we compute the mean illumination histogram of the video sequence by averaging across all frames. Next, we compute the illumination histogram discrepancy between each individual frame and the mean illumination histogram, and sum all the discrepancies to represent the illumination variation of the video sequence. Finally, we obtain the IHC score from the illumination histogram discrepancies via normalization and subtraction operations. Experiments illustrate the performance of the proposed IHC metric and its capability to measure illumination variations in video sequences. The source code is available at this https URL.
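Following the steps listed in the abstract, a hedged sketch of the IHC computation might look like the following; the single-scale Gaussian illumination estimate, the L1 histogram discrepancy, and the final normalisation are assumptions where the abstract leaves the choice open:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def illumination_map(gray_frame, sigma=30):
    """Single-scale Retinex-style illumination estimate: the smooth
    component of the image is taken as the illumination."""
    return gaussian_filter(gray_frame.astype(np.float32) + 1.0, sigma)

def ihc_score(frames, bins=64):
    """Illumination histogram consistency over a grayscale frame sequence."""
    hists = []
    for f in frames:
        illum = illumination_map(f)
        h, _ = np.histogram(illum, bins=bins, range=(0, 256), density=True)
        hists.append(h)
    hists = np.stack(hists)
    mean_hist = hists.mean(axis=0)
    # per-frame discrepancy to the mean histogram (L1 distance here)
    disc = np.abs(hists - mean_hist).sum(axis=1)
    total = disc.sum() / len(frames)      # normalise by sequence length
    return 1.0 - min(total, 1.0)          # higher score = more consistent
```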
https://arxiv.org/abs/2405.09716
Heterogeneous, interconnected, systems-level molecular data have become increasingly available and are key to precision medicine. We need to utilize them to better stratify patients into risk groups, discover new biomarkers and targets, and repurpose known and discover new drugs to personalize medical treatment. Existing methodologies are limited, and a paradigm shift is needed to achieve quantitative and qualitative breakthroughs. In this perspective paper, we survey the literature and argue for the development of a comprehensive, general framework for embedding multi-scale molecular network data that would enable their explainable exploitation in precision medicine in linear time. Network embedding methods map nodes to points in a low-dimensional space, so that proximity in the learned space reflects the network's topology-function relationships. They have recently achieved unprecedented performance on hard problems utilizing limited omics data in various biomedical applications. However, research thus far has been limited to special variants of the problems and data, with performance depending on the underlying topology-function network biology hypotheses, the biomedical applications, and the evaluation metrics. The availability of multi-omic data, modern graph embedding paradigms and compute power calls for the creation and training of efficient, explainable and controllable models, with no potentially dangerous or unexpected behaviour, that make a qualitative breakthrough. We propose to develop a general, comprehensive embedding framework for multi-omic network data, from models to efficient and scalable software implementation, and to apply it to biomedical informatics. It will lead to a paradigm shift in the computational and biomedical understanding of data and diseases that will open up ways to solve some of the major bottlenecks in precision medicine and other domains.
https://arxiv.org/abs/2405.09595
With the benefit of deep learning techniques, recent research has made significant progress in image compression artifact reduction. Despite their improved performance, prevailing methods only focus on learning a mapping from the compressed image to the original one but ignore the intrinsic attributes of the given compressed images, which greatly harms the performance of downstream parsing tasks. Different from these methods, we propose to decouple the intrinsic attributes into two complementary features for artifact reduction, i.e., compression-insensitive features to regularize the high-level semantic representations during training and compression-sensitive features to be aware of the compression degree. To achieve this, we first employ adversarial training to regularize the compressed and original encoded features to retain high-level semantics, and we then develop a compression quality-aware feature encoder for the compression-sensitive features. Based on these dual complementary features, we propose a Dual Awareness Guidance Network (DAGN) that utilizes these awareness features as transformation guidance during the decoding phase. In our proposed DAGN, we develop a cross-feature fusion module to maintain the consistency of compression-insensitive features by fusing them into the artifact-reduction baseline. Our method achieves an average PSNR gain of 2.06 dB on BSD500, outperforming state-of-the-art methods, and requires only 29.7 ms to process one image on BSD500. Besides, the experimental results on LIVE1 and LIU4K also demonstrate the efficiency, effectiveness, and superiority of the proposed method in terms of quantitative metrics, visual quality, and downstream machine vision tasks.
https://arxiv.org/abs/2405.09291
The task of generating dance from music is crucial, yet current methods, which mainly produce joint sequences, lead to outputs that lack intuitiveness and complicate data collection due to the necessity for precise joint annotations. We introduce a Dance Any Beat Diffusion model, namely DabFusion, that employs music as a conditional input to directly create dance videos from still images, utilizing conditional image-to-video generation principles. This approach pioneers the use of music as a conditioning factor in image-to-video synthesis. Our method unfolds in two stages: training an auto-encoder to predict latent optical flow between reference and driving frames, eliminating the need for joint annotation, and training a U-Net-based diffusion model to produce these latent optical flows guided by music rhythm encoded by CLAP. Although capable of producing high-quality dance videos, the baseline model struggles with rhythm alignment. We enhance the model by adding beat information, improving synchronization. We introduce a 2D motion-music alignment score (2D-MM Align) for quantitative assessment. Evaluated on the AIST++ dataset, our enhanced model shows marked improvements in 2D-MM Align score and established metrics. Video results can be found on our project page: this https URL.
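The paper's exact 2D-MM Align definition is not given in the abstract; a common beat-alignment formulation from the dance-generation literature, scoring the proximity of each music beat to the nearest motion beat (e.g., a local minimum of 2D keypoint velocity), gives the flavour. The Gaussian kernel and sigma are assumptions:

```python
import numpy as np

def beat_align(music_beats, motion_beats, sigma=0.1):
    """Average Gaussian proximity of each music beat to its nearest motion beat.

    music_beats, motion_beats: 1-D arrays of beat times in seconds.
    Returns a score in (0, 1]; higher means better rhythm alignment.
    """
    motion_beats = np.asarray(motion_beats)
    scores = []
    for b in np.asarray(music_beats):
        nearest = np.min(np.abs(motion_beats - b))
        scores.append(np.exp(-nearest**2 / (2 * sigma**2)))
    return float(np.mean(scores))
```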
https://arxiv.org/abs/2405.09266
It remains a challenge to effectively control the emotion rendering in text-to-speech (TTS) synthesis. Prior studies have primarily focused on learning a global prosodic representation at the utterance level, which strongly correlates with linguistic prosody. Our goal is to construct a hierarchical emotion distribution (ED) that effectively encapsulates intensity variations of emotions at various levels of granularity, encompassing phonemes, words, and utterances. During TTS training, the hierarchical ED is extracted from the ground-truth audio and guides the predictor to establish a connection between emotional and linguistic prosody. At run-time inference, the TTS model generates emotional speech and, at the same time, provides quantitative control of emotion over the speech constituents. Both objective and subjective evaluations validate the effectiveness of the proposed framework in terms of emotion prediction and control.
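As one possible reading of the hierarchical ED, here is a sketch that aggregates per-phoneme emotion intensities up to the word and utterance levels; mean pooling is an assumed aggregation, not necessarily the paper's:

```python
import numpy as np

def hierarchical_ed(phoneme_intensity, word_spans):
    """Build a phoneme/word/utterance hierarchy of emotion intensities.

    phoneme_intensity: (n_phonemes, n_emotions) soft intensity per phoneme.
    word_spans: list of (start, end) phoneme index ranges, one per word.
    """
    phoneme_intensity = np.asarray(phoneme_intensity)
    word_level = np.stack(
        [phoneme_intensity[s:e].mean(axis=0) for s, e in word_spans]
    )
    utterance_level = phoneme_intensity.mean(axis=0)
    return {
        "phoneme": phoneme_intensity,   # finest granularity
        "word": word_level,             # pooled over each word's phonemes
        "utterance": utterance_level,   # pooled over the whole utterance
    }
```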
https://arxiv.org/abs/2405.09171
In this article, we explore the potential of using latent diffusion models, a family of powerful generative models, for the task of reconstructing naturalistic music from electroencephalogram (EEG) recordings. Unlike simpler music with limited timbres, such as MIDI-generated tunes or monophonic pieces, the focus here is on intricate music featuring a diverse array of instruments, voices, and effects, rich in harmonics and timbre. This study represents an initial foray into achieving general music reconstruction of high-quality using non-invasive EEG data, employing an end-to-end training approach directly on raw data without the need for manual pre-processing and channel selection. We train our models on the public NMED-T dataset and perform quantitative evaluation proposing neural embedding-based metrics. We additionally perform song classification based on the generated tracks. Our work contributes to the ongoing research in neural decoding and brain-computer interfaces, offering insights into the feasibility of using EEG data for complex auditory information reconstruction.
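The abstract proposes neural embedding-based metrics without detailing them; one standard instance of such a metric is a Fréchet distance over encoder embeddings, in the spirit of Fréchet Audio Distance (the choice of pretrained audio encoder is left open):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_embedding_distance(emb_real, emb_gen):
    """Fréchet distance between two sets of neural embeddings.

    emb_real, emb_gen: (n_samples, d) embeddings of the reference and
    reconstructed audio from any pretrained audio encoder.
    """
    mu_r, mu_g = emb_real.mean(0), emb_gen.mean(0)
    cov_r = np.cov(emb_real, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real   # numerical noise can add tiny imaginary parts
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```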
https://arxiv.org/abs/2405.09062
Recent advancements in deep learning for 3D models have propelled breakthroughs in generation, detection, and scene understanding. However, the effectiveness of these algorithms hinges on large training datasets. We address the challenge by introducing Efficient 3D Seam Carving (E3SC), a novel 3D model augmentation method based on seam carving, which progressively deforms only part of the input model while ensuring the overall semantics are unchanged. Experiments show that our approach is capable of producing diverse and high-quality augmented 3D shapes across various types and styles of input models, achieving considerable improvements over previous methods. Quantitative evaluations demonstrate that our method effectively enhances the novelty and quality of shapes generated by other subsequent 3D generation algorithms.
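E3SC's 3D formulation is not detailed in the abstract, but it builds on seam carving, whose classic 2D dynamic program is sketched here for reference (the per-pixel energy, e.g. gradient magnitude, is left to the caller):

```python
import numpy as np

def min_energy_seam(energy):
    """Classic 2-D seam carving: find the minimal-cost top-to-bottom seam.

    energy: (H, W) non-negative per-pixel energy map.
    Returns one column index per row.
    """
    H, W = energy.shape
    cost = energy.astype(float)
    for i in range(1, H):
        left = np.r_[np.inf, cost[i - 1, :-1]]    # upper-left neighbour
        right = np.r_[cost[i - 1, 1:], np.inf]    # upper-right neighbour
        cost[i] += np.minimum(np.minimum(left, cost[i - 1]), right)
    seam = np.empty(H, dtype=int)
    seam[-1] = int(np.argmin(cost[-1]))
    for i in range(H - 2, -1, -1):                # backtrack through the DP table
        j = seam[i + 1]
        lo, hi = max(j - 1, 0), min(j + 2, W)
        seam[i] = lo + int(np.argmin(cost[i, lo:hi]))
    return seam
```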
https://arxiv.org/abs/2405.09050
Transformer-based long context generative models power emerging AI applications like hour-long video understanding and project-level coding agents. Deploying long context transformers (e.g., 100K to 10M tokens) is prohibitively expensive compared to short context (e.g., 4K tokens) model variants. Reducing the cost of long-context transformers is becoming a pressing research and engineering challenge starting in 2024. This work describes a concurrent programming framework for quantitatively analyzing the efficiency challenges in serving multiple long-context requests under a limited GPU high-bandwidth memory (HBM) regime. We give a detailed analysis of how all additional computational costs, compared to 4K context, trace back to one single source: the large size of the KV cache. We use a 34B GPT-3.5 level model of 50K context on A100 NVLink as a running example, and describe how its large KV cache causes four types of deployment challenges: (1) prefilling long inputs takes much longer compute time and GPU memory than short inputs; (2) after prefilling, the large KV cache residing on the GPU HBM substantially restricts the number of concurrent users being served; (3) during decoding, repeatedly reading the KV cache from HBM to SM largely increases latency; (4) when KV cache memory overflows, swapping it from HBM to DDR causes significant context switching latency. We use this framework to analyze existing works and identify possibilities of combining them to build end-to-end systems. Overall, this work offers a foundational framework for analyzing long context transformer deployment and identifies directions towards reducing the inference cost of 1M context to be as cheap as 4K.
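The scale of challenges (1) and (2) follows from simple arithmetic on the KV cache size; the sketch below uses a hypothetical 34B-class configuration (the layer and head counts are illustrative assumptions, not taken from the paper):

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per=2):
    """KV cache for one request: 2 (K and V) x layers x heads x dim x tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per / 1e9

# Hypothetical 34B-class config: 48 layers, 8 KV heads of dim 128 (GQA),
# fp16 cache, 50K-token context.
per_user = kv_cache_gb(48, 8, 128, 50_000)
print(per_user)                 # ~9.8 GB of cache per 50K-token request

# On an 80 GB A100, after ~68 GB of fp16 weights for a 34B model, roughly
# one such user fits in HBM before the cache must spill to DDR.
print((80 - 68) / per_user)     # ~1.2 concurrent 50K-context users
```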
https://arxiv.org/abs/2405.08944
The increasing complexity of Artificial Intelligence models poses challenges to interpretability, particularly in the healthcare sector. This study investigates the relationship between deep learning model complexity and Explainable AI (XAI) efficacy, utilizing four ResNet architectures (ResNet-18, 34, 50, 101). Through methodical experimentation on 4,369 lung X-ray images of COVID-19-infected and healthy patients, the research evaluates the models' classification performance and the relevance of the corresponding XAI explanations with respect to ground-truth disease masks. Results indicate that increased model complexity is associated with decreased classification accuracy and AUC-ROC scores (ResNet-18: 98.4%, 0.997; ResNet-101: 95.9%, 0.988). Notably, in eleven out of twelve statistical tests performed, no statistically significant differences occurred in the XAI quantitative metrics (Relevance Rank Accuracy and the proposed Positive Attribution Ratio) across the trained models. These results suggest that increased model complexity does not consistently lead to higher performance or more relevant explanations of models' decision-making processes.
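Relevance Rank Accuracy is an established localisation metric: the fraction of the top-K attributed pixels, with K equal to the ground-truth mask size, that fall inside the mask. The Positive Attribution Ratio is the paper's own proposal and is read here as the share of positive attribution mass inside the mask, an assumption, since the abstract does not define it:

```python
import numpy as np

def relevance_rank_accuracy(attribution, gt_mask):
    """Fraction of the top-K attributed pixels inside the binary
    ground-truth mask, with K = mask size (mask assumed non-empty)."""
    k = int(gt_mask.sum())
    flat = attribution.ravel()
    top_k = np.argsort(flat)[-k:]            # indices of the K most relevant pixels
    return gt_mask.ravel()[top_k].mean()

def positive_attribution_ratio(attribution, gt_mask):
    """Share of total positive attribution mass inside the mask
    (an assumed reading of the paper's proposed metric)."""
    pos = np.clip(attribution, 0, None)
    total = pos.sum()
    return pos[gt_mask.astype(bool)].sum() / total if total > 0 else 0.0
```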
https://arxiv.org/abs/2405.08658