In the e-commerce domain, the accurate extraction of attribute-value pairs from product listings (e.g., Brand: Apple) is crucial for enhancing search and recommendation systems. The automation of this extraction process is challenging due to the vast diversity of product categories and their respective attributes, compounded by the lack of extensive, accurately annotated training datasets and the demand for low latency to meet the real-time needs of e-commerce platforms. To address these challenges, we introduce GenToC, a novel two-stage model for extracting attribute-value pairs from product titles. GenToC is designed to train with partially-labeled data, leveraging incomplete attribute-value pairs and obviating the need for a fully annotated dataset. Moreover, we introduce a bootstrapping method that enables GenToC to progressively refine and expand its training dataset. This enhancement substantially improves the quality of data available for training other neural network models that are typically faster but are inherently less capable than GenToC in terms of their capacity to handle partially-labeled data. By supplying an enriched dataset for training, GenToC significantly advances the performance of these alternative models, making them more suitable for real-time deployment. Our results highlight the unique capability of GenToC to learn from a limited set of labeled data and to contribute to the training of more efficient models, marking a significant leap forward in the automated extraction of attribute-value pairs from product titles. GenToC has been successfully integrated into India's largest B2B e-commerce platform, this http URL, achieving a significant increase of 21.1% in recall over the existing deployed system while maintaining a high precision of 89.5% in this challenging task.
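To make the bootstrapping loop concrete, here is a minimal Python sketch under stated assumptions: a model trained on partially-labeled titles re-annotates the corpus, and high-confidence predictions are merged back into the training set for the next round. The `train_on`/`predict` interface, the confidence gate, and the data layout are hypothetical placeholders, not GenToC's actual API.

```python
# Hypothetical sketch of dataset bootstrapping with a partially-labeled corpus.
# corpus: list of (title, known_pairs) where known_pairs is a possibly
# incomplete dict of attribute -> value annotations.
def bootstrap(model, corpus, rounds=3, min_conf=0.9):
    dataset = [(title, pairs) for title, pairs in corpus if pairs]
    for _ in range(rounds):
        model.train_on(dataset)                      # assumed training hook
        dataset = []
        for title, known_pairs in corpus:
            predicted = {attr: val                   # keep confident guesses
                         for attr, val, conf in model.predict(title)
                         if conf >= min_conf}
            predicted.update(known_pairs)            # human labels win
            dataset.append((title, predicted))
    return model, dataset                            # enriched training data
```

The enriched dataset returned here is what the abstract proposes feeding to the faster downstream models.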
https://arxiv.org/abs/2405.10918
Knowledge-intensive tasks pose a significant challenge for Machine Learning (ML) techniques. Commonly adopted methods, such as Large Language Models (LLMs), often exhibit limitations when applied to such tasks. Nevertheless, there have been notable endeavours to mitigate these challenges, with a significant emphasis on augmenting LLMs through Knowledge Graphs (KGs). While KGs provide many advantages for representing knowledge, their development costs can deter extensive research and applications. Addressing this limitation, we introduce a framework for enriching embeddings of small-scale domain-specific Knowledge Graphs with well-established general-purpose KGs. Adopting our method, a modest domain-specific KG can benefit from a performance boost in downstream tasks when linked to a substantial general-purpose KG. Experimental evaluations demonstrate a notable enhancement, with up to a 44% increase observed in the Hits@10 metric. This relatively unexplored research direction can catalyze more frequent incorporation of KGs in knowledge-intensive tasks, resulting in more robust, reliable ML implementations that hallucinate less than prevalent LLM solutions. Keywords: knowledge graph, knowledge graph completion, entity alignment, representation learning, machine learning
https://arxiv.org/abs/2405.10745
Robotic systems driven by artificial muscles present unique challenges due to the nonlinear dynamics of actuators and the complex designs of mechanical structures. Traditional model-based controllers often struggle to achieve desired control performance in such systems. Deep reinforcement learning (DRL), a trending machine learning technique widely adopted in robot control, offers a promising alternative. However, integrating DRL into these robotic systems faces significant challenges, including the requirement for large amounts of training data and the inevitable sim-to-real gap when deployed to real-world robots. This paper proposes an efficient reinforcement learning control framework with sim-to-real transfer to address these challenges. Bootstrap and augmentation enhancements are designed to improve the data efficiency of baseline DRL algorithms, while a sim-to-real transfer technique, namely randomization of muscle dynamics, is adopted to bridge the gap between simulation and real-world deployment. Extensive experiments and ablation studies are conducted utilizing two string-type artificial muscle-driven robotic systems, including a two-degree-of-freedom robotic eye and a parallel robotic wrist, the results of which demonstrate the effectiveness of the proposed learning control strategy.
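As an illustration of the muscle-dynamics randomization, the sketch below perturbs nominal actuator parameters at each episode reset; the parameter names, ranges, and the `sim` interface are assumptions for illustration, not the paper's values.

```python
import random

# Illustrative nominal actuator parameters (not the paper's values).
NOMINAL = {"stiffness": 120.0, "damping": 4.0, "contraction_gain": 1.0}

def randomized_muscle_params(scale=0.2):
    """Uniformly perturb each nominal parameter by up to +/- scale,
    so the policy never trains against a single fixed dynamics model."""
    return {name: value * random.uniform(1 - scale, 1 + scale)
            for name, value in NOMINAL.items()}

# At every episode reset, the simulator would be reconfigured, e.g.:
# sim.set_muscle_params(**randomized_muscle_params())  # hypothetical API
```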
https://arxiv.org/abs/2405.10576
Advances in endoscopy use in surgeries face challenges like inadequate lighting. Deep learning, notably the Denoising Diffusion Probabilistic Model (DDPM), holds promise for low-light image enhancement in the medical field. However, DDPMs are computationally demanding and slow, limiting their practical medical applications. To bridge this gap, we propose a lightweight DDPM, dubbed LighTDiff. It adopts a T-shape model architecture to capture global structural information using low-resolution images and gradually recover the details in subsequent denoising steps. We further prune the model to significantly reduce the model size while retaining performance. Since discarding certain downsampling operations to save parameters leads to instability and slow convergence during training, we introduce a Temporal Light Unit (TLU), a plug-and-play module, for more stable training and better performance. TLU associates time steps with denoised image features, establishing temporal dependencies between the denoising steps and improving denoising outcomes. Moreover, while recovering images using the diffusion model, potential spectral shifts were noted. We further introduce a Chroma Balancer (CB) to mitigate this issue. Our LighTDiff outperforms many competitive LLIE methods with exceptional computational efficiency.
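One plausible reading of "associating time steps with denoised image features" is feature modulation conditioned on a time-step embedding, as in the hedged PyTorch sketch below; the actual TLU design is not specified at this level in the abstract, so treat this only as an illustration of the mechanism.

```python
import torch
import torch.nn as nn

class TimeConditionedUnit(nn.Module):
    """Illustrative time-step-conditioned feature modulation (not the
    published TLU): a learned embedding of the diffusion step produces
    per-channel scale and shift applied to the feature map."""
    def __init__(self, channels, t_dim=128, num_steps=1000):
        super().__init__()
        self.embed = nn.Embedding(num_steps, t_dim)
        self.to_scale_shift = nn.Linear(t_dim, 2 * channels)

    def forward(self, feats, t):
        # feats: (B, C, H, W); t: (B,) integer diffusion steps.
        scale, shift = self.to_scale_shift(self.embed(t)).chunk(2, dim=-1)
        return feats * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
```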
https://arxiv.org/abs/2405.10550
Federated learning (FL) represents a pivotal shift in machine learning (ML) as it enables collaborative training of local ML models coordinated by a central aggregator, all without the need to exchange local data. However, its application on edge devices is hindered by limited computational capabilities and data communication challenges, compounded by the inherent complexity of Deep Learning (DL) models. Model pruning is identified as a key technique for compressing DL models on devices with limited resources. Nonetheless, conventional pruning techniques typically rely on manually crafted heuristics and demand human expertise to achieve a balance between model size, speed, and accuracy, often resulting in sub-optimal solutions. In this study, we introduce an automated federated learning approach utilizing informed pruning, called AutoFLIP, which dynamically prunes and compresses DL models within both the local clients and the global server. It leverages a federated loss exploration phase to investigate model gradient behavior across diverse datasets and losses, providing insights into parameter significance. Our experiments showcase notable enhancements in scenarios with strong non-IID data, underscoring AutoFLIP's capacity to tackle computational constraints and achieve superior global convergence.
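A hedged sketch of what "informed pruning" from a loss-exploration phase could look like: accumulated gradient magnitudes serve as a significance score, and the least significant weights are zeroed. AutoFLIP's exact scoring and its client/server aggregation are not reproduced here.

```python
import torch

def prune_by_gradient_signal(model, grad_accum, sparsity=0.5):
    """grad_accum: dict mapping parameter name -> accumulated |gradient|
    tensor gathered during the loss exploration phase (assumed given)."""
    for name, param in model.named_parameters():
        scores = grad_accum[name]
        k = int(param.numel() * sparsity)      # number of weights to drop
        if k == 0:
            continue
        threshold = scores.flatten().kthvalue(k).values
        param.data[scores <= threshold] = 0.0  # zero out low-signal weights
```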
https://arxiv.org/abs/2405.10271
Text-to-speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Luganda Common Voice recordings of multiple speakers aged between 20 and 49. Although the generated speech is intelligible, it is still of lower quality than that of a model trained on studio-grade recordings. This is due to the insufficient data preprocessing methods applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations, as well as background noise. In this paper, we show that the quality of Luganda TTS from Common Voice can improve by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation, determined by subjectively listening to and comparing their voice recordings. In addition to trimming out silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pre-trained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to retain only recordings with an estimated MOS over 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
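The preprocessing pipeline described above can be sketched as follows; `enhance` and `estimate_mos` stand in for the unnamed pre-trained enhancement and MOS-estimation models, so only the trim/enhance/filter flow is asserted here.

```python
import librosa
import soundfile as sf

MOS_THRESHOLD = 3.5  # keep only clips with estimated MOS above this

def preprocess_clip(path, enhance, estimate_mos):
    """Trim silence, denoise, and quality-gate one recording.
    Returns True and writes the cleaned clip if it passes the filter."""
    audio, sr = librosa.load(path, sr=None)
    audio, _ = librosa.effects.trim(audio, top_db=30)  # strip edge silence
    audio = enhance(audio, sr)                         # enhancement model
    if estimate_mos(audio, sr) <= MOS_THRESHOLD:       # quality gate
        return False
    sf.write(path.replace(".wav", "_clean.wav"), audio, sr)
    return True
```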
https://arxiv.org/abs/2405.10211
Monaural speech enhancement on drones is challenging because the ego-noise from the rotating motors and propellers leads to extremely low signal-to-noise ratios at onboard microphones. Although recent masking-based deep neural network methods excel in monaural speech enhancement, they struggle in the challenging drone noise scenario. Furthermore, existing drone noise datasets are limited, causing models to overfit. Considering the harmonic nature of drone noise, this paper proposes a frequency-domain bottleneck adapter to enable transfer learning. Specifically, the adapter's parameters are trained on drone noise while the parameters of the pre-trained Frequency Recurrent Convolutional Recurrent Network (FRCRN) are kept fixed. Evaluation results demonstrate the proposed method can effectively enhance speech quality. Moreover, it is a more efficient alternative to fine-tuning models for various drone types, which typically requires substantial computational resources.
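The adapter-style transfer described here boils down to freezing the pre-trained network and training only a small bottleneck module, as in this minimal PyTorch sketch; the stand-in backbone and layer sizes are illustrative, not the FRCRN architecture.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter: project down, nonlinearity, project up."""
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# Stand-in for the pre-trained enhancement backbone (not FRCRN itself).
backbone = nn.Sequential(nn.Linear(257, 257), nn.ReLU(), nn.Linear(257, 257))
adapter = BottleneckAdapter(dim=257)

for p in backbone.parameters():
    p.requires_grad = False                    # pre-trained weights stay fixed

optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)  # adapter only
```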
https://arxiv.org/abs/2405.10022
Infrared (IR) image super-resolution faces challenges from homogeneous background pixel distributions and sparse target regions, requiring models that effectively handle long-range dependencies and capture detailed local-global information. Recent advancements in Mamba-based models (selective structured state space models) have shown significant potential in visual tasks, suggesting their applicability for IR enhancement. In this work, we introduce IRSRMamba: Infrared Image Super-Resolution via Mamba-based Wavelet Transform Feature Modulation Model, a novel Mamba-based model designed specifically for IR image super-resolution. This model enhances the restoration of context-sparse target details through its advanced dependency modeling capabilities. Additionally, a new wavelet transform feature modulation block improves multi-scale receptive field representation, capturing both global and local information efficiently. Comprehensive evaluations confirm that IRSRMamba outperforms existing models on multiple benchmarks. This research advances IR super-resolution and demonstrates the potential of Mamba-based models in IR image processing. Code is available at \url{this https URL}.
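For intuition on the wavelet side of the modulation block, a 2D discrete wavelet transform splits a feature map into one low-frequency and three directional high-frequency sub-bands, which is one standard way to obtain the multi-scale view such a block operates on; IRSRMamba's actual fusion is not shown.

```python
import numpy as np
import pywt

feature_map = np.random.rand(64, 64).astype(np.float32)  # toy feature map
low, (horiz, vert, diag) = pywt.dwt2(feature_map, "haar")
# low: 32x32 coarse structure; horiz/vert/diag: 32x32 detail sub-bands.
multi_scale = np.stack([low, horiz, vert, diag])  # channels for modulation
```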
https://arxiv.org/abs/2405.09873
Temporal knowledge graphs (TKGs) can effectively model the ever-evolving nature of real-world knowledge, and their completeness and enhancement can be achieved by reasoning new events from existing ones. However, reasoning accuracy is adversely impacted due to an imbalance between new and recurring events in the datasets. To achieve more accurate TKG reasoning, we propose an attention masking-based contrastive event network (AMCEN) with local-global temporal patterns for the two-stage prediction of future events. In the network, historical and non-historical attention mask vectors are designed to control the attention bias towards historical and non-historical entities, acting as the key to alleviating the imbalance. A local-global message-passing module is proposed to comprehensively consider and capture multi-hop structural dependencies and local-global temporal evolution for the in-depth exploration of latent impact factors of different event types. A contrastive event classifier is used to classify events more accurately by incorporating local-global temporal patterns into contrastive learning. Therefore, AMCEN refines the prediction scope with the results of the contrastive event classification, followed by utilizing attention masking-based decoders to finalize the specific outcomes. The results of our experiments on four benchmark datasets highlight the superiority of AMCEN. Especially, the considerable improvements in Hits@1 prove that AMCEN can make more precise predictions about future occurrences.
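The historical/non-historical masking can be illustrated in a few lines: one boolean mask restricts attention to entities seen in the query's history, and its complement restricts it to unseen ones. AMCEN's actual mask construction and two-stage decoders are more involved than this sketch.

```python
import torch

def masked_attention(scores, historical, use_history):
    """scores: (num_entities,) raw attention logits;
    historical: boolean mask of entities in the query's history."""
    mask = historical if use_history else ~historical
    return torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1)
```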
https://arxiv.org/abs/2405.10346
The advances in deep generative models have greatly accelerated the progress of video processing tasks such as video enhancement and synthesis. Learning spatio-temporal video models requires capturing the temporal dynamics of a scene, in addition to the visual appearance of individual frames. Illumination consistency, which reflects the variations of illumination in dynamic video sequences, plays a vital role in video processing. Unfortunately, to date, no well-accepted quantitative metric has been proposed for video illumination consistency evaluation. In this paper, we propose an illumination histogram consistency (IHC) metric to quantitatively and automatically evaluate the illumination consistency of video sequences. IHC measures the illumination variation of any video sequence based on the illumination histogram discrepancies across all the frames in the video sequence. Specifically, given a video sequence, we first estimate the illumination map of each individual frame using the Retinex model; then, using the illumination maps, the mean illumination histogram of the video sequence is computed by averaging across all the frames; next, we compute the illumination histogram discrepancy between each individual frame and the mean illumination histogram and sum up all the illumination histogram discrepancies to represent the illumination variations of the video sequence; finally, we obtain the IHC score from the illumination histogram discrepancies via normalization and subtraction operations. Experiments are conducted to illustrate the performance of the proposed IHC metric and its capability to measure the illumination variations in video sequences. The source code is available at \url{this https URL}.
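The steps above translate almost directly into code. In the sketch below the Retinex-based illumination estimate is simplified to a blurred luminance map and the final normalization is an assumption, so treat it as one reading of the recipe rather than the released implementation.

```python
import cv2
import numpy as np

def illumination_map(frame_bgr):
    """Crude single-scale Retinex-style estimate: the low-frequency
    component of the luminance channel."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
    return cv2.GaussianBlur(gray, (0, 0), sigmaX=15)

def ihc_score(frames, bins=64):
    hists = []
    for frame in frames:
        illum = illumination_map(frame)
        h, _ = np.histogram(illum, bins=bins, range=(0, 256))
        hists.append(h / max(h.sum(), 1))      # normalized histogram
    hists = np.stack(hists)                    # (num_frames, bins)
    mean_hist = hists.mean(axis=0)             # mean illumination histogram
    discrepancy = np.abs(hists - mean_hist).sum()  # summed L1 discrepancies
    return 1.0 - discrepancy / len(frames)     # assumed normalization step
```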
https://arxiv.org/abs/2405.09716
This paper introduces the Global-Local Image Perceptual Score (GLIPS), an image metric designed to assess the photorealistic image quality of AI-generated images with a high degree of alignment to human visual perception. Traditional metrics such as FID and KID scores do not align closely with human evaluations. The proposed metric incorporates advanced transformer-based attention mechanisms to assess local similarity and Maximum Mean Discrepancy (MMD) to evaluate global distributional similarity. To evaluate the performance of GLIPS, we conducted a human study on photorealistic image quality. Comprehensive tests across various generative models demonstrate that GLIPS consistently outperforms existing metrics like FID, SSIM, and MS-SSIM in terms of correlation with human scores. Additionally, we introduce the Interpolative Binning Scale (IBS), a refined scaling method that enhances the interpretability of metric scores by aligning them more closely with human evaluative standards. The proposed metric and scaling approach not only provide more reliable assessments of AI-generated images but also suggest pathways for future enhancements in image generation technologies.
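The global component rests on the standard (biased) MMD^2 estimator, sketched below with an RBF kernel over image features; the feature extractor and the paper's kernel choice are assumptions.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Pairwise RBF kernel between rows of x (n, d) and y (m, d)."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(real_feats, fake_feats, sigma=1.0):
    """Biased MMD^2 estimate between two (n, d) feature matrices."""
    kxx = rbf_kernel(real_feats, real_feats, sigma).mean()
    kyy = rbf_kernel(fake_feats, fake_feats, sigma).mean()
    kxy = rbf_kernel(real_feats, fake_feats, sigma).mean()
    return kxx + kyy - 2 * kxy
```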
https://arxiv.org/abs/2405.09426
The automation of writing imaging reports is a valuable tool for alleviating the workload of radiologists. Crucial steps in this process involve the cross-modal alignment between medical images and reports, as well as the retrieval of similar historical cases. However, the presence of presentation-style vocabulary (e.g., sentence structure and grammar) in reports poses challenges for cross-modal alignment. Additionally, existing methods for similar historical cases retrieval face suboptimal performance owing to the modal gap issue. In response, this paper introduces a novel method, named Factual Serialization Enhancement (FSE), for chest X-ray report generation. FSE begins with the structural entities approach to eliminate presentation-style vocabulary in reports, providing specific input for our model. Then, uni-modal features are learned through cross-modal alignment between images and factual serialization in reports. Subsequently, we present a novel approach to retrieve similar historical cases from the training set, leveraging aligned image features. These features implicitly preserve semantic similarity with their corresponding reference reports, enabling us to calculate similarity solely among aligned features. This effectively eliminates the modal gap issue for knowledge retrieval without the requirement for disease labels. Finally, the cross-modal fusion network is employed to query valuable information from these cases, enriching image features and aiding the text decoder in generating high-quality reports. Experiments on MIMIC-CXR and IU X-ray datasets from both specific and general scenarios demonstrate the superiority of FSE over state-of-the-art approaches in both natural language generation and clinical efficacy metrics.
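Because retrieval happens entirely in the aligned feature space, the historical-case lookup reduces to nearest-neighbour search, as in this sketch; the aligned encoder producing the features is assumed given.

```python
import numpy as np

def retrieve_similar_cases(query_feat, train_feats, k=5):
    """query_feat: (d,); train_feats: (n, d) aligned image features.
    Returns indices of the k most similar training cases."""
    q = query_feat / np.linalg.norm(query_feat)
    t = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    sims = t @ q                      # cosine similarity, no disease labels
    return np.argsort(-sims)[:k]
```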
https://arxiv.org/abs/2405.09586
In this article, we focus on the critical tasks of plant protection in arable farms, addressing a modern challenge in agriculture: integrating ecological considerations into the operational strategy of precision weeding robots like \bbot. This article presents the recent advancements in weed management algorithms and the real-world performance of \bbot\ at the University of Bonn's Klein-Altendorf campus. We present a novel Rolling-view observation model for the weed monitoring section of BonnBot-I, which leads to an average absolute weeding performance enhancement of $3.4\%$. Furthermore, for the first time, we show how precision weeding robots could consider bio-diversity-aware concerns in challenging weeding scenarios. We carried out comprehensive weeding experiments in sugar-beet fields, covering both weed-only and mixed crop-weed situations, and introduced a new dataset compatible with precision weeding. Our real-field experiments revealed that our weeding approach is capable of handling diverse weed distributions, with a minimal loss of only $11.66\%$ attributable to intervention planning and $14.7\%$ to vision system limitations, highlighting required improvements to the vision system.
https://arxiv.org/abs/2405.09118
We introduce BEVRender, a novel learning-based approach for the localization of ground vehicles in Global Navigation Satellite System (GNSS)-denied off-road scenarios. These environments are typically challenging for conventional vision-based state estimation due to the lack of distinct visual landmarks and the instability of vehicle poses. To address this, BEVRender generates high-quality local bird's eye view (BEV) images of the local terrain. Subsequently, these images are aligned with a geo-referenced aerial map via template-matching to achieve accurate cross-view registration. Our approach overcomes the inherent limitations of visual inertial odometry systems and the substantial storage requirements of image-retrieval localization strategies, which are susceptible to drift and scalability issues, respectively. Extensive experimentation validates BEVRender's advancement over existing GNSS-denied visual localization methods, demonstrating notable enhancements in both localization accuracy and update frequency. The code for BEVRender will be made available soon.
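The cross-view registration step can be illustrated with plain normalized cross-correlation template matching; BEVRender's rendering network and any rotation search are omitted, so this shows only the alignment idea.

```python
import cv2
import numpy as np

def register_bev(bev, aerial):
    """bev and aerial: single-channel float32 images, with the aerial map
    at least as large as the BEV template. Returns the top-left pixel of
    the best match in the aerial map."""
    response = cv2.matchTemplate(aerial, bev, cv2.TM_CCOEFF_NORMED)
    _, _, _, max_loc = cv2.minMaxLoc(response)
    return max_loc                    # (x, y) of the best alignment
```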
https://arxiv.org/abs/2405.09001
With recent advances in deep learning, numerous algorithms have been developed to enhance video quality, reduce visual artefacts and improve perceptual quality. However, little research has been reported on the quality assessment of enhanced content - the evaluation of enhancement methods is often based on quality metrics that were designed for compression applications. In this paper, we propose a novel blind deep video quality assessment (VQA) method specifically for enhanced video content. It employs a new Recurrent Memory Transformer (RMT) based network architecture to obtain video quality representations, which is optimised through a novel content-quality-aware contrastive learning strategy based on a new database containing 13K training patches with enhanced content. The extracted quality representations are then combined through linear regression to generate video-level quality indices. The proposed method, RMT-BVQA, has been evaluated on the VDPVE (VQA Dataset for Perceptual Video Enhancement) database through a five-fold cross validation. The results show its superior correlation performance when compared to ten existing no-reference quality metrics.
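The final pooling step described above is plain linear regression from clip-level quality representations to a video-level index, roughly as follows; the RMT feature extractor and real training labels are assumed given, and the arrays here are dummies.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Dummy stand-ins: clip-level quality representations and subjective scores.
clip_feats = np.random.rand(100, 128)
subjective = np.random.rand(100) * 5

reg = LinearRegression().fit(clip_feats, subjective)  # fit per CV fold
video_score = reg.predict(clip_feats[:10]).mean()     # pool clips to one index
```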
https://arxiv.org/abs/2405.08621
Multi-objective combinatorial optimization (MOCO) problems are prevalent in various real-world applications. Most existing neural methods for MOCO problems rely solely on decomposition and utilize precise hypervolume to enhance diversity. However, these methods often approximate only limited regions of the Pareto front and spend excessive time on diversity enhancement because of ambiguous decomposition and time-consuming hypervolume calculation. To address these limitations, we design a Geometry-Aware Pareto set Learning algorithm named GAPL, which provides a novel geometric perspective for neural MOCO via a Pareto attention model based on hypervolume expectation maximization. In addition, we propose a hypervolume residual update strategy to enable the Pareto attention model to capture both local and non-local information of the Pareto set/front. We also design a novel inference approach to further improve quality of the solution set and speed up hypervolume calculation and local subset selection. Experimental results on three classic MOCO problems demonstrate that our GAPL outperforms state-of-the-art neural baselines via superior decomposition and efficient diversity enhancement.
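For reference, the quantity whose exact computation the paper sidesteps is straightforward in two dimensions; the minimization convention and reference point below are assumptions.

```python
def hypervolume_2d(front, ref):
    """front: list of (f1, f2) mutually non-dominated points (minimization);
    ref: reference point dominated by every point on the front."""
    pts = sorted(front)               # f1 ascending => f2 descending
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)  # slab of newly dominated area
        prev_f2 = f2
    return hv

# Example: hypervolume_2d([(1, 4), (2, 3)], ref=(5, 5)) == 7.0
```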
https://arxiv.org/abs/2405.08604
Underwater imaging often suffers from low quality due to factors affecting light propagation and absorption in water. To improve image quality, some underwater image enhancement (UIE) methods based on convolutional neural networks (CNN) and Transformer have been proposed. However, CNN-based UIE methods are limited in modeling long-range dependencies, and Transformer-based methods involve a large number of parameters and complex self-attention mechanisms, posing efficiency challenges. Considering computational complexity and severe underwater image degradation, a state space model (SSM) with linear computational complexity for UIE, named WaterMamba, is proposed. We propose spatial-channel omnidirectional selective scan (SCOSS) blocks comprising spatial-channel coordinate omnidirectional selective scan (SCCOSS) modules and a multi-scale feedforward network (MSFFN). The SCOSS block models pixel and channel information flow, addressing dependencies. The MSFFN facilitates information flow adjustment and promotes synchronized operations within SCCOSS modules. Extensive experiments showcase WaterMamba's cutting-edge performance with reduced parameters and computational resources, outperforming state-of-the-art methods on various datasets, validating its effectiveness and generalizability. The code will be released on GitHub after acceptance.
https://arxiv.org/abs/2405.08419
Pretrained language models (LMs) showcase significant capabilities in processing molecular text, while concurrently, message passing neural networks (MPNNs) demonstrate resilience and versatility in the domain of molecular science. Despite these advancements, we find that few studies have investigated the bidirectional interactions between molecular structures and their corresponding textual representations. Therefore, in this paper, we propose two strategies to evaluate whether information integration can enhance performance: contrastive learning, which involves utilizing an MPNN to supervise the training of the LM, and fusion, which exploits information from both models. Our empirical analysis reveals that the integration approaches exhibit superior performance compared to baselines when applied to smaller molecular graphs, while these integration approaches do not yield performance enhancements on large-scale graphs.
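The contrastive strategy can be illustrated with an InfoNCE-style loss in which embeddings of the same molecule from the two encoders form positive pairs; the paper's exact supervision direction and encoders are not reproduced here.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(lm_emb, mpnn_emb, temperature=0.1):
    """lm_emb, mpnn_emb: (batch, d) embeddings of the same molecules,
    from the language model and the MPNN respectively."""
    lm = F.normalize(lm_emb, dim=-1)
    gnn = F.normalize(mpnn_emb, dim=-1)
    logits = lm @ gnn.t() / temperature      # (batch, batch) similarities
    targets = torch.arange(lm.size(0))       # diagonal entries are positives
    return F.cross_entropy(logits, targets)
```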
https://arxiv.org/abs/2405.08334
As an important subtopic of image enhancement, color transfer aims to enhance the color scheme of a source image according to a reference one while preserving the semantic context. To implement color transfer, the palette-based color mapping framework was proposed. It is a classical solution that does not depend on complex semantic analysis to generate a new color scheme. However, the framework usually requires manual settings, reducing its practicality. The quality of traditional palette generation depends on the degree of color separation. In this paper, we propose a new palette-based color transfer method that can automatically generate a new color scheme. With a redesigned palette-based clustering method, pixels can be classified into different segments according to color distribution with better applicability. By combining deep learning-based image segmentation and a new color mapping strategy, color transfer can be implemented on foreground and background parts independently while maintaining semantic consistency. The experimental results indicate that our method exhibits significant advantages over peer methods in terms of natural realism, color consistency, generality, and robustness.
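As background for the palette-based framework, a classical palette can be produced by clustering pixel colors, as sketched below; the paper's redesigned clustering differs, so this only shows the baseline it improves on.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_palette(image_rgb, n_colors=5):
    """image_rgb: (H, W, 3) uint8 array. Returns the palette colors and a
    per-pixel label map assigning each pixel to a palette segment."""
    pixels = image_rgb.reshape(-1, 3).astype(np.float32)
    km = KMeans(n_clusters=n_colors, n_init=4, random_state=0).fit(pixels)
    palette = km.cluster_centers_.astype(np.uint8)    # (n_colors, 3)
    labels = km.labels_.reshape(image_rgb.shape[:2])  # segment per pixel
    return palette, labels
```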
https://arxiv.org/abs/2405.08263
In large-scale disaster events, the planning of optimal rescue routes depends on the object detection ability at the disaster scene, with one of the main challenges being the presence of dense and occluded objects. Existing methods, which are typically based on the RGB modality, struggle to distinguish targets with similar colors and textures in crowded environments and are unable to identify obscured objects. To this end, we first construct two multimodal dense and occluded vehicle detection datasets for large-scale events, utilizing RGB and height map modalities. Based on these datasets, we propose a multimodal collaboration network for dense and occluded vehicle detection, MuDet for short. MuDet hierarchically enhances the completeness of discriminable information within and across modalities and differentiates between simple and complex samples. MuDet includes three main modules: Unimodal Feature Hierarchical Enhancement (Uni-Enh), Multimodal Cross Learning (Mul-Lea), and the Hard-easy Discriminative (He-Dis) pattern. Uni-Enh and Mul-Lea enhance the features within each modality and facilitate the cross-integration of features from two heterogeneous modalities. He-Dis effectively separates densely occluded vehicle targets with significant intra-class differences and minimal inter-class differences by defining and thresholding confidence values, thereby suppressing the complex background. Experimental results on two re-labeled multimodal benchmark datasets, the 4K-SAI-LCS dataset and the ISPRS Potsdam dataset, demonstrate the robustness and generalization of MuDet. The code for this work is openly available at \url{this https URL}.
https://arxiv.org/abs/2405.08251