We propose In-Context Translation (ICT), a general learning framework that unifies visual recognition (e.g., semantic segmentation), low-level image processing (e.g., denoising), and conditional image generation (e.g., edge-to-image synthesis). Thanks to this unification, ICT significantly reduces the inductive bias inherent in designing models for specific tasks, and it maximizes mutual enhancement across similar tasks. However, unifying a large number of tasks is non-trivial due to their varied data formats and training pipelines. To this end, ICT introduces two designs. First, it standardizes the input-output data of different tasks into RGB image pairs; for example, semantic segmentation pairs an RGB image with its segmentation mask rendered in the same RGB format. This turns every task into a general translation task between two RGB images. Second, it standardizes the training of different tasks into a general in-context learning scheme, where "in-context" means the input comprises an example input-output pair of the target task and a query image. The learning objective is to generate the "missing" data paired with the query, so the implicit translation is between the query and the generated image. In experiments, ICT unifies ten vision tasks and shows impressive performance on their respective benchmarks. Notably, compared to competitors such as Painter and PromptDiffusion, ICT is more efficient and less costly to train, requiring only 4 RTX 3090 GPUs.
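The two designs above can be illustrated with a small sketch: standardize a segmentation mask into RGB via a palette, then assemble an in-context sample from an example pair and a query. The 2x2 grid layout and the helper names here are illustrative assumptions, not necessarily the paper's exact construction.

```python
import numpy as np

def mask_to_rgb(mask, palette):
    """Standardize a segmentation mask (H x W integer labels) into RGB,
    so that segmentation becomes an RGB-to-RGB translation task."""
    return palette[mask].astype(np.float32)

def make_in_context_sample(example_in, example_out, query, target=None):
    """Assemble one in-context sample: an example input-output pair of
    the target task plus a query image; the model must generate the
    "missing" image paired with the query (bottom-right cell).
    All arrays are H x W x 3 RGB images in [0, 1]."""
    h, w, _ = query.shape
    canvas = np.zeros((2 * h, 2 * w, 3), dtype=np.float32)
    canvas[:h, :w] = example_in      # task example: input
    canvas[:h, w:] = example_out     # task example: output
    canvas[h:, :w] = query           # query image
    if target is not None:
        canvas[h:, w:] = target      # ground truth to be generated
    return canvas

# toy usage: a 3-class palette and 4x4 images
palette = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=np.float32)
mask = np.random.randint(0, 3, (4, 4))
rgb_mask = mask_to_rgb(mask, palette)
img = np.random.rand(4, 4, 3).astype(np.float32)
sample = make_in_context_sample(img, rgb_mask, img, rgb_mask)
print(sample.shape)  # (8, 8, 3)
```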
https://arxiv.org/abs/2404.09633
The task of chemical reaction predictions (CRPs) plays a pivotal role in advancing drug discovery and materials science. However, its effectiveness is constrained by the vast and uncertain chemical reaction space and by challenges in capturing reaction selectivity, particularly due to existing methods' limited exploitation of the data's inherent knowledge. To address these challenges, we introduce a data-curated self-feedback knowledge elicitation approach. The method starts with iterative optimization of molecular representations and facilitates the extraction of knowledge on chemical reaction types (RTs). We then employ adaptive prompt learning to infuse this prior knowledge into the large language model (LLM). As a result, we achieve significant enhancements: a 14.2% increase in retrosynthesis prediction accuracy, a 74.2% rise in reagent prediction accuracy, and an expansion of the model's capability for handling multi-task chemical reactions. This research offers a novel paradigm for knowledge elicitation in scientific research and showcases the untapped potential of LLMs in CRPs.
https://arxiv.org/abs/2404.09606
Automatic 3D facial texture generation has attracted significant interest recently. Existing approaches either do not support the traditional physically based rendering (PBR) pipeline or rely on 3D data captured by a Light Stage. Our key contribution is a progressive latent space refinement approach that bootstraps from 3D Morphable Model (3DMM)-based texture maps, generated from facial images, to produce high-quality and diverse PBR textures, including albedo, normal, and roughness maps. It starts by enhancing Generative Adversarial Networks (GANs) for text-guided and diverse texture generation. To this end, we design a self-supervised paradigm that overcomes the reliance on ground-truth 3D textures and trains the generative model with only entangled texture maps. Besides, we foster mutual enhancement between GANs and Score Distillation Sampling (SDS): SDS boosts GANs with more generative modes, while GANs promote more efficient optimization of SDS. Furthermore, we introduce an edge-aware SDS for multi-view-consistent facial structure. Experiments demonstrate that our method outperforms existing 3D texture generation methods in photo-realistic quality, diversity, and efficiency.
https://arxiv.org/abs/2404.09540
Low-dose computed tomography (LDCT) has become the technology of choice for diagnostic medical imaging, given its lower radiation dose compared to standard CT, despite the increased image noise that can affect diagnostic accuracy. To address this, advanced deep learning-based LDCT denoising algorithms have been developed, primarily using Convolutional Neural Networks (CNNs) or Transformer networks with the Unet architecture. This architecture enhances image detail by integrating feature maps from the encoder and decoder via skip connections. However, current methods often overlook enhancements to the Unet architecture itself, focusing instead on optimizing the encoder and decoder structures. This can be problematic because of the significant differences in feature map characteristics between the encoder and decoder, where simple fusion strategies may not reconstruct images effectively. In this paper, we introduce WiTUnet, a novel LDCT image denoising method that utilizes nested, dense skip pathways instead of traditional skip connections to improve feature integration. WiTUnet also incorporates a windowed Transformer structure that processes images in smaller, non-overlapping segments, reducing the computational load. Additionally, a Local Image Perception Enhancement (LiPe) module in both the encoder and decoder replaces the standard multi-layer perceptron (MLP) in the Transformer blocks, enhancing local feature capture and representation. In extensive experimental comparisons, WiTUnet demonstrates superior performance over existing methods on key metrics such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Root Mean Square Error (RMSE), significantly improving noise removal and image quality.
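The windowed Transformer idea rests on a simple partition of the feature map into non-overlapping windows, so attention is computed within each window instead of across the whole image. A minimal numpy sketch of that partition (not WiTUnet's actual code; H and W are assumed divisible by the window size):

```python
import numpy as np

def window_partition(x, win):
    """Split a feature map (H x W x C) into non-overlapping win x win
    windows; windowed attention then runs within each window, cutting
    the cost of global attention over all H*W positions."""
    h, w, c = x.shape
    x = x.reshape(h // win, win, w // win, win, c)
    # -> (num_windows, win, win, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win, win, c)

def window_reverse(windows, win, h, w):
    """Inverse of window_partition: reassemble the full feature map."""
    c = windows.shape[-1]
    x = windows.reshape(h // win, w // win, win, win, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)

x = np.arange(8 * 8 * 2, dtype=np.float32).reshape(8, 8, 2)
wins = window_partition(x, 4)
print(wins.shape)                                     # (4, 4, 4, 2)
print(np.allclose(window_reverse(wins, 4, 8, 8), x))  # True
```

The round trip being lossless is the point: windowing changes only where attention looks, not what the feature map contains.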
https://arxiv.org/abs/2404.09533
Multi-modal image fusion aims to combine information from different modalities to create a single image with comprehensive information and detailed textures. However, fusion models based on convolutional neural networks are limited in capturing global image features because of their focus on local convolution operations. Transformer-based models, while excelling at global feature modeling, face computational challenges stemming from their quadratic complexity. Recently, the Selective Structured State Space Model has exhibited significant potential for long-range dependency modeling with linear complexity, offering a promising avenue out of this dilemma. In this paper, we propose FusionMamba, a novel dynamic feature enhancement method for multimodal image fusion with Mamba. Specifically, we devise an improved, efficient Mamba model for image fusion that integrates an efficient visual state space model with dynamic convolution and channel attention. This refined model not only retains Mamba's performance and global modeling capability but also diminishes channel redundancy while strengthening local enhancement capability. Additionally, we devise a dynamic feature fusion module (DFFM) comprising two dynamic feature enhancement modules (DFEM) and a cross-modality fusion Mamba module (CMFM). The former serves dynamic texture enhancement and dynamic difference perception, whereas the latter enhances correlated features between modalities and suppresses redundant inter-modal information. FusionMamba yields state-of-the-art (SOTA) performance across multimodal medical image fusion tasks (CT-MRI, PET-MRI, SPECT-MRI), an infrared and visible image fusion task (IR-VIS), and a multimodal biomedical image fusion dataset (GFP-PC), demonstrating our model's generalization ability. The code for FusionMamba is available at this https URL.
https://arxiv.org/abs/2404.09498
Ensembling different large language models (LLMs) to unleash their complementary potential and harness their individual strengths is highly valuable. Nevertheless, vocabulary discrepancies among various LLMs have constrained previous studies to either selecting or blending completely generated outputs. This limitation hinders the dynamic correction and enhancement of outputs during the generation process, resulting in a limited capacity for effective ensemble. To address this issue, we propose a novel method to Ensemble LLMs via Vocabulary Alignment (EVA). EVA bridges the lexical gap among various LLMs, enabling meticulous ensemble at each generation step. Specifically, we first learn mappings between the vocabularies of different LLMs with the assistance of overlapping tokens. Subsequently, these mappings are employed to project output distributions of LLMs into a unified space, facilitating a fine-grained ensemble. Finally, we design a filtering strategy to exclude models that generate unfaithful tokens. Experimental results on commonsense reasoning, arithmetic reasoning, machine translation, and data-to-text generation tasks demonstrate the superiority of our approach compared with individual LLMs and previous ensemble methods conducted on complete outputs. Further analyses confirm that our approach can leverage knowledge from different language models and yield consistent improvement.
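The per-step ensemble can be sketched as follows: each model's next-token distribution is projected through a learned vocabulary mapping into a unified space, and the projected distributions are fused. This is an illustrative sketch of the idea, not the paper's exact formulation; the mapping matrices here are random stand-ins for the learned ones.

```python
import numpy as np

def ensemble_step(dists, mappings, weights=None):
    """One generation step of a vocabulary-aligned ensemble.
    dists[i] is model i's distribution over its own V_i-token vocab;
    mappings[i] is a (V_i x V_unified) row-stochastic matrix projecting
    it into a shared space, where the distributions are averaged and
    the next token chosen."""
    if weights is None:
        weights = np.ones(len(dists)) / len(dists)
    projected = [w * (d @ M) for d, M, w in zip(dists, mappings, weights)]
    fused = np.sum(projected, axis=0)
    fused /= fused.sum()                # renormalize the fused distribution
    return int(np.argmax(fused)), fused

# toy example: two models with 4- and 5-token vocabs, unified vocab of 3
rng = np.random.default_rng(0)
d1, d2 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(5))
M1 = rng.random((4, 3)); M1 /= M1.sum(axis=1, keepdims=True)
M2 = rng.random((5, 3)); M2 /= M2.sum(axis=1, keepdims=True)
token, fused = ensemble_step([d1, d2], [M1, M2])
print(fused.sum())  # 1.0 (up to float error)
```

The filtering strategy from the abstract would simply drop a model's term from `projected` at steps where it is judged unfaithful.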
https://arxiv.org/abs/2404.09492
The eighth AI City Challenge highlighted the convergence of computer vision and artificial intelligence in areas like retail, warehouse settings, and Intelligent Traffic Systems (ITS), presenting significant research opportunities. The 2024 edition featured five tracks and attracted unprecedented interest from 726 teams across 47 countries and regions. Track 1 dealt with multi-target multi-camera (MTMC) people tracking, featuring significant increases in camera count, number of characters, 3D annotations, and camera matrices, alongside new rules for 3D tracking and encouragement of online tracking algorithms. Track 2 introduced dense video captioning for traffic safety, focusing on pedestrian accidents and using multi-camera feeds to improve insights for insurance and prevention. Track 3 required teams to classify driver actions in a naturalistic driving analysis. Track 4 explored fish-eye camera analytics using the FishEye8K dataset. Track 5 focused on detecting violations of motorcycle helmet rules. The challenge used two leaderboards to showcase methods, with participants setting new benchmarks, some surpassing existing state-of-the-art results.
https://arxiv.org/abs/2404.09432
In image fusion tasks, images from different sources possess distinct characteristics. This has driven the development of numerous methods that explore better ways of fusing such images while preserving their respective characteristics. Mamba, a state space model, emerged in the field of natural language processing, and many recent studies have attempted to extend it to vision tasks. However, because images differ in nature from causal language sequences, the limited state capacity of Mamba weakens its ability to model image information. Additionally, Mamba's sequence modeling covers only spatial information and cannot effectively capture the rich spectral information in images. Motivated by these challenges, we customize and improve a vision Mamba network designed for the image fusion task. Specifically, we propose the local-enhanced vision Mamba block, dubbed LEVM. The LEVM block improves the network's perception of local information and simultaneously learns local and global spatial information. Furthermore, we propose a state-sharing technique to enhance spatial details and integrate spatial and spectral information. The overall network is a multi-scale structure based on vision Mamba, called LE-Mamba. Extensive experiments show that the proposed methods achieve state-of-the-art results on multispectral pansharpening and multispectral/hyperspectral image fusion datasets, demonstrating the effectiveness of the proposed approach. Code will be made available.
https://arxiv.org/abs/2404.09293
In recent years, the integration of large language models (LLMs) has revolutionized the field of robotics, enabling robots to communicate, understand, and reason with human-like proficiency. This paper explores the multifaceted impact of LLMs on robotics, addressing key challenges and opportunities for leveraging these models across various domains. By categorizing and analyzing LLM applications within core robotics elements -- communication, perception, planning, and control -- we aim to provide actionable insights for researchers seeking to integrate LLMs into their robotic systems. Our investigation focuses on LLMs developed post-GPT-3.5, primarily in text-based modalities while also considering multimodal approaches for perception and control. We offer comprehensive guidelines and examples for prompt engineering, facilitating beginners' access to LLM-based robotics solutions. Through tutorial-level examples and structured prompt construction, we illustrate how LLM-guided enhancements can be seamlessly integrated into robotics applications. This survey serves as a roadmap for researchers navigating the evolving landscape of LLM-driven robotics, offering a comprehensive overview and practical guidance for harnessing the power of language models in robotics development.
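Structured prompt construction of the kind the survey advocates can be sketched as a small template builder: fix the robot's role, describe the observable state, enumerate the callable skills, then state the task and the required output format. The template and names below are illustrative assumptions, not taken from the paper.

```python
def build_robot_prompt(role, state, skills, task):
    """Build a structured prompt for an LLM-driven robot planner.
    skills is a list of (name, signature) pairs the LLM may call;
    constraining the output to these calls keeps responses parseable."""
    skill_list = "\n".join(f"- {name}({sig})" for name, sig in skills)
    return (
        f"You are {role}.\n"
        f"Current state: {state}\n"
        f"You can call exactly these skills:\n{skill_list}\n"
        f"Task: {task}\n"
        "Respond with one skill call per line, nothing else."
    )

prompt = build_robot_prompt(
    role="a mobile manipulator controller",
    state="gripper empty, red cube at (0.4, 0.1), bin at (0.0, 0.5)",
    skills=[("move_to", "x, y"), ("grasp", ""), ("release", "")],
    task="put the red cube in the bin",
)
print(prompt)
```

The closed skill vocabulary and the fixed output format are what make the LLM's plan executable by the control stack without free-text parsing.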
https://arxiv.org/abs/2404.09228
Navigation for thoracoabdominal puncture surgery is used to locate the needle entry point on the patient's body surface. The traditional reflective-ball navigation method struggles to position the entry point on the soft, irregular, smooth chest and abdomen. Moreover, because structured light technology finds few clear characteristic points on the body surface, it is difficult to identify and locate arbitrary needle insertion points. Given the high stability and high accuracy required of surgical navigation, this paper proposes a novel multi-modal 3D small-object medical marker detection method, which identifies the center of a small single ring as the needle insertion point. The method leverages Fourier transform enhancement to augment the dataset, enrich image details, and strengthen the network's capability. It extracts the Region of Interest (ROI) of the feature image from both enhanced and original images and then generates a mask map. Subsequently, the point cloud of the ROI is obtained from the depth map through registration and ROI point cloud contour fitting. In addition, the method employs the Tukey loss for optimal precision. Experimental results show that the proposed method not only achieves high-precision, high-stability positioning but also enables the positioning of arbitrary needle insertion points.
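The Tukey (biweight) loss mentioned above is a standard robust loss: quadratic-like near zero, but saturating at a constant for large residuals, so outliers stop contributing gradient. A minimal implementation (the tuning constant c = 4.685 is the classic 95%-efficiency choice for unit-variance Gaussian noise, not a value stated in the paper):

```python
import numpy as np

def tukey_biweight(residual, c=4.685):
    """Tukey's biweight (bisquare) loss.
    rho(r) = (c^2 / 6) * (1 - (1 - (r/c)^2)^3)  for |r| <= c,
    rho(r) = c^2 / 6                            for |r| >  c,
    so large residuals (outliers) saturate instead of dominating."""
    r = np.abs(np.asarray(residual, dtype=np.float64))
    out = np.full_like(r, c * c / 6.0)          # saturated value
    inside = r <= c
    out[inside] = (c * c / 6.0) * (1.0 - (1.0 - (r[inside] / c) ** 2) ** 3)
    return out

losses = tukey_biweight(np.array([0.0, 1.0, 100.0]))
print(losses)  # zero loss at r=0; the r=100 outlier is capped at c^2/6
```

This saturation is why the loss suits marker localization: a few grossly wrong point correspondences cannot drag the fit.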
https://arxiv.org/abs/2404.08990
Degraded underwater images decrease the accuracy of underwater object detection. However, existing underwater image enhancement methods mainly focus on improving visual quality metrics, which may not benefit underwater object detection and may even lead to serious performance degradation. To alleviate this problem, we propose a bidirectionally guided method for underwater object detection, referred to as BG-YOLO. The network is organized as two parallel branches: an enhancement branch, consisting of a cascade of an image enhancement subnet and an object detection subnet, and a detection branch, consisting only of a detection subnet. A feature-guided module connects the shallow convolution layers of the two branches. When training the enhancement branch, its object detection subnet guides the image enhancement subnet to be optimized in the direction most conducive to the detection task. The shallow feature map of the trained enhancement branch is then fed to the feature-guided module, constraining the optimization of the detection branch through a consistency loss and prompting the detection branch to learn more detailed information about the objects, thereby refining detection performance. At inference time, only the detection branch is retained, so no additional computational cost is introduced. Extensive experiments demonstrate that the proposed method significantly improves detector performance in severely degraded underwater scenes while maintaining a remarkable detection speed.
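The feature-guided constraint can be sketched as a simple consistency loss between the two branches' shallow feature maps: the detection branch is pulled toward the frozen enhancement branch's features. Plain MSE is used here as an assumption; the paper's exact consistency loss may differ.

```python
import numpy as np

def consistency_loss(f_det, f_enh):
    """Consistency loss between the detection branch's shallow feature
    map f_det and the (frozen) enhancement branch's feature map f_enh.
    Minimizing it pushes the detector to reproduce the detail the
    enhancement subnet was optimized to expose."""
    return float(np.mean((f_det - f_enh) ** 2))

rng = np.random.default_rng(0)
f_enh = rng.random((16, 16, 32))                       # teacher features
f_det = f_enh + 0.1 * rng.standard_normal((16, 16, 32))  # student features
print(consistency_loss(f_det, f_enh))   # small: branches nearly agree
print(consistency_loss(np.zeros_like(f_enh), f_enh))  # large: no agreement
```

Because the loss touches only the detection branch's shallow layers during training, nothing extra runs at inference, matching the abstract's no-added-cost claim.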
https://arxiv.org/abs/2404.08979
Localizing text in low-light environments is challenging due to visual degradations. A straightforward solution is a two-stage pipeline with low-light image enhancement (LLE) as the initial step followed by a detector, but LLE is designed primarily for human vision rather than machines and can accumulate errors. In this work, we propose an efficient and effective single-stage approach for localizing text in the dark that circumvents the need for LLE. We introduce a constrained learning module as an auxiliary mechanism during training of the text detector. This module guides the detector to preserve textual spatial features amidst feature map resizing, minimizing the loss of spatial information in text under low-light visual degradations. Specifically, we incorporate spatial reconstruction and spatial semantic constraints within the module to ensure the detector acquires essential positional and contextual-range knowledge. Our approach enhances the detector's ability to identify text's local topological features using a dynamic snake feature pyramid network, and it adopts a bottom-up contour shaping strategy with a novel rectangular accumulation technique for accurate delineation of streamlined text features. In addition, we present a comprehensive low-light dataset of arbitrary-shaped text, encompassing diverse scenes and languages. Notably, our method achieves state-of-the-art results on this low-light dataset and exhibits comparable performance on standard normal-light datasets. The code and dataset will be released.
https://arxiv.org/abs/2404.08965
The relentless pursuit of enhancing Large Language Models (LLMs) has led to the advent of Super Retrieval-Augmented Generation (Super RAGs), a novel approach designed to elevate the performance of LLMs by integrating external knowledge sources with minimal structural modifications. This paper presents the integration of Super RAGs into the Mistral 8x7B v1, a state-of-the-art LLM, and examines the resultant improvements in accuracy, speed, and user satisfaction. Our methodology uses a fine-tuned instruct model setup and a cache tuning fork system, ensuring efficient and relevant data retrieval. The evaluation, conducted over several epochs, demonstrates significant enhancements across all metrics. The findings suggest that Super RAGs can effectively augment LLMs, paving the way for more sophisticated and reliable AI systems. This research contributes to the field by providing empirical evidence of the benefits of Super RAGs and offering insights into their potential applications.
https://arxiv.org/abs/2404.08940
Efficient and accurate camouflaged object detection (COD) poses a challenge in the field of computer vision. Recent approaches have explored the utility of edge information for network co-supervision, achieving notable advancements. However, these approaches introduce an extra branch for complex edge extraction, complicating the model architecture and increasing computational demands. Addressing this issue, our work replicates the effect that an animal's camouflage is easily revealed under a shifting spotlight, and leverages it for network co-supervision to form a compact yet efficient single-branch network, the Co-Supervised Spotlight Shifting Network (CS$^3$Net). The spotlight shifting strategy allows CS$^3$Net to learn an additional prior within a single-branch framework, obviating the need for a resource-demanding multi-branch design. To leverage the prior from spotlight shifting co-supervision, we propose a Shadow Refinement Module (SRM) and Projection Aware Attention (PAA) for feature refinement and enhancement. To ensure the continuity of multi-scale feature aggregation, we utilize the Extended Neighbor Connection Decoder (ENCD) to generate the final predictions. Empirical evaluations on public datasets confirm that CS$^3$Net offers an optimal balance between efficiency and performance: it achieves a 32.13% reduction in Multiply-Accumulate operations (MACs) compared to leading efficient COD models while delivering superior performance.
https://arxiv.org/abs/2404.08936
As a newly emerging class of deep generative models, diffusion models have achieved state-of-the-art results in many fields, including computer vision, natural language processing, and molecule design. The remote sensing community has also noticed the powerful ability of diffusion models and quickly applied them to a variety of image processing tasks. Given the rapid growth of diffusion model research in remote sensing, it is necessary to conduct a comprehensive review of existing diffusion model-based remote sensing papers, to help researchers recognize the potential of diffusion models and to provide directions for further exploration. Specifically, this paper first introduces the theoretical background of diffusion models and then systematically reviews their applications in remote sensing, including image generation, enhancement, and interpretation. Finally, the limitations of existing remote sensing diffusion models and research directions worthy of further exploration are discussed and summarized.
https://arxiv.org/abs/2404.08926
Remote sensing change detection (CD) is a pivotal technique that pinpoints changes on a global scale based on multi-temporal images. With the recent expansion of deep learning, supervised deep learning-based CD models have shown satisfactory performance. However, CD sample labeling is very time-consuming, as it is densely labeled and requires expert knowledge. To alleviate this problem, we introduce ChangeAnywhere, a novel CD sample generation method using a semantic latent diffusion model and single-temporal images. Specifically, ChangeAnywhere leverages the relative ease of acquiring large single-temporal semantic datasets to generate large-scale, diverse, and semantically annotated bi-temporal CD datasets. ChangeAnywhere captures the two essentials of CD samples: change implies semantically different, and non-change implies reasonable change under the same semantic constraints. Based on the proposed method, we generated ChangeAnywhere-100K, the largest synthetic CD dataset, with 100,000 pairs of CD samples. As demonstrated by transfer experiments, ChangeAnywhere-100K significantly improved both zero-shot and few-shot performance of various deep learning-based CD models on two CD benchmark datasets. This paper delineates the enormous potential of ChangeAnywhere for CD sample generation and demonstrates the subsequent enhancement of model performance, making ChangeAnywhere a potent tool for remote sensing CD. All code and pre-trained models will be available at this https URL.
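The first of the two essentials, change implies semantically different, reduces to a one-line rule once two semantic maps for the same location exist: a pixel is labeled "change" exactly where its class differs between the two dates. A minimal sketch of that labeling step (the generation of the bi-temporal images themselves is the diffusion model's job and is not shown):

```python
import numpy as np

def change_label(sem_t1, sem_t2):
    """Derive a binary change label from two semantic maps of the same
    scene: 1 where the class differs between dates, 0 elsewhere.
    Non-change pixels may still vary in appearance, as long as they keep
    the same semantics."""
    return (sem_t1 != sem_t2).astype(np.uint8)

sem_t1 = np.array([[0, 1], [2, 2]])  # e.g., 0=water, 1=bare soil, 2=building
sem_t2 = np.array([[0, 2], [2, 1]])
print(change_label(sem_t1, sem_t2))
# [[0 1]
#  [0 1]]
```

Pairing this label with diffusion-generated image pairs is what turns a single-temporal semantic dataset into annotated bi-temporal CD training data.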
https://arxiv.org/abs/2404.08892
During endovascular interventions, physicians have to perform accurate and immediate operations based on the available real-time information, such as the shape and position of guidewires observed on fluoroscopic images, haptic information, and the patient's physiological signals. Real-time, accurate guidewire segmentation and tracking can therefore enhance the visualization of guidewires and provide visual feedback for physicians during interventions, as well as for robot-assisted interventions. Nevertheless, this task is challenged by elongated, deformable structures that appear with low contrast in noisy fluoroscopic image sequences. To address these issues, we propose a two-stage deep learning framework for real-time guidewire segmentation and tracking. In the first stage, a YOLOv5s detector is trained on original X-ray images as well as synthetic ones and outputs bounding boxes of candidate guidewires. More importantly, a refinement module based on spatiotemporal constraints is incorporated to robustly localize the guidewire and remove false detections. In the second stage, a novel and efficient network segments the guidewire within each detected bounding box. The network contains two major modules: a Hessian-based enhancement embedding module and a dual self-attention module. Quantitative and qualitative evaluations on clinical intra-operative images demonstrate that the proposed approach significantly outperforms our baselines as well as the current state of the art, and shows higher robustness to low-quality images.
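The spatiotemporal refinement can be sketched as a temporal-consistency filter over the detector's boxes: a guidewire moves little between consecutive fluoroscopic frames, so candidate boxes that barely overlap the previous frame's box are discarded as false detections. This is an illustration of the constraint's idea, not the paper's exact module; the IoU threshold is an assumed parameter.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def refine_detections(boxes, prev_box, min_iou=0.3):
    """Keep only detections temporally consistent with the previous
    frame's box, and return the best-overlapping one (or None if every
    candidate looks like a false detection)."""
    keep = [b for b in boxes if iou(b, prev_box) >= min_iou]
    if not keep:
        return None
    return max(keep, key=lambda b: iou(b, prev_box))

prev = (10, 10, 50, 50)
boxes = [(12, 11, 52, 49), (200, 200, 240, 240)]   # second box: false alarm
print(refine_detections(boxes, prev))  # (12, 11, 52, 49)
```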
https://arxiv.org/abs/2404.08805
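The refinement stage above exploits spatiotemporal constraints to suppress false detections. A minimal sketch of one such constraint, keeping only the detection most consistent (by IoU) with the previous frame's track; the function names, box format, and threshold are assumptions, not the paper's actual module:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def refine(detections, prev_box, thr=0.3):
    """Temporal constraint: discard boxes that overlap too little with
    the previous frame's box, then keep the best-overlapping survivor.
    Returns None when every detection is rejected as a false positive."""
    candidates = [d for d in detections if iou(d, prev_box) >= thr]
    return max(candidates, key=lambda d: iou(d, prev_box), default=None)
```

In practice such a gate would be combined with the detector's confidence scores and a motion model, since a guidewire moves only a few pixels between consecutive fluoroscopic frames.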
The use of machine learning in fluid dynamics is becoming more common to expedite the computation when solving forward and inverse problems of partial differential equations. Yet, a notable challenge with existing convolutional neural network (CNN)-based methods for data fidelity enhancement is their reliance on specific low-fidelity data patterns and distributions during the training phase. In addition, the CNN-based method essentially treats the flow reconstruction task as a computer vision task that prioritizes element-wise precision but lacks a physical and mathematical explanation. This dependence can dramatically affect the models' effectiveness in real-world scenarios, especially when the low-fidelity input deviates from the training data or contains noise not accounted for during training. The introduction of diffusion models in this context shows promise for improving performance and generalizability. Unlike direct mapping from a specific low-fidelity to a high-fidelity distribution, diffusion models learn to transition from any low-fidelity distribution towards a high-fidelity one. Our proposed model, Physics-informed Residual Diffusion, demonstrates the capability to elevate the quality of data from standard low-fidelity inputs, from low-fidelity inputs with injected Gaussian noise, and from randomly collected samples. By integrating physics-based insights into the objective function, it further refines the accuracy and the fidelity of the inferred high-quality data. Experimental results have shown that our approach can effectively reconstruct high-quality outcomes for two-dimensional turbulent flows from a range of low-fidelity input conditions without requiring retraining.
https://arxiv.org/abs/2404.08412
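The abstract above integrates physics-based insights into the objective function. A minimal sketch of what such a term could look like for incompressible 2-D flow, penalizing the continuity (divergence) residual on top of a data-fidelity MSE; the function names, the periodic-boundary finite-difference stencil, and the weighting `lam` are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def divergence(u: np.ndarray, v: np.ndarray, dx: float = 1.0) -> np.ndarray:
    """Central-difference divergence of a 2-D velocity field,
    assuming periodic boundary conditions (np.roll wraps around)."""
    du_dx = (np.roll(u, -1, axis=1) - np.roll(u, 1, axis=1)) / (2 * dx)
    dv_dy = (np.roll(v, -1, axis=0) - np.roll(v, 1, axis=0)) / (2 * dx)
    return du_dx + dv_dy

def physics_informed_loss(pred_u, pred_v, true_u, true_v, lam=0.1):
    """Data-fidelity MSE plus a penalty on the continuity residual,
    nudging reconstructions towards divergence-free velocity fields."""
    mse = np.mean((pred_u - true_u) ** 2 + (pred_v - true_v) ** 2)
    residual = np.mean(divergence(pred_u, pred_v) ** 2)
    return mse + lam * residual
```

The physics term is what distinguishes this objective from the purely element-wise losses the abstract criticizes: a reconstruction can score well pixel-wise yet violate mass conservation, and the residual penalty makes that violation costly.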
Driven by the success of Masked Language Modeling (MLM), the realm of self-supervised learning for computer vision has been invigorated by the central role of Masked Image Modeling (MIM) in driving recent breakthroughs. Notwithstanding the achievements of MIM across various downstream tasks, its overall efficiency is occasionally hampered by the lengthy duration of the pre-training phase. This paper presents the perspective that optimizing masked tokens can address this prevailing issue. Initially, we delve into an exploration of the inherent properties that a masked token ought to possess. Among these properties, we principally dedicate ourselves to articulating and emphasizing the `data singularity' attribute inherent in masked tokens. Through a comprehensive analysis of the heterogeneity between masked tokens and visible tokens within pre-trained models, we propose a novel approach termed masked token optimization (MTO), specifically designed to improve model efficiency through weight recalibration and the enhancement of the key property of masked tokens. The proposed method serves as an adaptable solution that seamlessly integrates into any MIM approach that leverages masked tokens. As a result, MTO achieves a considerable improvement in pre-training efficiency, resulting in an approximately 50% reduction in the pre-training epochs required to attain the converged performance of recent approaches.
https://arxiv.org/abs/2404.08330
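The analysis above examines the heterogeneity between masked and visible tokens in pre-trained models. A minimal sketch of one way such heterogeneity could be quantified, via cosine similarity between the mean masked-token and mean visible-token embeddings; this metric and the function name are illustrative assumptions, not the paper's actual analysis or the MTO recalibration itself:

```python
import numpy as np

def masked_visible_similarity(masked: np.ndarray, visible: np.ndarray) -> float:
    """Cosine similarity between the mean masked-token embedding and the
    mean visible-token embedding (rows = tokens, columns = features).
    Values well below 1 indicate heterogeneity between the two groups,
    in the spirit of the 'data singularity' attribute."""
    m = masked.mean(axis=0)
    v = visible.mean(axis=0)
    return float(m @ v / (np.linalg.norm(m) * np.linalg.norm(v) + 1e-12))
```

A diagnostic like this only measures the gap; the paper's contribution is acting on it, recalibrating weights so that masked tokens retain their distinct role during pre-training.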
Previous language model pre-training methods have uniformly applied a next-token prediction loss to all training tokens. Challenging this norm, we posit that "Not all tokens in a corpus are equally important for language model training". Our initial analysis delves into the token-level training dynamics of language models, revealing distinct loss patterns for different tokens. Leveraging these insights, we introduce a new language model called Rho-1. Unlike traditional LMs that learn to predict every next token in a corpus, Rho-1 employs Selective Language Modeling (SLM), which selectively trains on useful tokens that align with the desired distribution. This approach involves scoring pretraining tokens using a reference model, and then training the language model with a focused loss on tokens with higher excess loss. When continually pretrained on the 15B-token OpenWebMath corpus, Rho-1 yields an absolute improvement in few-shot accuracy of up to 30% across 9 math tasks. After fine-tuning, Rho-1-1B and 7B achieved state-of-the-art results of 40.6% and 51.8% on the MATH dataset, respectively, matching DeepSeekMath with only 3% of the pretraining tokens. Furthermore, when pretraining on 80B general tokens, Rho-1 achieves a 6.8% average enhancement across 15 diverse tasks, increasing both the efficiency and the performance of language model pre-training.
https://arxiv.org/abs/2404.07965
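The selection rule described above, scoring tokens by excess loss against a reference model and training only on the highest-scoring ones, can be sketched as follows. The function name and `keep_ratio` parameter are assumptions for illustration; the paper's actual scoring and batching details may differ:

```python
import numpy as np

def slm_token_mask(model_loss: np.ndarray, ref_loss: np.ndarray,
                   keep_ratio: float = 0.6) -> np.ndarray:
    """Selective Language Modeling: score each token by its excess loss
    (current-model loss minus reference-model loss) and keep only the
    highest-scoring fraction for the training objective."""
    excess = model_loss - ref_loss
    k = max(1, int(round(len(excess) * keep_ratio)))
    threshold = np.sort(excess)[-k]  # k-th largest excess loss
    return excess >= threshold
```

Tokens the reference model already predicts well (low excess loss) are masked out of the loss, concentrating gradient signal on the tokens that still carry learnable, distribution-aligned information.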