As human-machine interaction continues to evolve, the capacity for environmental perception is becoming increasingly crucial. Integrating the two most common types of sensor data, images and point clouds, can improve detection accuracy. However, no existing model can simultaneously detect an object's position in both the point cloud and the image and establish the correspondence between the two detections. This information is invaluable for human-machine interaction, opening new possibilities for its enhancement. To this end, this paper introduces an end-to-end Consistency Object Detection (COD) algorithm framework that requires only a single forward inference to simultaneously obtain an object's position in both the point cloud and the image and to establish their correlation. Furthermore, to assess the accuracy of the object correspondence between point clouds and images, this paper proposes a new evaluation metric, Consistency Precision (CP). To verify the effectiveness of the proposed framework, extensive experiments were conducted on the KITTI and DAIR-V2X datasets. The study also examined, relative to existing post-processing methods, how the proposed consistency detection method performs on images when the calibration parameters between the images and point clouds are perturbed. The experimental results demonstrate that the proposed method exhibits excellent detection performance and robustness, achieving end-to-end consistency detection. The source code will be made publicly available at this https URL.
https://arxiv.org/abs/2405.01258
We propose TRAMBA, a hybrid Transformer and Mamba architecture for acoustic and bone-conduction speech enhancement, suitable for mobile and wearable platforms. Bone-conduction speech enhancement has been impractical to adopt on mobile and wearable platforms for several reasons: (i) data collection is labor-intensive, resulting in scarcity; and (ii) there is a performance gap between state-of-the-art models with memory footprints of hundreds of MBs and methods better suited for resource-constrained systems. To adapt TRAMBA to vibration-based sensing modalities, we pre-train it on widely available acoustic speech datasets; users then fine-tune with a small amount of bone-conduction data. TRAMBA outperforms state-of-the-art GANs by up to 7.3% in PESQ and 1.8% in STOI, with an order-of-magnitude smaller memory footprint and an inference speedup of up to 465 times. We integrate TRAMBA into real systems and show that it (i) improves the battery life of wearables by up to 160% by requiring less data sampling and transmission; (ii) generates higher-quality voice in noisy environments than over-the-air speech; and (iii) requires a memory footprint of less than 20.0 MB.
https://arxiv.org/abs/2405.01242
Large Language Models (LLMs) have achieved remarkable success across diverse tasks, yet they remain vulnerable to adversarial attacks, notably the well-documented \textit{jailbreak} attack. Recently, the Greedy Coordinate Gradient (GCG) attack has demonstrated efficacy in exploiting this vulnerability by optimizing adversarial prompts through a combination of gradient heuristics and greedy search. However, the efficiency of this attack has become a bottleneck in the attacking process. To mitigate this limitation, in this paper we rethink the generation of adversarial prompts through an optimization lens, aiming to stabilize the optimization process and harness more heuristic insights from previous iterations. Specifically, we introduce the \textbf{M}omentum \textbf{A}ccelerated G\textbf{C}G (\textbf{MAC}) attack, which incorporates a momentum term into the gradient heuristic. Experimental results showcase the notable enhancement achieved by MAC in gradient-based attacks on aligned language models. Our code is available at this https URL.
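The momentum idea described above can be sketched in a few lines. This is an illustrative simplification, not the authors' implementation: the scalar coordinate scores, the `beta` value, and the top-k shortlist stand in for the real per-token gradient machinery of GCG.

```python
def momentum_step(grad, velocity, beta=0.9):
    """Accumulate an exponential moving average of the per-coordinate
    gradient heuristic (the momentum term MAC adds on top of GCG)."""
    return [beta * v + g for v, g in zip(velocity, grad)]

def top_k_candidates(scores, k):
    """Pick the k coordinates with the largest accumulated scores,
    as a GCG-style greedy search would shortlist for substitution."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# One iteration: smooth the noisy current gradient with past iterations,
# then shortlist candidate positions from the smoothed scores.
velocity = [0.0, 0.0, 0.0, 0.0]
grad = [0.2, 1.0, -0.5, 0.4]
velocity = momentum_step(grad, velocity)
shortlist = top_k_candidates(velocity, 2)
```

The moving average damps the step-to-step noise of the gradient heuristic, which is the stabilization the abstract refers to.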
https://arxiv.org/abs/2405.01229
Change detection, an interdisciplinary topic spanning computer vision and remote sensing, is currently receiving extensive attention and research. Due to the rapid development of society, the geographic information captured by remote sensing satellites is changing faster and becoming more complex, which undoubtedly poses a higher challenge for, and highlights the value of, change detection tasks. We propose the Multi-Scale Feature Depth-Supervised Network for Remote Sensing Change Detection with Global Semantic and Detail Information (MFDS-Net), with the aim of achieving a more refined description of changing buildings and geographic information, enhancing the localisation of changing targets and the acquisition of weak features. To achieve these objectives, we use a modified ResNet_34 as the backbone network for feature extraction, and DO-Conv as an alternative to traditional convolution to better capture associations among feature information and obtain better training results. We propose the Global Semantic Enhancement Module (GSEM) to enhance the processing of high-level semantic information from a global perspective. The Differential Feature Integration Module (DFIM) is proposed to strengthen the fusion of feature information at different depths, achieving the learning and extraction of differential features. The entire network is trained and optimized using a deep supervision mechanism. The experimental outcomes of MFDS-Net surpass those of current mainstream change detection networks. On the LEVIR dataset, it achieved an F1 score of 91.589 and an IoU of 84.483; on the WHU dataset, F1: 92.384 and IoU: 86.807; and on the GZ-CD dataset, F1: 86.377 and IoU: 76.021. The code is available at this https URL
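For readers comparing the reported numbers, F1 and IoU are both derived from the same pixel-level confusion counts. A minimal sketch, not tied to the MFDS-Net code:

```python
def f1_and_iou(tp, fp, fn):
    """Change-detection F1 and IoU computed from true-positive,
    false-positive, and false-negative pixel counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return f1, iou

# Example: 8 changed pixels found, 2 spurious, 2 missed.
f1, iou = f1_and_iou(tp=8, fp=2, fn=2)  # precision = recall = 0.8
```

Note that IoU is always the stricter of the two, which is why the IoU columns above sit below the F1 columns.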
https://arxiv.org/abs/2405.01065
Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of pre-trained models without requiring the original data, presenting practical benefits in contexts constrained by data privacy concerns. Current DFML methods primarily focus on recovering data from these pre-trained models. However, they suffer from slow recovery speed and overlook the gaps inherent among heterogeneous pre-trained models. In response to these challenges, we introduce the Faster and Better Data-Free Meta-Learning (FREE) framework, which contains: (i) a meta-generator for rapidly recovering training tasks from pre-trained models; and (ii) a meta-learner for generalizing to new unseen tasks. Specifically, within the Faster Inversion via Meta-Generator module, each pre-trained model is perceived as a distinct task. The meta-generator can rapidly adapt to a specific task in just five steps, significantly accelerating data recovery. Furthermore, we propose Better Generalization via Meta-Learner and introduce an implicit gradient alignment algorithm to optimize the meta-learner. This works because aligned gradient directions alleviate potential conflicts among tasks from heterogeneous pre-trained models. Empirical experiments on multiple benchmarks affirm the superiority of our approach, marking a notable speed-up (20$\times$) and performance enhancement (1.42\% $\sim$ 4.78\%) in comparison to the state-of-the-art.
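The gradient-alignment intuition can be illustrated with plain cosine similarity between flattened task gradients: a cosine near 1 means one meta-update helps both tasks, while a negative cosine signals the conflict the paper's implicit alignment is meant to reduce. This is a sketch of the diagnostic quantity, not the paper's algorithm:

```python
import math

def cosine_alignment(g1, g2):
    """Cosine similarity between two flattened task gradients."""
    dot = sum(a * b for a, b in zip(g1, g2))
    norm1 = math.sqrt(sum(a * a for a in g1))
    norm2 = math.sqrt(sum(b * b for b in g2))
    return dot / (norm1 * norm2)

# Gradients from two heterogeneous pre-trained models:
aligned = cosine_alignment([1.0, 0.0], [1.0, 0.0])       # same direction
conflicting = cosine_alignment([1.0, 0.0], [-1.0, 0.0])  # opposed directions
```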
https://arxiv.org/abs/2405.00984
Background and Purpose: Identifying the thromboembolism source in ischemic stroke is crucial for treatment and secondary prevention, yet it is often undetermined. This study describes a self-supervised deep learning approach in the digital pathology of emboli for classifying ischemic stroke clot origin from histopathological images. Methods: The dataset included whole slide images (WSI) from the STRIP AI Kaggle challenge, consisting of clots retrieved from ischemic stroke patients following mechanical thrombectomy. Transformer-based deep learning models were developed using transfer learning and self-supervised pretraining for classifying WSI. Customizations included an attention pooling layer, a weighted loss function, and threshold optimization. Various model architectures were tested and compared, and model performance was primarily evaluated using weighted logarithmic loss. Results: The model achieved a logloss score of 0.662 in cross-validation and 0.659 on the test set. Different model backbones were compared, with swin_large_patch4_window12_384 showing higher performance. Thresholding techniques for clot origin classification were employed to balance false positives and false negatives. Conclusion: The study demonstrates the efficacy of transformer-based deep learning models in identifying ischemic stroke clot origins from histopathological images and emphasizes the need for refined modeling techniques specifically adapted to thrombi WSI. Further research is needed to improve model performance and interpretability and to validate its effectiveness. Future enhancements could include integrating larger patient cohorts, advanced preprocessing strategies, and exploring ensemble multimodal methods for enhanced diagnostic accuracy.
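The threshold-optimization step in the Methods can be pictured as a scan over probability cutoffs on held-out predictions. The balancing criterion below, minimizing the gap between false-positive and false-negative counts, is an assumption chosen for illustration; the study's actual objective was the weighted log loss:

```python
def balanced_threshold(probs, labels):
    """Scan probability cutoffs and return the one that best balances
    false positives against false negatives on a validation split."""
    best_t, best_gap = 0.5, float("inf")
    for i in range(1, 100):
        t = i / 100
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        if abs(fp - fn) < best_gap:
            best_gap, best_t = abs(fp - fn), t
    return best_t

# Well-separated toy predictions: any mid-range cutoff balances the errors.
t = balanced_threshold([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```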
https://arxiv.org/abs/2405.00908
The lane graph is a key component for building high-definition (HD) maps and is crucial for downstream tasks such as autonomous driving or navigation planning. Previously, He et al. (2022) explored the extraction of the lane-level graph from aerial imagery utilizing a segmentation-based approach. However, segmentation networks struggle to achieve perfect segmentation masks, resulting in inaccurate lane graph extraction. We explore additional enhancements to refine this segmentation-based approach and extend it with a diffusion probabilistic model (DPM) component. This combination further improves the GEO F1 and TOPO F1 scores, which are crucial indicators of the quality of a lane graph, on the undirected graph in non-intersection areas. We conduct experiments on a publicly available dataset, demonstrating that our method outperforms the previous approach, particularly in enhancing the connectivity of such a graph, as measured by the TOPO F1 score. Moreover, we perform ablation studies on the individual components of our method to understand their contributions and evaluate their effectiveness.
https://arxiv.org/abs/2405.00620
Fundus photography, in combination with ultra-wide-angle fundus (UWF) techniques, has become an indispensable diagnostic tool in clinical settings by offering a more comprehensive view of the retina. Nonetheless, unlike UWF scanning laser ophthalmoscopy (UWF-SLO), UWF fluorescein angiography (UWF-FA) necessitates the administration of a fluorescent dye via injection into the patient's hand or elbow. To mitigate the potential adverse effects associated with injections, researchers have proposed cross-modality medical image generation algorithms capable of converting UWF-SLO images into their UWF-FA counterparts. Current image generation techniques applied to fundus photography encounter difficulties in producing high-resolution retinal images, particularly in capturing minute vascular lesions. To address these issues, we introduce a novel conditional generative adversarial network (UWAFA-GAN) to synthesize UWF-FA from UWF-SLO. This approach employs multi-scale generators and an attention transmit module to efficiently extract both global structures and local lesions. Additionally, to counteract the image blurriness that arises from training with misaligned data, a registration module is integrated within this framework. Our method performs non-trivially on inception scores and detail generation. Clinical user studies further indicate that the UWF-FA images generated by UWAFA-GAN are clinically comparable to authentic images in terms of diagnostic reliability. Empirical evaluations on our proprietary UWF image datasets show that UWAFA-GAN outperforms extant methodologies. The code is accessible at this https URL.
https://arxiv.org/abs/2405.00542
Addressing the challenge of automated geometry math problem-solving in artificial intelligence (AI) involves understanding multi-modal information and mathematics. Current methods struggle with accurately interpreting geometry diagrams, which hinders effective problem-solving. To tackle this issue, we present the Geometry problem sOlver with natural Language Description (GOLD) model. GOLD enhances the extraction of geometric relations by separately processing symbols and geometric primitives within the diagram. Subsequently, it converts the extracted relations into natural language descriptions, efficiently utilizing large language models to solve geometry math problems. Experiments show that the GOLD model outperforms the Geoformer model, the previous best method on the UniGeo dataset, achieving accuracy improvements of 12.7% and 42.1% on the calculation and proving subsets, respectively. Additionally, it surpasses PGPSNet, the former best model on the PGPS9K and Geometry3K datasets, with accuracy gains of 1.8% and 3.2%, respectively.
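The relation-to-language conversion at the heart of GOLD can be pictured as template rendering over extracted relation tuples. The relation schema and templates below are hypothetical, for illustration only; the paper's extraction pipeline is far richer:

```python
# Hypothetical templates keyed by relation name.
TEMPLATES = {
    "parallel": "Line {0} is parallel to line {1}.",
    "perpendicular": "Line {0} is perpendicular to line {1}.",
    "on_circle": "Point {0} lies on circle {1}.",
}

def relations_to_text(relations):
    """Render (relation, *args) tuples as natural-language sentences an LLM
    can consume in place of the raw diagram."""
    return " ".join(TEMPLATES[name].format(*args) for name, *args in relations)

desc = relations_to_text([("parallel", "AB", "CD"), ("on_circle", "P", "O")])
```

The point of the design is that once the diagram is verbalized, an off-the-shelf language model needs no vision component to reason about it.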
https://arxiv.org/abs/2405.00494
We introduce AdaMoLE, a novel method for fine-tuning large language models (LLMs) through an Adaptive Mixture of Low-Rank Adaptation (LoRA) Experts. Moving beyond conventional methods that employ a static top-k strategy for activating experts, AdaMoLE dynamically adjusts the activation threshold using a dedicated threshold network, adaptively responding to the varying complexities of different tasks. By replacing a single LoRA in a layer with multiple LoRA experts and integrating a gating function with the threshold mechanism, AdaMoLE effectively selects and activates the most appropriate experts based on the input context. Our extensive evaluations across a variety of commonsense reasoning and natural language processing tasks show that AdaMoLE exceeds baseline performance. This enhancement highlights the advantages of AdaMoLE's adaptive selection of LoRA experts, improving model effectiveness without a corresponding increase in the expert count. The experimental validation not only confirms AdaMoLE as a robust approach for enhancing LLMs but also suggests valuable directions for future research in adaptive expert selection mechanisms, potentially broadening the scope for optimizing model performance across diverse language processing tasks.
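The adaptive gating can be sketched as thresholded softmax routing: experts whose gate weight clears an input-dependent threshold stay active and are renormalized, so the number of active experts varies with the input rather than being a fixed top-k. The scalar `threshold` below is a stand-in for the output of the paper's threshold network:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def adaptive_gate(gate_logits, threshold):
    """Zero out experts whose gate weight falls below the threshold and
    renormalize the survivors; fall back to the single best expert if
    none clears it."""
    weights = softmax(gate_logits)
    kept = [w if w >= threshold else 0.0 for w in weights]
    if sum(kept) == 0.0:
        best = max(range(len(weights)), key=weights.__getitem__)
        kept = [1.0 if i == best else 0.0 for i in range(len(weights))]
    total = sum(kept)
    return [k / total for k in kept]

mix = adaptive_gate([2.0, 0.1, 0.1], threshold=0.3)  # only expert 0 survives
```

A confident input activates one expert; an ambiguous input with flatter gate logits activates several, which is the adaptivity the abstract contrasts with static top-k.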
https://arxiv.org/abs/2405.00361
Large language models (LLMs) suffer from low efficiency due to the mismatch between the requirements of auto-regressive decoding and the design of most contemporary GPUs. Specifically, billions to trillions of parameters must be loaded into the GPU cache through its limited memory bandwidth for computation, but only a small batch of tokens is actually computed. Consequently, the GPU spends most of its time on memory transfer instead of computation. Recently, parallel decoding, a type of speculative decoding algorithm, has become more popular and has demonstrated impressive efficiency improvements in generation. It introduces extra decoding heads to large models, enabling them to predict multiple subsequent tokens simultaneously and verify these candidate continuations in a single decoding step. However, this approach deviates from the next-token-prediction training objective used during pre-training, resulting in a low hit rate for candidate tokens. In this paper, we propose a new speculative decoding algorithm, Clover, which integrates sequential knowledge into the parallel decoding process. This enhancement improves the hit rate of speculators and thus boosts the overall efficiency. Clover transmits the sequential knowledge from pre-speculated tokens via the Regressive Connection, then employs an Attention Decoder to integrate these speculated tokens. Additionally, Clover incorporates an Augmenting Block that modifies the hidden states to better align with the purpose of speculative generation rather than next token prediction. The experimental results demonstrate that Clover outperforms the baseline by up to 91% on Baichuan-Small and 146% on Baichuan-Large, and exceeds the performance of the previously top-performing method, Medusa, by up to 37% on Baichuan-Small and 57% on Baichuan-Large.
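The single-step verification shared by parallel and speculative decoders is easy to sketch: the speculated continuation is accepted only up to the first position where the target model disagrees, so the "hit rate" the abstract mentions is literally the length of this accepted prefix. This is the generic greedy acceptance rule, not Clover's Regressive Connection or Attention Decoder:

```python
def accept_speculated(speculated, target_preferred):
    """Keep the longest prefix of speculated tokens that matches what the
    target model would have emitted greedily at each position."""
    accepted = []
    for s, t in zip(speculated, target_preferred):
        if s != t:
            break
        accepted.append(s)
    return accepted

# Three speculated token ids; the target model disagrees on the third,
# so two tokens are produced for the price of one decoding step.
hits = accept_speculated([11, 42, 7], [11, 42, 9])
```

Every accepted token amortizes one full forward pass, which is why raising the hit rate translates directly into throughput.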
https://arxiv.org/abs/2405.00263
In the evolving landscape of computer vision, foundation models have emerged as pivotal tools, exhibiting exceptional adaptability to a myriad of tasks. Among these, the Segment Anything Model (SAM) by Meta AI has distinguished itself in image segmentation. However, SAM, like its counterparts, encounters limitations in specific niche applications, prompting a quest for enhancement strategies that do not compromise its inherent capabilities. This paper introduces ASAM, a novel methodology that amplifies SAM's performance through adversarial tuning. We harness the potential of natural adversarial examples, inspired by their successful implementation in natural language processing. By utilizing a stable diffusion model, we augment a subset (1%) of the SA-1B dataset, generating adversarial instances that are more representative of natural variations rather than conventional imperceptible perturbations. Our approach maintains the photorealism of adversarial examples and ensures alignment with original mask annotations, thereby preserving the integrity of the segmentation task. The fine-tuned ASAM demonstrates significant improvements across a diverse range of segmentation tasks without necessitating additional data or architectural modifications. The results of our extensive evaluations confirm that ASAM establishes new benchmarks in segmentation tasks, thereby contributing to the advancement of foundational models in computer vision. Our project page is in this https URL.
https://arxiv.org/abs/2405.00256
Table detection within document images is a crucial task in document processing, involving the identification and localization of tables. Recent strides in deep learning have substantially improved the accuracy of this task, but it still heavily relies on large labeled datasets for effective training. Several semi-supervised approaches have emerged to overcome this challenge, often employing CNN-based detectors with anchor proposals and post-processing techniques like non-maximal suppression (NMS). However, recent advancements in the field have shifted the focus towards transformer-based techniques, eliminating the need for NMS and emphasizing object queries and attention mechanisms. Previous research has focused on two key areas to improve transformer-based detectors: refining the quality of object queries and optimizing attention mechanisms. However, increasing the number of object queries can introduce redundancy, while adjustments to the attention mechanism can increase complexity. To address these challenges, we introduce a semi-supervised approach employing SAM-DETR, a novel method for precise alignment between object queries and target features. Our approach demonstrates remarkable reductions in false positives and substantial enhancements in table detection performance, particularly in complex documents characterized by diverse table structures. This work enables more efficient and accurate table detection in semi-supervised settings.
https://arxiv.org/abs/2405.00187
In recent years, zero-shot learning has attracted the focus of many researchers due to its flexibility and generality. Many approaches have been proposed to achieve zero-shot classification of point clouds for 3D object understanding, following the schema of CLIP. However, in the real world, point clouds can be extremely sparse, dramatically limiting the effectiveness of 3D point cloud encoders and resulting in misalignment between point cloud features and text embeddings. To enable point cloud encoders to fit extremely sparse point clouds without re-running the pre-training procedure, which could be time-consuming and expensive, we propose in this work an unsupervised model adaptation approach to enhance the point cloud encoder for extremely sparse point clouds. We propose a novel fused-cross attention layer that expands the pre-trained self-attention layer with additional learnable tokens and attention blocks, which effectively modifies the point cloud features while maintaining the alignment between point cloud features and text embeddings. We also propose a complementary learning-based self-distillation schema that encourages the modified features to be pulled apart from irrelevant text embeddings without overfitting the feature space to the observed text embeddings. Extensive experiments demonstrate that the proposed approach effectively increases the zero-shot capability on extremely sparse point clouds and outperforms other state-of-the-art model adaptation approaches.
https://arxiv.org/abs/2404.19639
Diffusion models have emerged as effective tools for generating diverse and high-quality content. However, their capability in high-resolution image generation, particularly for panoramic images, still faces challenges such as visible seams and incoherent transitions. In this paper, we propose TwinDiffusion, an optimized framework designed to address these challenges through two key innovations: Crop Fusion for quality enhancement and Cross Sampling for efficiency optimization. We introduce a training-free optimizing stage to refine the similarity of the adjacent image areas, as well as an interleaving sampling strategy to yield dynamic patches during the cropping process. A comprehensive evaluation is conducted to compare TwinDiffusion with the existing methods, considering factors including coherence, fidelity, compatibility, and efficiency. The results demonstrate the superior performance of our approach in generating seamless and coherent panoramas, setting a new standard in quality and efficiency for panoramic image generation.
https://arxiv.org/abs/2404.19475
Ensuring intelligible speech communication for hearing assistive devices in low-latency scenarios presents significant challenges in terms of speech enhancement, coding, and transmission. In this paper, we propose novel solutions for low-latency joint speech transmission and enhancement, leveraging deep neural networks (DNNs). Our approach integrates two state-of-the-art DNN architectures, one for low-latency speech enhancement and one for low-latency analog joint source-channel-based transmission, creating a combined low-latency system and jointly training both systems in an end-to-end approach. Because of the computational demands of the enhancement system, this ordering is suitable when high computational power is unavailable at the decoder, as in hearing assistive devices. The proposed system enables the configuration of total latency, achieving high performance even at latencies as low as 3 ms, which is typically challenging to attain. The simulation results provide compelling evidence that a joint enhancement and transmission system is superior to a simple concatenation system in diverse settings, encompassing various wireless channel conditions, latencies, and background noise scenarios.
https://arxiv.org/abs/2404.19375
A vision-based drone-to-drone detection system is crucial for various applications like collision avoidance, countering hostile drones, and search-and-rescue operations. However, detecting drones presents unique challenges, including small object sizes, distortion, occlusion, and real-time processing requirements. Current methods integrating multi-scale feature fusion and temporal information have limitations in handling extreme blur and minuscule objects. To address this, we propose a novel coarse-to-fine detection strategy based on vision transformers. We evaluate our approach on three challenging drone-to-drone detection datasets, achieving F1 score enhancements of 7%, 3%, and 1% on the FL-Drones, AOT, and NPS-Drones datasets, respectively. Additionally, we demonstrate real-time processing capabilities by deploying our model on an edge-computing device. Our code will be made publicly available.
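The coarse-to-fine strategy can be illustrated in one dimension: scan a coarse grid of candidate positions, then re-scan a fine grid around the coarse winner. The `score` callable stands in for a detector's confidence, and the 1-D search is a deliberate simplification of the 2-D image case in the paper:

```python
def coarse_to_fine(score, size, coarse_step, fine_step):
    """Two-stage search: a coarse scan over the full range, then a fine scan
    restricted to a window around the best coarse candidate."""
    coarse_best = max(range(0, size, coarse_step), key=score)
    lo = max(0, coarse_best - coarse_step)
    hi = min(size, coarse_best + coarse_step)
    return max(range(lo, hi, fine_step), key=score)

# A tiny target "appears" at position 37; the peaked score localizes it
# without ever scoring all 100 fine positions.
peak = coarse_to_fine(lambda x: -(x - 37) ** 2, size=100, coarse_step=10, fine_step=1)
```

The efficiency argument is the same as in the paper: full-resolution scoring is confined to a small window, which is what makes small, blurred objects affordable in real time.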
https://arxiv.org/abs/2404.19276
Health literacy is crucial to supporting good health and is a major national goal. Audio delivery is becoming a more popular way of obtaining information. In this study, we evaluate the effect of audio enhancements, in the form of information emphasis and pauses, on health texts of varying difficulty, and we measure health information comprehension and retention. We produced audio snippets from difficult and easy texts and conducted the study on Amazon Mechanical Turk (AMT). Our findings suggest that emphasis matters for both information comprehension and retention. When there is no added pause, emphasizing significant information can lower the perceived difficulty of both difficult and easy texts. Comprehension is higher (54%) with correctly placed emphasis for the difficult texts than without added emphasis (50%). Adding a pause lowers perceived difficulty and can improve retention but adversely affects information comprehension.
https://arxiv.org/abs/2404.19119
This paper introduces YOLOv8-TO, a novel approach for reverse engineering of topology-optimized structures into interpretable geometric parameters using the YOLOv8 instance segmentation model. Density-based topology optimization methods require post-processing to convert the optimal density distribution into a parametric representation for design exploration and integration with CAD tools. Traditional methods such as skeletonization struggle with complex geometries and require manual intervention. YOLOv8-TO addresses these challenges by training a custom YOLOv8 model to automatically detect and reconstruct structural components from binary density distributions. The model is trained on a diverse dataset of both optimized and random structures generated using the Moving Morphable Components method. A custom reconstruction loss function based on the dice coefficient of the predicted geometry is used to train the new regression head of the model via self-supervised learning. The method is evaluated on test sets generated from different topology optimization methods, including out-of-distribution samples, and compared against a skeletonization approach. Results show that YOLOv8-TO significantly outperforms skeletonization in reconstructing visually and structurally similar designs. The method showcases an average improvement of 13.84% in the Dice coefficient, with peak enhancements reaching 20.78%. The method demonstrates good generalization to complex geometries and fast inference times, making it suitable for integration into design workflows using regular workstations. Limitations include the sensitivity to non-max suppression thresholds. YOLOv8-TO represents a significant advancement in topology optimization post-processing, enabling efficient and accurate reverse engineering of optimized structures for design exploration and manufacturing.
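The custom reconstruction loss above is based on the Dice coefficient between predicted and target geometry. A minimal NumPy sketch of a Dice coefficient and the corresponding loss over binary masks (the 4×4 toy masks are illustrative, not from the paper's dataset):

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks: 2|A∩B| / (|A| + |B|).

    Ranges from 0 (no overlap) to 1 (perfect overlap); eps guards
    against division by zero when both masks are empty.
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def dice_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Loss to minimize during training: 1 - Dice."""
    return 1.0 - dice_coefficient(pred, target)

# Two toy 4x4 binary density masks that mostly overlap.
a = np.array([[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
b = np.array([[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
print(round(dice_coefficient(a, b), 3))  # 2*3 / (4+3) ≈ 0.857
```

In practice a differentiable (soft) variant over continuous predictions would be used for gradient-based training; the discrete version here just illustrates the metric.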
https://arxiv.org/abs/2404.18763
Recent developments in neural rendering techniques have greatly enhanced the rendering of photo-realistic 3D scenes across both academic and commercial fields. The latest method, known as 3D Gaussian Splatting (3D-GS), has set new benchmarks for rendering quality and speed. Nevertheless, the limitations of 3D-GS become pronounced in synthesizing new viewpoints, especially for views that greatly deviate from those seen during training. Additionally, issues such as dilation and aliasing arise when zooming in or out. These challenges can all be traced back to a single underlying issue: insufficient sampling. In our paper, we present a bootstrapping method that directly addresses this problem. This approach employs a diffusion model to enhance the rendering of novel views using trained 3D-GS, thereby streamlining the training process. Our results indicate that bootstrapping effectively reduces artifacts and yields clear improvements on the evaluation metrics. Furthermore, we show that our method is versatile and can be easily integrated, allowing various 3D reconstruction projects to benefit from our approach.
https://arxiv.org/abs/2404.18669