Multimodal large language models (MLLMs) have demonstrated remarkable comprehension and reasoning capabilities with complex language and visual data. These advances have spurred the vision of establishing a generalist robotic MLLM proficient in understanding complex human instructions and accomplishing various embodied tasks. However, developing MLLMs for real-world robots is challenging because robotic platforms typically offer limited computation and memory capacity, whereas MLLM inference involves storing billions of parameters and performing a large amount of computation, imposing significant hardware demands. In this paper, we propose a Dynamic Early-Exit Framework for Robotic Vision-Language-Action Model (DeeR-VLA, or simply DeeR) that automatically adjusts the size of the activated MLLM based on each situation at hand. The approach leverages a multi-exit architecture in MLLMs, which allows the model to terminate processing once a proper size of the model has been activated for a specific situation, thus avoiding further redundant computation. Additionally, we develop novel algorithms that establish early-termination criteria for DeeR, conditioned on predefined demands such as average computational cost (i.e., power consumption), as well as peak computational consumption (i.e., latency) and GPU memory usage. These enhancements ensure that DeeR operates efficiently under varying resource constraints while maintaining competitive performance. On the CALVIN robot manipulation benchmark, DeeR reduces the computational cost of the LLM by 5.2-6.5x and the GPU memory of the LLM by 2-6x without compromising performance. Code and checkpoints are available at this https URL.
https://arxiv.org/abs/2411.02359
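The multi-exit idea in the DeeR abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `early_exit_forward`, the toy layer and head callables, and the termination rule (stop when two adjacent exits predict nearly the same action) are all assumptions made for illustration. The abstract only states that termination criteria are derived from compute, latency, and memory budgets.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def early_exit_forward(
    x: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    exit_heads: Sequence[Callable[[Vector], Vector]],
    threshold: float,
) -> Tuple[Vector, int]:
    """Run layers one at a time and stop once adjacent exits agree.

    Returns the predicted action and the number of layers actually used.
    """
    prev_action = None
    action: Vector = []
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        x = layer(x)
        action = head(x)
        if prev_action is not None:
            # Hypothetical criterion: largest change between adjacent exits.
            change = max(abs(a - b) for a, b in zip(action, prev_action))
            if change < threshold:
                return action, depth  # skip the remaining layers
        prev_action = action
    return action, len(layers)
```

In this sketch a larger `threshold` trades accuracy for compute: the model exits earlier, so fewer layers need to be evaluated or kept in memory.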
This study investigates the impact of integrating DevSecOps and Generative Artificial Intelligence (GAI) on software delivery performance within technology firms. Utilizing a qualitative research methodology, the research involved semi-structured interviews with industry practitioners and analysis of case studies from organizations that have successfully implemented these methodologies. The findings reveal significant enhancements in research and development (R&D) efficiency, improved source code management, and heightened software quality and security. The integration of GAI facilitated automation of coding tasks and predictive analytics, while DevSecOps ensured that security measures were embedded throughout the development lifecycle. Despite the promising results, the study identifies gaps related to the generalizability of the findings due to the limited sample size and the qualitative nature of the research. This paper contributes valuable insights into the practical implementation of DevSecOps and GAI, highlighting their potential to transform software delivery processes in technology firms. Future research directions include quantitative assessments of the impact on specific business outcomes and comparative studies across different industries.
https://arxiv.org/abs/2411.02255
The accuracy of face recognition systems has improved significantly in the past few years, thanks to the large amount of data collected and the advancement in neural network architectures. However, these large-scale datasets are often collected without explicit consent, raising ethical and privacy concerns. To address this, there have been proposals to use synthetic datasets for training face recognition models. Yet, such models still rely on real data to train the generative models and generally exhibit inferior performance compared to those trained on real datasets. One of these datasets, DigiFace, uses a graphics pipeline to generate different identities and different intra-class variations without using real data in training the models. However, the performance of this approach is poor on face recognition benchmarks, possibly due to the lack of realism in the images generated from the graphics pipeline. In this work, we introduce a novel framework for realism transfer aimed at enhancing the realism of synthetically generated face images. Our method leverages a large-scale face foundation model, and we adapt the pipeline for realism enhancement. By integrating the controllable aspects of the graphics pipeline with our realism enhancement technique, we generate a large number of realistic variations, combining the advantages of both approaches. Our empirical evaluations demonstrate that models trained using our enhanced dataset significantly improve the performance of face recognition systems over the baseline. The source code and datasets will be made available publicly.
https://arxiv.org/abs/2411.02188
In this study, we introduce a novel multi-modal biometric authentication system that integrates facial, vocal, and signature data to enhance security measures. Utilizing a combination of Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), our model architecture uniquely incorporates dual shared layers alongside modality-specific enhancements for comprehensive feature extraction. The system undergoes rigorous training with a joint loss function, optimizing for accuracy across diverse biometric inputs. Feature-level fusion via Principal Component Analysis (PCA) and classification through Gradient Boosting Machines (GBM) further refine the authentication process. Our approach demonstrates significant improvements in authentication accuracy and robustness, paving the way for advanced secure identity verification solutions.
https://arxiv.org/abs/2411.02112
The pursuit of realism in cinema has driven significant advancements in animatronics, where the integration of mechatronics, a multidisciplinary field that combines mechanical engineering, electronics, and computer science, plays a pivotal role in enhancing the functionality and realism of animatronics. This interdisciplinary approach facilitates smoother character movements and enhances the sophistication of behaviors in animatronic creatures, thereby increasing their realism. This article examines the most recent developments in mechatronic technology and their significant impact on the art and engineering of animatronics in filmmaking. It explores the sophisticated integration of system components and analyzes how these enhancements foster complexity and integration, crucial for achieving unprecedented levels of realism in modern cinema. Further, the article presents in-depth case studies of well-known movie characters, demonstrating the practical applicability of these state-of-the-art mechatronic solutions in creating compelling, lifelike cinematic experiences. This paper aims to bridge the gap between the technical aspects of mechatronics and the creative demands of the film industry, ultimately contributing to the ongoing evolution of cinematic realism.
https://arxiv.org/abs/2411.02102
Deep learning-based speech enhancement (SE) methods often face significant computational challenges when needing to meet low-latency requirements because of the increased number of frames to be processed. This paper introduces the SlowFast framework which aims to reduce computation costs specifically when low-latency enhancement is needed. The framework consists of a slow branch that analyzes the acoustic environment at a low frame rate, and a fast branch that performs SE in the time domain at the needed higher frame rate to match the required latency. Specifically, the fast branch employs a state space model where its state transition process is dynamically modulated by the slow branch. Experiments on an SE task with a 2 ms algorithmic latency requirement using the Voice Bank + Demand dataset show that our approach reduces computation cost by 70% compared to a baseline single-branch network with equivalent parameters, without compromising enhancement performance. Furthermore, by leveraging the SlowFast framework, we implemented a network that achieves an algorithmic latency of just 60 μs (one sample point at 16 kHz sample rate) with a computation cost of 100 M MACs/s, while scoring a PESQ-NB of 3.12 and SISNR of 16.62.
https://arxiv.org/abs/2411.02019
Accurate annotation of educational resources is critical in the rapidly advancing field of online education due to the complexity and volume of content. Existing classification methods face challenges with semantic overlap and distribution imbalance of labels in the multi-label context, which impedes effective personalized learning and resource recommendation. This paper introduces RR2QC, a novel Retrieval Reranking method To multi-label Question Classification by leveraging label semantics and meta-label refinement. Firstly, RR2QC leverages semantic relationships within and across label groups to enhance pre-training strategies in the multi-label context. Next, a class center learning task is introduced, integrating label texts into downstream training to ensure questions consistently align with label semantics, retrieving the most relevant label sequences. Finally, this method decomposes labels into meta-labels and trains a meta-label classifier to rerank the retrieved label sequences. In doing so, RR2QC enhances the understanding and prediction capability of long-tail labels by learning from meta-labels frequently appearing in other labels. Additionally, a Math LLM is used to generate solutions for questions, extracting latent information to further refine the model's insights. Experimental results demonstrate that RR2QC outperforms existing classification methods in Precision@k and F1 scores across multiple educational datasets, establishing it as a potent enhancement for online educational content utilization.
https://arxiv.org/abs/2411.01841
Semi-supervised learning (SSL) offers a robust framework for harnessing the potential of unannotated data. Traditionally, SSL mandates that all classes possess labeled instances. However, the emergence of open-world SSL (OwSSL) introduces a more practical challenge, wherein unlabeled data may encompass samples from unseen classes. This scenario leads to misclassification of unseen classes as known ones, consequently undermining classification accuracy. To overcome this challenge, this study revisits two methodologies from self-supervised and semi-supervised learning, self-labeling and consistency, tailoring them to address the OwSSL problem. Specifically, we propose an effective framework called OwMatch, combining conditional self-labeling and open-world hierarchical thresholding. Theoretically, we analyze the estimation of class distribution on unlabeled data through rigorous statistical analysis, thus demonstrating that OwMatch can ensure the unbiasedness of the self-label assignment estimator with reliability. Comprehensive empirical analyses demonstrate that our method yields substantial performance enhancements across both known and unknown classes in comparison to previous studies. Code is available at this https URL.
https://arxiv.org/abs/2411.01833
We introduce DreamPolish, a text-to-3D generation model that excels in producing refined geometry and high-quality textures. In the geometry construction phase, our approach leverages multiple neural representations to enhance the stability of the synthesis process. Instead of relying solely on a view-conditioned diffusion prior in the novel sampled views, which often leads to undesired artifacts in the geometric surface, we incorporate an additional normal estimator to polish the geometry details, conditioned on viewpoints with varying fields of view. We propose to add a surface polishing stage with only a few training steps, which can effectively refine the artifacts attributed to limited guidance from previous stages and produce 3D objects with more desirable geometry. The key challenge of texture generation using pretrained text-to-image models is to find a suitable domain in the vast latent distribution of these models that contains photorealistic and consistent renderings. In the texture generation phase, we introduce a novel score distillation objective, namely domain score distillation (DSD), to guide neural representations toward such a domain. We draw inspiration from the classifier-free guidance (CFG) in text-conditioned image generation tasks and show that CFG and variational distribution guidance represent distinct aspects in gradient guidance and are both imperative domains for the enhancement of texture quality. Extensive experiments show our proposed model can produce 3D assets with polished surfaces and photorealistic textures, outperforming existing state-of-the-art methods.
https://arxiv.org/abs/2411.01602
Background: Cone-beam computed tomography (CBCT) plays a crucial role in image-guided radiotherapy, but artifacts and noise make CBCT images unsuitable for accurate dose calculation. Artificial intelligence methods have shown promise in enhancing CBCT quality to produce synthetic CT (sCT) images. However, existing methods either produce images of suboptimal quality or incur excessive time costs, failing to satisfy clinical practice standards. Methods and materials: We propose a novel hybrid conditional latent diffusion model for efficient and accurate CBCT-to-CT synthesis, named HC$^3$L-Diff. We employ the Unified Feature Encoder (UFE) to compress images into a low-dimensional latent space, thereby optimizing computational efficiency. Beyond the use of CBCT images, we propose integrating their high-frequency knowledge as a hybrid condition to guide the diffusion model in generating sCT images with preserved structural details. This high-frequency information is captured using our designed High-Frequency Extractor (HFE). During inference, we use a denoising diffusion implicit model to facilitate rapid sampling. We construct a new in-house prostate dataset with paired CBCT and CT to validate the effectiveness of our method. Results: Extensive experimental results demonstrate that our approach outperforms state-of-the-art methods in terms of sCT quality and generation efficiency. Moreover, our medical physicist conducted dosimetric evaluations to validate the benefit of our method in practical dose calculation, achieving a 93.8% gamma passing rate with a 2%/2mm criterion, superior to other methods. Conclusion: The proposed HC$^3$L-Diff can efficiently achieve high-quality CBCT-to-CT synthesis in just over 2 minutes per patient. Its promising performance in dose calculation shows great potential for enhancing real-world adaptive radiotherapy.
https://arxiv.org/abs/2411.01575
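The 2%/2mm gamma passing rate quoted in the HC$^3$L-Diff abstract can be made concrete with a simplified sketch. The version below works on a 1D dose profile with global normalisation and no interpolation or low-dose cutoff. These simplifications, and the function name `gamma_pass_rate`, are assumptions for illustration; clinical tools evaluate 2D or 3D dose grids and the paper's exact settings are not given in the abstract.

```python
import math
from typing import Sequence


def gamma_pass_rate(
    reference: Sequence[float],
    evaluated: Sequence[float],
    spacing_mm: float,
    dose_tol: float = 0.02,    # 2% of the maximum reference dose
    dist_tol_mm: float = 2.0,  # 2 mm distance-to-agreement
) -> float:
    """Fraction of reference points whose gamma index is at most 1."""
    max_ref = max(reference)
    if max_ref <= 0:
        raise ValueError("reference profile must contain a positive dose")
    dose_crit = dose_tol * max_ref  # global dose-difference criterion
    passed = 0
    for i, d_ref in enumerate(reference):
        best = math.inf
        for j, d_eval in enumerate(evaluated):
            dist = (i - j) * spacing_mm
            diff = d_eval - d_ref
            # Squared gamma: combined distance and dose disagreement.
            gamma_sq = (dist / dist_tol_mm) ** 2 + (diff / dose_crit) ** 2
            best = min(best, gamma_sq)
        if best <= 1.0:
            passed += 1
    return passed / len(reference)
```

A point passes when some nearby evaluated point agrees closely enough in dose, in position, or in a combination of both, which is why a small spatial shift of an otherwise identical profile can still pass.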
Landmark digitization is essential in geometric morphometrics, enabling the quantification of biological shapes, such as facial structures, for in-depth morphological analysis. Traditional landmarking, which identifies specific anatomical points, can be complemented by semilandmarks when precise locations are challenging to define. However, manual placement of numerous landmarks is time-consuming and prone to human error, leading to inconsistencies across studies. To address this, we introduce FaceDig, an AI-powered tool designed to automate landmark placement with human-level precision, focusing on anatomically sound facial points. FaceDig is open-source and integrates seamlessly with analytical platforms like R and Python. It was trained using one of the largest and most ethnically diverse face datasets, applying a landmark configuration optimized for 2D en face photographs. Our results demonstrate that FaceDig provides reliable landmark coordinates, comparable to those placed manually by experts. The tool's output is compatible with the widely used TpsDig2 software, facilitating adoption and ensuring consistency across studies. Users are advised to work with standardized facial images and visually inspect the results for potential corrections. Despite the growing preference for 3D morphometrics, 2D facial photographs remain valuable due to their cultural and practical significance. Future enhancements to FaceDig will include support for profile views, further expanding its utility. By offering a standardized approach to landmark placement, FaceDig promotes reproducibility in facial morphology research and provides a robust alternative to existing 2D tools.
https://arxiv.org/abs/2411.01508
Retinal fundus photography enhancement is important for diagnosing and monitoring retinal diseases. However, early approaches to retinal image enhancement, such as those based on Generative Adversarial Networks (GANs), often struggle to preserve the complex topological information of blood vessels, resulting in spurious or missing vessel structures. The persistence diagram, which captures topological features based on the persistence of topological structures under different filtrations, provides a promising way to represent the structure information. In this work, we propose a topology-preserving training paradigm that regularizes blood vessel structures by minimizing the differences of persistence diagrams. We call the resulting framework Topology Preserving Optimal Transport (TPOT). Experimental results on a large-scale dataset demonstrate the superiority of the proposed method compared to several state-of-the-art supervised and unsupervised techniques, both in terms of image quality and performance in the downstream blood vessel segmentation task. The code is available at this https URL.
https://arxiv.org/abs/2411.01403
In medical imaging, accurate diagnosis heavily relies on effective image enhancement techniques, particularly for X-ray images. Existing methods often sacrifice global image characteristics in favor of local ones, or vice versa. In this paper, we present a novel approach, called G-CLAHE (Global-Contrast Limited Adaptive Histogram Equalization), which is well suited to medical imaging with a focus on X-rays. The method builds on Global Histogram Equalization (GHE) and Contrast Limited Adaptive Histogram Equalization (CLAHE), combining the strengths of both while avoiding their weaknesses so that local and global characteristics are preserved. Experimental results show that it significantly improves on current state-of-the-art algorithms, addressing their limitations and enhancing the contrast and quality of X-ray images for diagnostic accuracy.
https://arxiv.org/abs/2411.01373
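As background for the G-CLAHE entry, the sketch below shows plain global histogram equalization (GHE), one of the two techniques the abstract says the method builds on. It is not G-CLAHE itself, whose details the abstract does not give. The function name `equalize_global` and the list-of-lists image representation are choices made for this illustration.

```python
from typing import List


def equalize_global(img: List[List[int]], levels: int = 256) -> List[List[int]]:
    """Global histogram equalization of an integer grayscale image."""
    # Count how often each gray level occurs.
    hist = [0] * levels
    for row in img:
        for v in row:
            hist[v] += 1

    # Cumulative distribution over gray levels.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    total = cdf[-1]
    cdf_min = next(c for c in cdf if c > 0)  # CDF at the darkest level present
    den = total - cdf_min
    if den == 0:
        # Constant image: nothing to stretch.
        return [row[:] for row in img]

    def remap(v: int) -> int:
        num = (cdf[v] - cdf_min) * (levels - 1)
        return (2 * num + den) // (2 * den)  # integer round-to-nearest

    return [[remap(v) for v in row] for row in img]
```

Because one mapping is applied to the whole image, GHE preserves global intensity ordering but can wash out local detail, which is the limitation that tile-based methods such as CLAHE target.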
With the scale of vision Transformer-based models continuing to grow, finetuning these large-scale pretrained models for new tasks has become increasingly parameter-intensive. Visual prompt tuning was introduced as a parameter-efficient finetuning (PEFT) method in response to this trend. Despite its successes, a notable research challenge persists within almost all PEFT approaches: significant performance degradation is observed when there is a substantial disparity between the datasets applied in pretraining and finetuning phases. To address this challenge, we draw inspiration from human visual cognition, and propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models. Our approach incorporates the Fast Fourier Transform into prompt embeddings and considers both spatial and frequency domain information. Apart from its inherent simplicity and intuitiveness, VFPT exhibits superior performance across all datasets, offering a general solution to dataset challenges, irrespective of data disparities. Empirical results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks, with low parameter usage (e.g., 0.57% of model parameters on VTAB-1k) and notable performance enhancements (e.g., 73.20% of mean accuracy on VTAB-1k). Our code is available at this https URL.
https://arxiv.org/abs/2411.01327
This paper takes a step towards modeling the modality discrepancy in the cross-spectral re-identification task. Based on the Lambertian model, we observe that the non-linear modality discrepancy mainly comes from diverse linear transformations acting on the surface of different materials. From this view, we unify all data augmentation strategies for cross-spectral re-identification by mimicking such local linear transformations and categorizing them into moderate transformation and radical transformation. By extending the observation, we propose a Random Linear Enhancement (RLE) strategy which includes Moderate Random Linear Enhancement (MRLE) and Radical Random Linear Enhancement (RRLE) to push the boundaries of both types of transformation. Moderate Random Linear Enhancement is designed to provide diverse image transformations that satisfy the original linear correlations under constrained conditions, whereas Radical Random Linear Enhancement seeks to generate local linear transformations directly without relying on external information. The experimental results not only demonstrate the superiority and effectiveness of RLE but also confirm its great potential as a general-purpose data augmentation for cross-spectral re-identification. The code is available at this https URL.
https://arxiv.org/abs/2411.01225
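The notion of a local linear transformation in the RLE abstract can be sketched as follows: choose a random rectangular region and apply `a * pixel + b` inside it. This is only an illustration of the general idea, not the paper's MRLE or RRLE procedure. The function name `random_local_linear`, the parameter ranges, and the grayscale list-of-lists image format are assumptions.

```python
import random
from typing import List, Tuple

Image = List[List[float]]  # grayscale image with values in [0, 1]


def random_local_linear(
    img: Image,
    rng: random.Random,
    scale_range: Tuple[float, float] = (0.5, 1.5),
    shift_range: Tuple[float, float] = (-0.1, 0.1),
) -> Image:
    """Apply one random linear map a*x + b to a random rectangle."""
    h, w = len(img), len(img[0])
    # Pick a non-empty rectangle [top, bottom) x [left, right).
    top = rng.randrange(h)
    left = rng.randrange(w)
    bottom = rng.randint(top + 1, h)
    right = rng.randint(left + 1, w)
    a = rng.uniform(*scale_range)
    b = rng.uniform(*shift_range)

    out = [row[:] for row in img]  # leave the input untouched
    for r in range(top, bottom):
        for c in range(left, right):
            # Clamp so the result stays a valid intensity.
            out[r][c] = min(1.0, max(0.0, a * img[r][c] + b))
    return out
```

Passing a seeded `random.Random` makes the augmentation reproducible, which is convenient when comparing training runs.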
Federated Learning (FL) is essential for efficient data exchange in Internet of Things (IoT) environments, as it trains Machine Learning (ML) models locally and shares only model updates. However, FL is vulnerable to privacy threats like model inversion and membership inference attacks, which can expose sensitive training data. To address these privacy concerns, Differential Privacy (DP) mechanisms are often applied. Yet, adding DP noise to black-box ML models degrades performance, especially in dynamic IoT systems where continuous, lifelong FL learning accumulates excessive noise over time. To mitigate this issue, we introduce Federated HyperDimensional computing with Privacy-preserving (FedHDPrivacy), an eXplainable Artificial Intelligence (XAI) framework that combines the neuro-symbolic paradigm with DP. FedHDPrivacy carefully manages the balance between privacy and performance by theoretically tracking cumulative noise from previous rounds and adding only the necessary incremental noise to meet privacy requirements. In a real-world case study involving in-process monitoring of manufacturing machining operations, FedHDPrivacy demonstrates robust performance, outperforming standard FL frameworks (including Federated Averaging (FedAvg), Federated Stochastic Gradient Descent (FedSGD), Federated Proximal (FedProx), Federated Normalized Averaging (FedNova), and Federated Adam (FedAdam)) by up to 38%. FedHDPrivacy also shows potential for future enhancements, such as multimodal data fusion.
https://arxiv.org/abs/2411.01140
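The "track cumulative noise, add only the increment" idea in the FedHDPrivacy abstract can be sketched under a strong simplifying assumption: noise from earlier rounds stays in the model as independent zero-mean Gaussian noise, so variances add, and each round states a required total variance. The class `NoiseBudget`, the function `privatize`, and this variance model are hypothetical and are not taken from the paper.

```python
import math
import random
from typing import List


class NoiseBudget:
    """Tracks how much Gaussian noise variance has already been injected."""

    def __init__(self) -> None:
        self.accumulated_var = 0.0

    def incremental_std(self, required_var: float) -> float:
        """Standard deviation of the extra noise needed this round."""
        # Only the shortfall has to be added; earlier noise still counts.
        extra = max(0.0, required_var - self.accumulated_var)
        self.accumulated_var += extra
        return math.sqrt(extra)


def privatize(
    update: List[float],
    required_var: float,
    budget: NoiseBudget,
    rng: random.Random,
) -> List[float]:
    """Add just enough Gaussian noise to reach the required variance."""
    std = budget.incremental_std(required_var)
    return [u + rng.gauss(0.0, std) for u in update]
```

Compared with drawing fresh full-strength noise every round, this keeps the total injected variance at the largest requirement seen so far rather than the sum over all rounds.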
End-to-end visual information extraction (VIE) aims at integrating the hierarchical subtasks of VIE, including text spotting, word grouping, and entity labeling, into a unified framework. Dealing with the gaps among the three subtasks plays a pivotal role in designing an effective VIE model. OCR-dependent methods heavily rely on offline OCR engines and inevitably suffer from OCR errors, while OCR-free methods, particularly those employing a black-box model, might produce outputs that lack interpretability or contain hallucinated content. Inspired by CenterNet, DeepSolo, and ESP, we propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task. Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities. Furthermore, we devise corresponding hierarchical pre-training strategies, categorized as image reconstruction, layout learning, and language enhancement, to reinforce the cross-modality representation of the hierarchical encoders. Quantitative experiments on public benchmarks demonstrate that HIP outperforms previous state-of-the-art methods, while qualitative results show its excellent interpretability.
https://arxiv.org/abs/2411.01139
We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks including both understanding and generative tasks. We specifically evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results reveal that the features extracted from foundation models provide valuable enhancements in training downstream task models. This highlights the capability of using features extracted from music foundation models as a booster for downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.
https://arxiv.org/abs/2411.01135
In real-world applications, as data availability increases, obtaining labeled data for machine learning (ML) projects remains challenging due to the high costs and intensive efforts required for data annotation. Many ML projects, particularly those focused on multi-label classification, also grapple with data imbalance issues, where certain classes may lack sufficient data to train effective classifiers. This study introduces and examines a novel oversampling method for multi-label text classification, designed to address performance challenges associated with data imbalance. The proposed method identifies potential new samples from unlabeled data by leveraging similarity measures between instances. By iteratively searching the unlabeled dataset, the method locates instances similar to those in underrepresented classes and evaluates their contribution to classifier performance enhancement. Instances that demonstrate performance improvement are then added to the labeled dataset. Experimental results indicate that the proposed approach effectively enhances classifier performance post-oversampling.
https://arxiv.org/abs/2411.01013
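The oversampling loop described in this abstract (find unlabeled instances similar to an underrepresented class, keep those that improve the classifier) can be sketched as below. Cosine similarity, the `min_similarity` threshold, and the caller-supplied `evaluate` callback standing in for classifier training and validation are all assumptions; the abstract does not specify the similarity measure or the evaluation protocol.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def oversample(
    minority: List[Vector],
    unlabeled: List[Vector],
    evaluate: Callable[[List[Vector]], float],
    min_similarity: float = 0.8,
) -> List[Vector]:
    """Grow a non-empty minority class with similar unlabeled instances."""
    selected = list(minority)
    best_score = evaluate(selected)
    for candidate in unlabeled:
        # Skip candidates that do not resemble any known minority example.
        if max(cosine(candidate, m) for m in minority) < min_similarity:
            continue
        trial = selected + [candidate]
        score = evaluate(trial)
        if score > best_score:  # keep only candidates that help
            selected, best_score = trial, score
    return selected
```

In practice `evaluate` would retrain or fine-tune the classifier with the trial set and return a validation metric, which makes each accepted candidate comparatively expensive to test.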
In today's networked world, Digital Twin Networks (DTNs) are revolutionizing how we understand and optimize physical networks. DTNs, also known as 'Networks Digital Twins (NDTs),' encompass many physical networks, from cellular and wireless to optical and satellite. They leverage computational power and AI capabilities to provide virtual representations, leading to highly refined recommendations for real-world network challenges. Within DTNs, tasks include network performance enhancement, latency optimization, energy efficiency, and more. To achieve these goals, DTNs utilize AI tools such as Machine Learning (ML), Deep Learning (DL), Reinforcement Learning (RL), Federated Learning (FL), and graph-based approaches. However, data quality, scalability, interpretability, and security challenges necessitate strategies prioritizing transparency, fairness, privacy, and accountability. This chapter examines AI-driven traffic analysis within DTNs. It explores DTNs' development efforts, tasks, AI models, and challenges while offering insights into how AI can enhance these dynamic networks. Readers will gain a deeper understanding of the pivotal role AI plays in the ever-evolving landscape of networked systems.
https://arxiv.org/abs/2411.00681