While significant progress has been made on the text-to-SQL task, recent solutions repeatedly encode the same database schema for every question, resulting in unnecessarily high inference cost and often overlooking crucial database knowledge. To address these issues, we propose You Only Read Once (YORO), a novel paradigm that directly internalizes database knowledge into the parametric knowledge of a text-to-SQL model during training and eliminates the need for schema encoding during inference. YORO significantly reduces the input token length by 66%-98%. Despite its shorter inputs, our empirical results demonstrate that YORO performs competitively with traditional systems on three benchmarks and significantly outperforms them on large databases. Furthermore, YORO excels at handling questions with challenging value retrievals, such as abbreviations.
尽管在文本到关系数据库任务上已经取得了显著的进展,但最近提出的解决方案在每一个问题上都编码了相同的数据库模式,导致不必要的推理成本过高,并且经常忽视关键数据库知识。为了应对这些问题,我们提出了You Only Read Once(YORO)这一新范式,它将数据库知识直接内化于文本到关系模型在训练过程中的参数化知识中,从而在推理过程中消除了模式编码的需求。YORO显著地将输入词长降低了66%-98%。 尽管YORO的输入词长较短,但我们的实证结果表明,与传统系统相比,YORO在三个基准测试上的竞争性能非常强,同时在大型数据库上的表现也非常显著。此外,YORO在处理具有挑战性的值检索问题(如缩写)时表现出色。
https://arxiv.org/abs/2409.12172
Lesion segmentation in PET/CT imaging is essential for precise tumor characterization, which supports personalized treatment planning and enhances diagnostic precision in oncology. However, accurate manual segmentation of lesions is time-consuming and prone to inter-observer variability. Given the rising demand and clinical use of PET/CT, automated segmentation methods, particularly deep-learning-based approaches, have become increasingly relevant. The autoPET III Challenge focuses on advancing automated segmentation of tumor lesions in PET/CT images in a multitracer multicenter setting, addressing the clinical need for quantitative, robust, and generalizable solutions. Building on previous challenges, the third iteration of the autoPET challenge introduces a more diverse dataset featuring two different tracers (FDG and PSMA) from two clinical centers. To this end, we developed a classifier that identifies the tracer of the given PET/CT based on the Maximum Intensity Projection of the PET scan. We trained two individual nnUNet-ensembles for each tracer where anatomical labels are included as a multi-label task to enhance the model's performance. Our final submission achieves cross-validation Dice scores of 76.90% and 61.33% for the publicly available FDG and PSMA datasets, respectively. The code is available at this https URL.
在PET/CT成像中,病变分割对于精确肿瘤特征描述至关重要,这支持个性化治疗规划和提高癌症诊断精度。然而,准确的手动分割病变是耗时的且容易受到操作者变异性影响。随着PET/CT临床应用的需求和需求的增加,自动分割方法,特别是基于深度学习的 approaches,变得越来越相关。 自动PET III挑战专注于在多发射器多中心环境下推进PET/CT图像肿瘤病变的自动分割,解决临床需要高质量的、可靠的和可扩展的解决方案。在以前挑战的基础上,第三个自动PET挑战引入了一个包含两个不同示踪剂(FDG和PSMA)的更 diverse数据集。为此,我们开发了一个分类器,根据PET扫描的最大强度投影来确定给定PET/CT的示踪剂。我们为每个示踪剂训练两个单阶段nnUNet集成,将解剖标签作为一个多标签任务来提高模型的性能。 我们的最终提交在公开可用的FDG和PSMA数据集上的交叉验证Dice分数分别为76.90%和61.33%。代码可在此处访问:https://this URL.
https://arxiv.org/abs/2409.12155
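The autoPET III entry above relies on two standard quantities: the Maximum Intensity Projection (MIP) of the PET volume, which feeds the tracer classifier, and the Dice score used for evaluation. The following is a minimal pure-Python sketch of both, not the authors' code. The function names, the nested-list volume layout `[z][y][x]`, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration.

```python
def max_intensity_projection(volume, axis=0):
    """Collapse a 3D volume (nested lists indexed [z][y][x]) to a 2D image
    by keeping the maximum value along the chosen axis."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    if axis == 0:  # project along z, result indexed [y][x]
        return [[max(volume[z][y][x] for z in range(nz)) for x in range(nx)]
                for y in range(ny)]
    if axis == 1:  # project along y, result indexed [z][x]
        return [[max(volume[z][y][x] for y in range(ny)) for x in range(nx)]
                for z in range(nz)]
    if axis == 2:  # project along x, result indexed [z][y]
        return [[max(volume[z][y][x] for x in range(nx)) for y in range(ny)]
                for z in range(nz)]
    raise ValueError("axis must be 0, 1, or 2")


def dice_score(pred, target):
    """Dice = 2 * |P and T| / (|P| + |T|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy 2x2x2 volume and toy masks, purely illustrative values.
toy_volume = [[[1, 5], [2, 0]],
              [[3, 4], [7, 1]]]
toy_mip = max_intensity_projection(toy_volume, axis=0)
toy_dice = dice_score([1, 1, 0, 0], [1, 0, 0, 0])
```

In practice the same operations are usually done with array libraries on full-resolution scans; the nested-list form is used here only to keep the sketch dependency-free.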
We introduce MoRAG, a novel multi-part fusion based retrieval-augmented generation strategy for text-based human motion generation. The method enhances motion diffusion models by leveraging additional knowledge obtained through an improved motion retrieval process. By effectively prompting large language models (LLMs), we address spelling errors and rephrasing issues in motion retrieval. Our approach utilizes a multi-part retrieval strategy to improve the generalizability of motion retrieval across the language space. We create diverse samples through the spatial composition of the retrieved motions. Furthermore, by utilizing low-level, part-specific motion information, we can construct motion samples for unseen text descriptions. Our experiments demonstrate that our framework can serve as a plug-and-play module, improving the performance of motion diffusion models. Code, pretrained models and sample videos will be made available at: this https URL
我们介绍了一种名为MoRAG的多部分融合基于检索增强生成策略,用于基于文本的人体运动生成。该方法通过改进运动检索过程来获得额外的知识,从而增强运动扩散模型。通过有效地提示大型语言模型(LLMs),我们解决了运动检索中的拼写错误和重新表述问题。我们的方法利用多部分检索策略来提高运动检索在语言空间中的泛化能力。我们通过检索运动中空间组成的多样性来创建各种样本。此外,通过利用低级别的、部分特定的运动信息,我们可以为未知的文本描述构建运动样本。我们的实验证明,我们的框架可以作为一个插件式模块,提高运动扩散模型的性能。代码、预训练模型和样本视频将在此处公布:https://这个链接。
https://arxiv.org/abs/2409.12140
We propose a new benchmark to measure a language model's linguistic reasoning skills without relying on pre-existing language-specific knowledge. The test covers 894 questions grouped in 160 problems across 75 (mostly) extremely low-resource languages, extracted from the International Linguistic Olympiad corpus. To attain high accuracy on this benchmark, models don't need previous knowledge of the tested language, as all the information needed to solve the linguistic puzzle is presented in the context. We find that, while all analyzed models rank below 25% accuracy, there is a significant gap between open and closed models, with the best-performing proprietary model at 24.05% and the best-performing open model at 8.84%.
我们提出了一个新的基准来衡量语言模型的语言推理能力,而不依赖于先前的语言特定知识。测试包括75种主要来自国际语言学奥林匹克竞赛语料库的低资源语言中的894个问题。要在这个基准上获得高准确度,模型不需要先前的语言知识,因为解决语言谜题所需的所有信息都在上下文中提供。我们发现,虽然所有分析模型都排名低于25%,但开箱即用的模型和封闭模型之间存在显著的差距,开箱即用的最佳模型在24.05%的准确度,而最佳开放模型在8.84%的准确度。
https://arxiv.org/abs/2409.12126
With the ever-growing complexity of models in the field of remote sensing (RS), there is an increasing demand for solutions that balance model accuracy with computational efficiency. Knowledge distillation (KD) has emerged as a powerful tool to meet this need, enabling the transfer of knowledge from large, complex models to smaller, more efficient ones without significant loss in performance. This review article provides an extensive examination of KD and its innovative applications in RS. KD, a technique developed to transfer knowledge from a complex, often cumbersome model (teacher) to a more compact and efficient model (student), has seen significant evolution and application across various domains. Initially, we introduce the fundamental concepts and historical progression of KD methods. The advantages of employing KD are highlighted, particularly in terms of model compression, enhanced computational efficiency, and improved performance, which are pivotal for practical deployments in RS scenarios. The article provides a comprehensive taxonomy of KD techniques, where each category is critically analyzed to demonstrate the breadth and depth of the alternative options, and illustrates specific case studies that showcase the practical implementation of KD methods in RS tasks, such as instance segmentation and object detection. Further, the review discusses the challenges and limitations of KD in RS, including practical constraints and prospective future directions, providing a comprehensive overview for researchers and practitioners in the field of RS. Through this organization, the paper not only elucidates the current state of research in KD but also sets the stage for future research opportunities, thereby contributing significantly to both academic research and real-world applications.
随着遥感领域(RS)模型日益复杂,人们越来越需要兼顾模型准确性与计算效率的解决方案。知识蒸馏(KD)作为一种强大的工具,满足了这个需求,使大型、复杂模型的知识可以转移到更小、更高效的模型,而不会牺牲太多性能。本文回顾了KD及其在RS领域的创新应用。 首先,我们介绍了KD方法的基本概念和历史演变。KD的优点,特别是在模型压缩、增强计算效率和提高性能方面,得到了突出强调,这些优势对RS场景的实用部署至关重要。 文章全面梳理了KD技术,对每个类别进行了深入分析,以展示其广度和深度,并举例展示了KD方法在RS任务(如实例分割和目标检测)的实际应用。 此外,本文讨论了KD在RS中的挑战和局限性,包括实际限制和未来的研究方向,为RS领域的研究人员和实践者提供了全面概述。 通过这种组织,本文不仅阐明了KD在RS领域的现状,还为未来的研究机会奠定了基础,从而对学术研究和实际应用都做出了重要贡献。
https://arxiv.org/abs/2409.12111
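The teacher-to-student transfer described in the knowledge distillation review above is often implemented with a temperature-softened KL divergence between teacher and student output distributions. The sketch below shows that common formulation. It is one illustrative variant, not a method taken from the review, and the function names and the default temperature of 2.0 are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits at a given temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by temperature**2 as is common in soft-target distillation."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return kl * temperature ** 2


# Illustrative logits for a 3-class problem (made-up values).
teacher_logits = [4.0, 1.0, 0.5]
loss_same = distillation_loss(teacher_logits, teacher_logits)
loss_diff = distillation_loss([0.5, 1.0, 4.0], teacher_logits)
```

A higher temperature flattens both distributions, so the student also learns from the relative probabilities the teacher assigns to incorrect classes.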
Understanding how humans process visual information is one of the crucial steps for unraveling the underlying mechanism of brain activity. Recently, this curiosity has motivated the fMRI-to-image reconstruction task; given the fMRI data from visual stimuli, it aims to reconstruct the corresponding visual stimuli. Surprisingly, leveraging powerful generative models such as the Latent Diffusion Model (LDM) has shown promising results in reconstructing complex visual stimuli such as high-resolution natural images from vision datasets. Despite the impressive structural fidelity of these reconstructions, they often lack details of small objects, ambiguous shapes, and semantic nuances. Consequently, the incorporation of additional semantic knowledge, beyond mere visuals, becomes imperative. In light of this, we exploit how modern LDMs effectively incorporate multi-modal guidance (text guidance, visual guidance, and image layout) for structurally and semantically plausible image generations. Specifically, inspired by the two-streams hypothesis suggesting that perceptual and semantic information are processed in different brain regions, our framework, Brain-Streams, maps fMRI signals from these brain regions to appropriate embeddings. That is, by extracting textual guidance from semantic information regions and visual guidance from perceptual information regions, Brain-Streams provides accurate multi-modal guidance to LDMs. We validate the reconstruction ability of Brain-Streams both quantitatively and qualitatively on a real fMRI dataset comprising natural image stimuli and fMRI data.
理解人类如何处理视觉信息是揭开大脑活动背后的关键步骤之一。最近,这种好奇心激励了fMRI到图像重构任务;它旨在根据视觉刺激重建相应的视觉刺激。令人惊讶的是,利用像Latent Diffusion Model(LDM)这样的强大生成模型在重构视觉数据集中的复杂视觉刺激取得了很好的效果。尽管这些重构具有令人印象深刻的结构保真度,但它们通常缺乏小物件的细节、模糊的形状和语义细微差别。因此,引入额外的语义知识(不仅仅是视觉信息)变得至关重要。 鉴于这一点,我们研究了现代LDMs如何有效结合多模态指导(文本指导、视觉指导和图像布局)进行结构化和语义性图像生成。具体来说,受到两个流假设的启发,即感知和语义信息在不同的脑区处理,我们的Brain-Streams框架将fMRI信号从这些脑区映射到适当的嵌入。这意味着通过从语义信息区域提取文本指导,从感知信息区域提取视觉指导,Brain-Streams为LDMs提供准确的多模态指导。 我们在包括自然图像刺激和fMRI数据的实际fMRI数据集上验证了Brain-Streams的重建能力。我们既以数量方式验证了它的重建效果,也以质量方式验证了它的重建效果。
https://arxiv.org/abs/2409.12099
This paper presents a general refractive camera model and online co-estimation of odometry and the refractive index of unknown media. This enables operation in diverse and varying refractive fluids, given only the camera calibration in air. The refractive index is estimated online as a state variable of a monocular visual-inertial odometry framework in an iterative formulation using the proposed camera model. The method was verified on data collected using an underwater robot traversing inside a pool. The evaluations demonstrate convergence to the ideal refractive index for water despite significant perturbations in the initialization. Simultaneously, the approach enables on-par visual-inertial odometry performance in refractive media without prior knowledge of the refractive index or requirement of medium-specific camera calibration.
本文提出了一种通用的反射相机模型以及基于在线协同估计未知介质中的相位差和折射率的方法。这使得仅基于空气中的相机校准可以在各种不同的折射流体中操作。通过使用所提出的相机模型,将相位差估计为单目视觉惯性导航框架中的状态变量,实现基于在线迭代公式。该方法在用潜水机器人穿越池内的数据上进行了验证。评估结果表明,尽管在初始化过程中存在很大的扰动,但该方法在理想折射指数上收敛。同时,该方法可以在无需事先知道折射率或对介质特定相机校准的情况下实现等视觉惯性导航性能。
https://arxiv.org/abs/2409.12074
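The refractive camera entry above estimates the refractive index of the medium online. As background only, the physical relation any such model builds on is Snell's law, which can be written in vector form to bend a viewing ray at a flat interface. The sketch below is not the paper's camera model. The function name `refract`, the flat-interface setup, and the index 1.333 for water are assumptions for illustration.

```python
import math


def refract(direction, normal, n1, n2):
    """Refract a unit ray `direction` at an interface with unit `normal`
    (pointing against the incoming ray), going from index n1 to index n2.
    Returns the refracted unit direction, or None on total internal reflection."""
    eta = n1 / n2
    cos_i = -sum(d * n for d, n in zip(direction, normal))
    k = 1.0 - eta * eta * (1.0 - cos_i * cos_i)
    if k < 0.0:
        return None  # total internal reflection, no transmitted ray
    scale = eta * cos_i - math.sqrt(k)
    return tuple(eta * d + scale * n for d, n in zip(direction, normal))


# Ray travelling along +z through air into water, hitting the interface at 45 degrees.
s = math.sqrt(0.5)
bent = refract((s, 0.0, s), (0.0, 0.0, -1.0), 1.0, 1.333)
straight = refract((0.0, 0.0, 1.0), (0.0, 0.0, -1.0), 1.0, 1.333)
```

The tangential component of the refracted ray equals `(n1 / n2) * sin(theta_i)`, which is Snell's law; a camera calibrated in air would see pixel rays bent this way once the medium changes.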
Safety is one of the key issues preventing the deployment of reinforcement learning techniques in real-world robots. While most approaches in the Safe Reinforcement Learning area do not require prior knowledge of constraints and robot kinematics and rely solely on data, it is often difficult to deploy them in complex real-world settings. Instead, model-based approaches that incorporate prior knowledge of the constraints and dynamics into the learning framework have proven capable of deploying the learning algorithm directly on the real robot. Unfortunately, while an approximated model of the robot dynamics is often available, the safety constraints are task-specific and hard to obtain: they may be too complicated to encode analytically, too expensive to compute, or it may be difficult to envision a priori the long-term safety requirements. In this paper, we bridge this gap by extending the safe exploration method, ATACOM, with learnable constraints, with a particular focus on ensuring long-term safety and handling of uncertainty. Our approach is competitive or superior to state-of-the-art methods in final performance while maintaining safer behavior during training.
安全是阻止将强化学习技术应用于现实机器人中的关键问题之一。虽然安全强化学习领域的大多数方法都没有要求先验知识约束和机器人运动学,并且仅依赖数据,但通常很难将它们应用于复杂的现实环境。相反,基于模型的方法已经证明,将约束和动态知识引入学习框架可以使学习算法直接部署到现实机器人上。然而,不幸的是,虽然通常可以获得机器人动态的近似模型,但安全约束是任务特定的,很难获得:它们可能过于复杂,无法通过分析来编码,或者难以在训练过程中明确地想象远期安全性需求。在本文中,我们通过扩展安全探索方法ATACOM,引入可学习约束,特别关注确保长期安全和处理不确定性。我们的方法在最终表现上与最先进的Methods相当或更优,同时保持训练过程中的安全性行为。
https://arxiv.org/abs/2409.12045
In the rapidly evolving field of machine learning, training models with datasets from various locations and organizations presents significant challenges due to privacy and legal concerns. The exploration of effective collaborative training settings capable of leveraging valuable knowledge from distributed and isolated datasets is increasingly crucial. This study investigates key factors that impact the effectiveness of collaborative training methods in code next-token prediction, as well as the correctness and utility of the generated code, demonstrating the promise of such methods. Additionally, we evaluate the memorization of different participant training data across various collaborative training settings, including centralized, federated, and incremental training, highlighting their potential risks in leaking data. Our findings indicate that the size and diversity of code datasets are pivotal factors influencing the success of collaboratively trained code models. We show that federated learning achieves competitive performance compared to centralized training while offering better data protection, as evidenced by lower memorization ratios in the generated code. However, federated learning can still produce verbatim code snippets from hidden training data, potentially violating privacy or copyright. Our study further explores effectiveness and memorization patterns in incremental learning, emphasizing the sequence in which individual participant datasets are introduced. We also identify cross-organizational clones as a prevalent challenge in both centralized and federated learning scenarios. Our findings highlight the persistent risk of data leakage during inference, even when training data remains unseen. We conclude with recommendations for practitioners and researchers to optimize multisource datasets, propelling cross-organizational collaboration forward.
在快速发展的机器学习领域,使用各种地点和组织的数据集训练模型存在显着的安全和隐私问题。探索能够利用分布式和孤立数据集的有效合作训练设置越来越重要。本研究调查了影响协作训练方法在代码预测下一个词的有效性的关键因素,以及生成的代码的正确性和可用性,证明了这些方法的优势。此外,我们评估了各种协作训练设置中不同参与者训练数据的学习记忆,包括集中式、分布式和增量式训练,突出泄露数据的风险。我们的研究结果表明,代码数据集的大小和多样性是影响协作训练模型成功的关键因素。我们证明了分布式学习在保持竞争力的性能同时提供更好的数据保护方面比集中式训练更有效,正如生成的代码中较低的存储比所表明的。然而,分布式学习仍然可能从隐藏的训练数据中产生等效的代码片段,这可能导致隐私或版权问题。我们的研究进一步研究了增量学习中的效果和记忆模式,强调了在引入个人参与者数据时序列的重要性。我们也指出了集中式和分布式学习场景中跨组织克隆的普遍挑战。我们的研究结果表明,在推理过程中数据泄露的风险持续存在,即使训练数据未被看到。我们得出结论,对于实践者和研究人员,优化多源数据集将推动跨组织合作向前发展。
https://arxiv.org/abs/2409.12020
As powerful pre-trained vision-language models (VLMs) like CLIP gain prominence, numerous studies have attempted to adapt VLMs to downstream tasks. Among these, prompt learning has been validated as an effective method for adapting to new tasks, requiring only a small number of parameters. However, current prompt learning methods face two challenges: first, a single soft prompt struggles to capture the diverse styles and patterns within a dataset; second, fine-tuning soft prompts is prone to overfitting. To address these challenges, we propose a mixture-of-soft-prompts learning method incorporating a routing module. This module captures a dataset's varied styles and dynamically selects the most suitable prompts for each instance. Additionally, we introduce a novel gating mechanism to ensure the router selects prompts based on their similarity to hard prompt templates, both retaining knowledge from hard prompts and improving selection accuracy. We also implement semantically grouped text-level supervision, initializing each soft prompt with the token embeddings of manually designed templates from its group and applying a contrastive loss between the resulting text feature and the hard-prompt-encoded text feature. This supervision ensures that the text features derived from soft prompts remain close to those from their corresponding hard prompts, preserving initial knowledge and mitigating overfitting. Our method has been validated on 11 datasets, demonstrating evident improvements in few-shot learning, domain generalization, and base-to-new generalization scenarios compared to existing baselines. The code will be available at https://anonymous.4open.science/r/mocoop-6387
随着像CLIP这样的强大预训练视觉语言模型(VLMs)越来越受到关注,许多研究试图将VLMs用于下游任务。在这些研究中,提示学习被证明是一种有效的适应新任务的方法,只需几个参数。然而,当前的提示学习方法面临两个挑战:首先,单个软提示很难捕捉数据集中多样化的风格和模式;其次,微调软提示很容易过拟合。为了应对这些挑战,我们提出了一个结合软提示学习的方法,包括一个路由模块。这个模块能够捕捉数据集的多样风格,动态地为每个实例选择最合适的提示。此外,我们还引入了一种新颖的门控机制,确保路由器根据提示的相似性选择提示,这保留来自硬提示的知识,并提高了选择精度。我们还在每个软提示上应用了语义分组文本级监督,为每个组分配一个自定义模板的token嵌入,并在结果文本特征和硬提示编码文本特征之间应用对比损失。这种监督确保了从软提示中提取的文本特征接近其对应硬提示,保留初始知识并减轻过拟合。我们的方法在11个数据集上的验证表明,与现有基线相比,在几 shot学习、领域泛化和新基点映射场景中取得了显著的改进。代码将在\url{https://anonymous.4open.science/r/mocoop-6387}中提供。
https://arxiv.org/abs/2409.12011
In this technical report, we describe the SNTL-NTU team's submission for Task 1, Data-Efficient Low-Complexity Acoustic Scene Classification, of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For small training splits, we explored reducing the complexity of the provided baseline model by reducing the number of base channels. We introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained on the original sampling rate of 44.1 kHz. We use Knowledge Distillation to distill the ensemble model to the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded the highest average testing accuracies of (62.21, 59.82, 56.81, 53.03, 47.97)% on the (100, 50, 25, 10, 5)% splits, respectively, over the three systems.
在这份技术报告中,我们描述了SNTL-NTU团队为2024年语音识别与分类任务(DCASE)提交的任务1:数据高效的低复杂度音频场景分类。我们引入了三种系统来处理不同大小的训练集。对于小训练集,我们通过减少提供的基线模型的基通道复杂度来降低模型的复杂度。我们引入了数据增强的形式为mixup,以增加训练样本的多样性。对于较大的训练集,我们使用FocusNet来向由多个Patchout faST Spectrogram Transformer(PaSST)模型和基于原始采样率44.1 kHz的基准模型组成的集成模型提供混乱的分类信息。我们使用知识蒸馏将集成模型分解为基线学生模型。在TAU urban acoustic scene 2022移动开发数据集上训练系统,在划分(100, 50, 25, 10, 5)%的测试准确率上取得了最高平均值(62.21, 59.82, 56.81, 53.03, 47.97)。
https://arxiv.org/abs/2409.11964
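The DCASE entry above uses mixup as data augmentation. Mixup forms a new training sample as a convex combination of two samples and their labels, with the mixing weight drawn from a Beta distribution. The sketch below shows that standard recipe in pure Python. It is not the team's implementation, and the function name, the `alpha=0.3` default, and the toy feature vectors are assumptions.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.3, rng=random):
    """Blend two (features, one-hot label) pairs with lam ~ Beta(alpha, alpha).
    Returns the mixed features, the mixed soft label, and lam."""
    lam = rng.betavariate(alpha, alpha)
    mixed_x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    mixed_y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return mixed_x, mixed_y, lam


# Two made-up feature vectors with one-hot labels for a 3-class task.
rng = random.Random(0)  # seeded so the sketch is reproducible
x_a, y_a = [0.2, 0.8, 0.5], [1.0, 0.0, 0.0]
x_b, y_b = [0.9, 0.1, 0.4], [0.0, 0.0, 1.0]
mixed_x, mixed_y, lam = mixup(x_a, y_a, x_b, y_b, alpha=0.3, rng=rng)
```

With a small `alpha`, most draws of `lam` fall near 0 or 1, so mixed samples stay close to one of the originals; larger values blend the pair more evenly.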
In this paper, we address the challenge of generating realistic 3D human motions for action classes that were never seen during the training phase. Our approach involves decomposing complex actions into simpler movements, specifically those observed during training, by leveraging the knowledge of human motion contained in GPT models. These simpler movements are then combined into a single, realistic animation using the properties of diffusion models. Our claim is that this decomposition and subsequent recombination of simple movements can synthesize an animation that accurately represents the complex input action. This method operates during the inference phase and can be integrated with any pre-trained diffusion model, enabling the synthesis of motion classes not present in the training data. We evaluate our method by dividing two benchmark human motion datasets into basic and complex actions, and then compare its performance against the state of the art.
在本文中,我们解决了在训练阶段从未见过的动作类中生成逼真的3D人体运动的问题。我们的方法利用GPT模型的知识,通过分解复杂动作为训练期间观察到的更简单的动作,具体这些动作,然后将这些简单的动作合并成一个真实的动画,利用扩散模型的特性。我们的主张是,这种分解和后续简单动作的重新组合可以合成准确地表示复杂输入动作的动画。这种方法在推理阶段运行,可以与任何预训练的扩散模型集成,从而合成训练数据中没有的动量类别。我们通过将两个基准的人体运动数据集分为基本和复杂动作,然后与最先进的水平进行比较,来评估我们的方法。
https://arxiv.org/abs/2409.11920
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400 and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$.
状态空间模型(SSMs)通过将状态空间技术集成到深度学习中,引入了一种新颖的上下文建模方法。然而,由于它们的数据无关矩阵,它们在全局上下文建模方面遇到困难。Mamba模型通过S6选择性扫描算法提供了数据相关的变体,通过增强上下文建模,特别是对于长序列,解决了这个问题。然而,基于Mamba的架构在参数数量上很难扩展,这是视觉应用的主要限制。本文解决了大型SSM在图像分类和动作识别方面的可扩展性问题,而不需要使用诸如知识蒸馏等技术。我们分析了Mamba基和注意力基模型,提出了一个Mamba-Attention并行架构,通过增强可扩展性、鲁棒性和性能来解决SSM的问题。我们证明了具有稳定和高效并行架构的Mamba基架构可以解决图像和视频的扩展性问题,并增加了对常见伪影像压缩等伪影的鲁棒性。我们对ImageNet-1K、Kinetics-400和Something-Something-v2基准进行的深入评估表明,通过我们提出的方法,可以提高最先进的Mamba基架构的准确度最高达+1.7。
https://arxiv.org/abs/2409.11867
The recent development of deep learning large models in medicine shows remarkable performance in medical image analysis and diagnosis, but their large number of parameters causes memory and inference latency challenges. Knowledge distillation offers a solution, but the slide-level gradients cannot be backpropagated for student model updates due to high-resolution pathological images and slide-level labels. This study presents an Efficient Fine-tuning on Compressed Models (EFCM) framework with two stages: unsupervised feature distillation and fine-tuning. In the distillation stage, Feature Projection Distillation (FPD) is proposed with a TransScan module for adaptive receptive field adjustment to enhance the knowledge absorption capability of the student model. In the slide-level fine-tuning stage, three strategies (Reuse CLAM, Retrain CLAM, and End2end Train CLAM (ETC)) are compared. Experiments are conducted on 11 downstream datasets related to three large medical models: RETFound for retina, MRM for chest X-ray, and BROW for histopathology. The experimental results demonstrate that the EFCM framework significantly improves accuracy and efficiency in handling slide-level pathological image problems, effectively addressing the challenges of deploying large medical models. Specifically, it achieves a 4.33% increase in ACC and a 5.2% increase in AUC compared to the large model BROW on the TCGA-NSCLC and TCGA-BRCA datasets. The analysis of model inference efficiency highlights the high efficiency of the distillation fine-tuning method.
近年来,在医学领域中,深度学习大型模型的开发在医学图像分析和诊断方面表现出显著的性能,但它们具有大量的参数,导致记忆和推理延迟。知识蒸馏提供了解决方案,但由于高分辨率病理图像和层级的标签,学生模型的更新无法通过级联梯度进行反向传播。这项研究介绍了一种高效的可压缩模型(EFCM)框架,包括两个阶段:无监督特征蒸馏和微调。在蒸馏阶段,提出了使用TransScan模块的Feature Projection Distillation(FPD)策略,以自适应地调整学生模型的感官场以增强知识吸收能力。在微调阶段,比较了三种策略(重用CLAM,重置CLAM和端到端训练CLAM(ETC))。实验在三个大型医学模型相关的11个下游数据集上进行,包括视网膜、胸部X光片和病理学。实验结果表明,EFCM框架在处理层级的病理图像问题方面显著提高了准确性和效率,有效解决了部署大型医疗模型的挑战。具体来说,它比大型模型BROW在TCGA-NSCLC和TCGA-BRCA数据集上实现了4.33%的ACC和5.2%的AUC的提高。对模型推理效率的分析强调了蒸馏微调方法的效率。
https://arxiv.org/abs/2409.11817
The landscape of Deep Learning has experienced a major shift with the pervasive adoption of Transformer-based architectures, particularly in Natural Language Processing (NLP). Novel avenues for physical applications, such as solving Partial Differential Equations and Image Vision, have been explored. However, in challenging domains like robotics, where high non-linearity poses significant challenges, Transformer-based applications are scarce. While Transformers have been used to provide robots with knowledge about high-level tasks, few efforts have been made to perform system identification. This paper proposes a novel methodology to learn a meta-dynamical model of a high-dimensional physical system, such as the Franka robotic arm, using a Transformer-based architecture without prior knowledge of the system's physical parameters. The objective is to predict quantities of interest (end-effector pose and joint positions) given the torque signals for each joint. This prediction can be useful as a component for Deep Model Predictive Control frameworks in robotics. The meta-model establishes the correlation between torques and positions and predicts the output for the complete trajectory. This work provides empirical evidence of the efficacy of the in-context learning paradigm, suggesting future improvements in learning the dynamics of robotic systems without explicit knowledge of physical parameters. Code, videos, and supplementary materials can be found at the project website. See this https URL
深度学习的应用领域经历了一次重大转折,特别是自然语言处理(NLP)领域,基于Transformer的架构普遍采用。还探索了一些新颖的物理应用途径,如求解偏微分方程和图像视觉。然而,在具有巨大非线性度的挑战领域(例如机器人领域)中,基于Transformer的应用很少。虽然Transformer已经用于为机器人提供关于高级任务的知識,但很少努力用于系统识别。本文提出了一种新的方法来学习高维物理系统的元动力学模型,例如Franka机器人手臂,使用没有系统物理参数先前知识的情况下基于Transformer的架构。目标是为每个关节的扭矩信号预测感兴趣量(末端执行器姿态和关节位置)。这个预测可以作为机器人领域Deep Model预测控制框架的组件。元模型建立了扭矩和位置之间的关联,预测完整的轨迹输出。这项工作提供了在上下文学习范式下学习机器人系统有效性的实证证据,表明在缺乏明确物理参数的情况下,未来可以改进学习机器人系统的动态。代码,视频和补充材料可以在项目网站上找到。请点击这个链接
https://arxiv.org/abs/2409.11815
This paper investigates the factuality of large language models (LLMs) as knowledge bases in the legal domain, in a realistic usage scenario: we allow for acceptable variations in the answer, and let the model abstain from answering when uncertain. First, we design a dataset of diverse factual questions about case law and legislation. We then use the dataset to evaluate several LLMs under different evaluation methods, including exact, alias, and fuzzy matching. Our results show that the performance improves significantly under the alias and fuzzy matching methods. Further, we explore the impact of abstaining and in-context examples, finding that both strategies enhance precision. Finally, we demonstrate that additional pre-training on legal documents, as seen with SaulLM, further improves factual precision from 63% to 81%.
本文研究了大型语言模型(LLMs)在法律领域作为知识库的可信度,在一个现实的使用场景中:我们允许答案的合理变化,并让模型在不确定时回避回答。首先,我们设计了一个关于案例法和立法事实的多样性事实问题的数据集。然后,我们使用这个数据集来评估几种LLM在不同评估方法下的表现,包括精确匹配、同义匹配和模糊匹配。我们的结果表明,在同义匹配和模糊匹配方法下,性能显著提高。此外,我们探讨了回避和上下文例子的影响,发现两种策略都有助于提高精度。最后,我们证明了如SaulLM所示,在法律文件上的预训练确实可以进一步提高事实精确性,从63%提高到81%。
https://arxiv.org/abs/2409.11798
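The legal factuality entry above scores answers under exact, alias, and fuzzy matching and lets the model abstain. The sketch below shows one way such an evaluation could be wired up with the standard library. It is not the paper's evaluation code. The record layout, the 0.8 similarity threshold, the use of `difflib.SequenceMatcher` for fuzzy matching, and the made-up case name are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def is_correct(prediction, gold, aliases=(), method="exact", threshold=0.8):
    """Check a prediction against the gold answer under one matching method."""
    if method not in ("exact", "alias", "fuzzy"):
        raise ValueError("method must be 'exact', 'alias', or 'fuzzy'")
    pred = normalize(prediction)
    candidates = [normalize(gold)]
    if method in ("alias", "fuzzy"):
        candidates += [normalize(a) for a in aliases]
    if method == "fuzzy":
        return any(SequenceMatcher(None, pred, c).ratio() >= threshold
                   for c in candidates)
    return pred in candidates


def precision_with_abstention(records, method):
    """Precision over answered questions only; a prediction of None is an abstention."""
    answered = [r for r in records if r["prediction"] is not None]
    if not answered:
        return 0.0
    correct = sum(is_correct(r["prediction"], r["gold"], r["aliases"], method)
                  for r in answered)
    return correct / len(answered)


# Hypothetical records; the case name is invented for illustration.
records = [
    {"prediction": "Smith v. Jones", "gold": "Smith v. Jones", "aliases": ["Smith v Jones"]},
    {"prediction": "smith v jones", "gold": "Smith v. Jones", "aliases": ["Smith v Jones"]},
    {"prediction": "Smith vs Jones", "gold": "Smith v. Jones", "aliases": ["Smith v Jones"]},
    {"prediction": None, "gold": "Smith v. Jones", "aliases": ["Smith v Jones"]},
]
scores = {m: precision_with_abstention(records, m) for m in ("exact", "alias", "fuzzy")}
```

Because abstentions are dropped from the denominator, a model that declines uncertain questions can raise its precision, which is the effect the abstract reports.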
Face recognition in the wild is now advancing towards light-weight models, fast inference speed and resolution-adapted capability. In this paper, we propose a bridge distillation approach to turn a complex face model pretrained on private high-resolution faces into a light-weight one for low-resolution face recognition. In our approach, such a cross-dataset resolution-adapted knowledge transfer problem is solved via two-step distillation. In the first step, we conduct cross-dataset distillation to transfer the prior knowledge from private high-resolution faces to public high-resolution faces and generate compact and discriminative features. In the second step, the resolution-adapted distillation is conducted to further transfer the prior knowledge to synthetic low-resolution faces via multi-task learning. By learning low-resolution face representations and mimicking the adapted high-resolution knowledge, a light-weight student model can be constructed with high efficiency and promising accuracy in recognizing low-resolution faces. Experimental results show that the student model performs impressively in recognizing low-resolution faces with only 0.21M parameters and 0.057MB memory. Meanwhile, its speed reaches up to 14,705, ~934 and 763 faces per second on GPU, CPU and mobile phone, respectively.
野外的面部识别现在正在朝着轻量级模型、快速推理速度和分辨率适应能力的发展方向前进。在本文中,我们提出了一种桥蒸馏方法,将私有的高分辨率面部预训练模型转化为低分辨率面部识别的轻量级模型。在我们的方法中,通过两步蒸馏解决了跨数据集分辨率适应知识传递问题。第一步,我们进行跨数据集蒸馏,将高分辨率私人面部上的先验知识传递到高分辨率公共面部,并生成紧凑且具有区分性的特征。第二步,通过多任务学习进一步将先验知识传递给合成低分辨率面部。通过学习低分辨率面部表示并模拟适应高分辨率知识,可以构建具有高效率和令人满意的准确性的轻量学生模型来识别低分辨率面部。实验结果表明,仅使用0.21M参数和0.057MB内存的高学生模型在识别低分辨率面部时表现出色。同时,其速度分别在GPU、CPU和移动手机上达到14,705、~934和763张/秒。
https://arxiv.org/abs/2409.11786
Deep trackers have proven successful in visual tracking. Typically, these trackers employ optimally pre-trained deep networks to represent all diverse objects with multi-channel features from some fixed layers. The deep networks employed are usually trained to extract rich knowledge from massive data used in object classification, and so they are capable of representing generic objects very well. However, these networks are too complex to represent a specific moving object, leading to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework termed channel distillation to facilitate deep trackers. To validate the effectiveness of channel distillation, we take the discriminative correlation filter (DCF) and ECO as examples. We demonstrate that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem to adaptively select informative feature channels that improve the efficacy of tracking moving objects on the fly. Channel distillation can accurately extract good channels, alleviating the influence of noisy channels and generally reducing the number of channels, as well as adaptively generalizing to different channels and networks. The resulting deep tracker is accurate, fast, and has low memory requirements. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.
深度跟踪器已经在视觉跟踪方面取得了成功。通常,这些跟踪器使用预训练的深度网络来表示所有不同对象的多个通道特征,这些网络在某些固定层上进行优化。使用的深度网络通常是为了从大规模数据中提取丰富的知识,因此它们能够很好地表示通用对象。然而,这些网络过于复杂,无法表示特定的运动物体,导致泛化差劣以及高计算和内存成本。本文提出了一种名为通道剥离的新颖且通用的框架,以帮助深度跟踪器。为了验证通道剥离的有效性,我们以判别相关滤波器(DCF)和ECO为例。我们证明了整合公式可以将特征压缩、响应图生成和模型更新统一为一个能量最小化问题,以便在飞行中选择有用的特征通道,提高跟踪移动物体的有效性。通道剥离可以准确地提取好的通道,减轻噪音通道的影响,并通常减少通道数量,同时适应不同通道和网络。通过在流行基准上进行广泛的实验评估,我们充分证明了我们框架的有效性和通用性。
https://arxiv.org/abs/2409.11785
The recent success of large language models (LLMs) and the scaling law has led to a widespread adoption of larger models. Particularly in the healthcare industry, there is an increasing demand for locally operated LLMs due to security concerns. However, the majority of high-quality open-source LLMs have a size of 70B parameters, imposing significant financial burdens on users for GPU preparation and operation. To overcome these issues, we present a medical adaptation based on recent 7B models, which enables operation with low computational resources. We compare the performance on medical question-answering benchmarks in two languages (Japanese and English), demonstrating that its scores reach parity with or surpass those of currently existing medical LLMs that are ten times larger. We find that fine-tuning an English-centric base model on a Japanese medical dataset improves the score in both languages, supporting the effect of cross-lingual knowledge transfer. We hope that this study will alleviate financial challenges, serving as a stepping stone for clinical institutions to practically utilize LLMs locally. Our evaluation code is available at this https URL.
近年来大型语言模型(LLMs)的成功和扩展定律的普及,导致了许多大型模型的广泛采用。特别是在医疗行业,由于安全问题,对本地运营的LLM的需求不断增加。然而,大多数高质量的开源LLM的参数规模为70B,对用户来说,GPU的准备和运行成本带来了巨大的财务负担。为了克服这些问题,我们基于最近的7B模型提出了一个医疗适应性,使得在低计算资源下进行操作。我们比较了在两种语言(日本语和英语)上的医疗问题回答基准测试的成绩,证明了其分数与现有的医疗LLM相当或者超过了它们。我们发现,在英语-中心模型的基础上对日本医疗数据集进行微调,在两种语言上都有提高,这支持跨语言知识转移的效果。我们希望这项研究能够减轻财务负担,作为临床机构实际利用LLM的一步阶梯。我们的评估代码可以从该链接的https:// URL中获取。
https://arxiv.org/abs/2409.11783
In materials physics, characterization techniques are crucial for obtaining data on a material's physical properties as well as its structural, electronic, magnetic, optical, dielectric, and spectroscopic characteristics. However, for many materials, availability and safe access are not always easy to ensure or fully guaranteed. Moreover, modeling and simulation techniques require considerable theoretical knowledge, in addition to being costly in computation time and highly complex. Thus, analyzing multiple samples simultaneously with different techniques remains very challenging for engineers and researchers. It is worth noting that, despite being very risky, X-ray diffraction is a well-known and widely used characterization technique that gathers data on the structural properties of crystalline 1D, 2D, or 3D materials. In this paper, we propose a Smart GRU (Gated Recurrent Unit) model to forecast the structural characteristics or properties of thin films of tin oxide SnO$_2$(110). Thin-film samples are prepared and handled experimentally, and the collected data dictionary is then used to build an artificial intelligence (AI) GRU model for characterizing the structural properties of SnO$_2$(110) thin films.
在材料物理中,表征技术对于获取关于材料物理性质以及结构、电子、磁、光学、电学、和光谱特性的数据至关重要。然而,对于许多材料来说,确保可用性和安全访问并不总是容易的,并且完全值得信赖。此外,使用建模和仿真技术需要大量的理论知识,并且还涉及到昂贵的计算时间和复杂性很高的费用。因此,同时分析多个样品的人工工程师和研究人员仍然会面临很大的挑战。值得注意的是,尽管存在很大的风险,X射线衍射是一种已知且广泛使用的表征技术,可以从晶格结构的1d、2d或3d材料的结构性质中收集数据。我们在这篇论文中提出了一种智能GRU模型,用于预测SnO2(110)薄膜的结构特性或性质。事实上,薄膜样品通过实验方法进行详细处理,然后收集到的数据字典被用于生成AI-人工智能GRU模型,用于对SnO2(110)薄膜的结构特性进行表征。
https://arxiv.org/abs/2409.11782