Self-supervised contrastive learning has emerged as one of the most successful deep learning paradigms. In this regard, it has seen extensive use in image registration and, more recently, in the particular field of medical image registration. In this work, we test, extend, and improve ConKeD, a state-of-the-art framework for color fundus image registration. Using the ConKeD framework we test multiple loss functions, adapting them to the framework and the application domain. Furthermore, we evaluate our models using the standardized benchmark dataset FIRE as well as several datasets that have never been used before for color fundus registration, for which we are releasing the pairing data as well as a standardized evaluation approach. Our work demonstrates state-of-the-art performance across all datasets and metrics, showing several advantages over current SOTA color fundus registration methods.
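The abstract does not spell out which loss functions were tested, but the kind of contrastive descriptor objective a keypoint-based registration framework might adapt can be sketched as below. This is a generic InfoNCE-style loss over matched keypoint descriptors; the function name, tensor shapes, and temperature are illustrative assumptions rather than ConKeD's actual formulation.

```python
import torch
import torch.nn.functional as F

def info_nce_descriptor_loss(anchors, positives, temperature=0.07):
    """Toy InfoNCE-style loss over keypoint descriptors.

    anchors, positives: (N, D) tensors where row i of `positives` is the
    matching keypoint of row i of `anchors`; every other row acts as a negative.
    """
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)   # diagonal entries are the positives

# Example: 128 matched descriptor pairs of dimension 256
loss = info_nce_descriptor_loss(torch.randn(128, 256), torch.randn(128, 256))
```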
https://arxiv.org/abs/2404.16773
Diffusion-based technologies have made significant strides, particularly in personalized and customized facial generation. However, existing methods face challenges in achieving high-fidelity and detailed identity (ID) consistency, primarily due to insufficient fine-grained control over facial areas and the lack of a comprehensive strategy for ID preservation that fully considers intricate facial details and the overall face. To address these limitations, we introduce ConsistentID, an innovative method crafted for diverse identity-preserving portrait generation under fine-grained multimodal facial prompts, utilizing only a single reference image. ConsistentID comprises two key components: a multimodal facial prompt generator that combines facial features, corresponding facial descriptions and the overall facial context to enhance precision in facial details, and an ID-preservation network optimized through a facial attention localization strategy, aimed at preserving ID consistency in facial regions. Together, these components significantly enhance the accuracy of ID preservation by introducing fine-grained multimodal ID information from facial regions. To facilitate training of ConsistentID, we present a fine-grained portrait dataset, FGID, with over 500,000 facial images, offering greater diversity and comprehensiveness than existing public facial datasets such as LAION-Face, CelebA, FFHQ, and SFHQ. Experimental results substantiate that our ConsistentID achieves exceptional precision and diversity in personalized facial generation, surpassing existing methods on the MyStyle dataset. Furthermore, while ConsistentID introduces more multimodal ID information, it maintains a fast inference speed during generation.
https://arxiv.org/abs/2404.16771
Existing definitions and associated conceptual frameworks for computer-based system safety should be revisited in light of real-world experiences from deploying autonomous vehicles. Current terminology used by industry safety standards emphasizes mitigation of risk from specifically identified hazards, and carries assumptions based on human-supervised vehicle operation. Operation without a human driver dramatically increases the scope of safety concerns, especially due to operation in an open world environment, a requirement to self-enforce operational limits, participation in an ad hoc sociotechnical system of systems, and a requirement to conform to both legal and ethical constraints. Existing standards and terminology only partially address these new challenges. We propose updated definitions for core system safety concepts that encompass these additional considerations as a starting point for evolving safety approaches to address these additional safety challenges. These results might additionally inform framing safety terminology for other autonomous system applications.
https://arxiv.org/abs/2404.16768
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g. value networks, clipping) and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt via a direct policy parameterization, enabling a strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally tractable than PPO.
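The core idea of regressing relative rewards between two completions to the same prompt can be illustrated with a small sketch. The exact scaling and parameterization used by REBEL are not given in the abstract, so the loss below (squared error between the change in log-probability ratios and a scaled reward difference) should be read as an assumption-laden approximation rather than the paper's definition.

```python
import torch

def rebel_style_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
                     reward_a, reward_b, eta=1.0):
    """Sketch of a relative-reward regression objective.

    For each pair of completions (a, b) to the same prompt, regress the change
    in log-probability ratios under the current policy onto the reward
    difference. All inputs are 1-D tensors over a batch of pairs.
    """
    ratio_diff = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    target = eta * (reward_a - reward_b)        # scaled relative reward
    return torch.mean((ratio_diff - target) ** 2)

# Example with a batch of 4 completion pairs
b = 4
loss = rebel_style_loss(torch.randn(b), torch.randn(b), torch.randn(b),
                        torch.randn(b), torch.rand(b), torch.rand(b))
```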
https://arxiv.org/abs/2404.16767
While supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of a foundation large language model (LLM) to specific preferences, concerns have been raised about the depth of this alignment, with some critiques suggesting it is merely "superficial". We critically examine this hypothesis within the scope of cross-lingual generation tasks, proposing that the effectiveness of SFT may be constrained by its reliance on prior tokens to guide cross-lingual generation. Based on this crucial insight, and in response to the challenges posed by the costly and limited availability of non-English data for SFT, we introduce a novel training-free alignment method named PreTTY, which employs minimal task-related prior tokens to bridge the foundation LLM and the SFT LLM, achieving comparable performance without training. Experiments on machine translation and part-of-speech tagging across eight languages demonstrate the efficacy of PreTTY in cross-lingual settings. Remarkably, by initiating the decoding process with only one or two prior tokens, foundation LLMs can achieve performance comparable to their SFT counterparts. This method presents a cost-effective alternative to SFT and advances the democratization of multilingual LLMs.
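A minimal sketch of the prior-token idea, assuming a Hugging Face causal LM: decoding is seeded with one or two task-related tokens so the untuned foundation model continues in the desired format. The checkpoint name, prompt, and the specific prior token are placeholders, not details from the paper.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint; any causal foundation LLM plays the same role
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompt = "Translate English to German: The weather is nice today.\nGerman:"
# Key idea: seed decoding with one or two task-related prior tokens (here the
# German article "Das", an assumed example) so the foundation model continues
# in the target language without any fine-tuning.
prior_tokens = " Das"
inputs = tok(prompt + prior_tokens, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tok.decode(output[0], skip_special_tokens=True))
```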
https://arxiv.org/abs/2404.16766
Extracting who says what to whom is a crucial part of analyzing human communication in today's abundance of data such as online news articles. Yet, the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, Creative-Commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1,000 documents (250,000 tokens) in a fine-grained annotation schema enabling various downstream uses for the dataset. The annotations not only specify who said what but also how, in which context and to whom, and define the type of quotation. We specify our annotation schema, describe the creation of the dataset and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset and outline use cases of our dataset in downstream tasks.
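A hypothetical record type showing the roles the annotations cover (who, what, how, context, addressee, quotation type); the actual field names and offset conventions of the released WIKINEWS-based schema may differ.

```python
from dataclasses import dataclass
from typing import Tuple

Span = Tuple[int, int]  # (start, end) token offsets within a document

@dataclass
class QuotationAnnotation:
    """Hypothetical record mirroring the roles described in the abstract."""
    quote: Span          # what was said
    speaker: Span        # who said it
    cue: Span            # how it was said (reporting verb or cue phrase)
    addressee: Span      # to whom it was said, if given
    quote_type: str      # e.g. "direct", "indirect", "reported"

example = QuotationAnnotation(
    quote=(120, 148), speaker=(115, 117), cue=(118, 119),
    addressee=(150, 152), quote_type="direct")
```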
https://arxiv.org/abs/2404.16764
Developing generalist foundation models has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which emphasizes the requirement for open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models to extend the original dataset (over 25,692 non-contrast 3D chest CT volumes and reports from 20,000 patients) in the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide intermediate reasoning visual clues for interpretation; (ii) 665K multi-granularity grounded reports, where each sentence of the report is linked to the corresponding anatomical region of the CT volume in the form of a segmentation mask; (iii) 1.3M grounded VQA pairs, where questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have gone through manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models, by training them to generate text based on given segmentation regions, which is unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.
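To make the grounding concrete, a hypothetical layout of one grounded report sentence and one grounded VQA pair might look as follows; the identifiers, file paths, and field names are illustrative assumptions, not the released file format.

```python
# Hypothetical examples of the two grounded record types described above.
grounded_sentence = {
    "ct_volume_id": "case_000123",
    "sentence": "There is a 6 mm nodule in the right upper lobe.",
    "region": "right upper lobe",              # one of the 197 organ-level categories
    "segmentation_mask": "masks/case_000123/right_upper_lobe.nii.gz",
}

grounded_vqa_pair = {
    "ct_volume_id": "case_000123",
    "question": "Is there a nodule in the right upper lobe?",
    "answer": "Yes, a 6 mm nodule.",
    "reference_mask": "masks/case_000123/right_upper_lobe.nii.gz",
}
```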
https://arxiv.org/abs/2404.16754
We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss, Threshold-Adaptive Loss Scaling (TALS), that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses, but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allow us to train on in-the-wild data while improving 3D accuracy over the state of the art. Our models and code are available for research at this https URL.
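A minimal sketch of a thresholded keypoint loss in the spirit of TALS: errors below a threshold (inside the "invalid distance") contribute nothing, while gross errors are penalized. The exact thresholds and any smooth scaling used by the actual TALS formulation are not specified in the abstract, so this is only illustrative.

```python
import torch

def threshold_adaptive_loss(residuals, tau):
    """Illustrative thresholded loss.

    residuals: per-keypoint 2D (or p-GT) errors, shape (N,).
    tau: threshold below which errors fall inside the "invalid distance"
         and receive no penalty.
    """
    excess = torch.clamp(residuals - tau, min=0.0)  # zero penalty below tau
    return excess.mean()

# Example: 17 keypoint residuals in pixels, threshold of 3 px (assumed values)
loss = threshold_adaptive_loss(torch.rand(17) * 10.0, tau=3.0)
```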
https://arxiv.org/abs/2404.16752
This paper addresses the task of 3D clothed human generation from textual descriptions. Previous works usually encode the human body and clothes as a holistic model and generate the whole model in a single-stage optimization, which makes clothing editing difficult and loses fine-grained control over the whole generation process. To solve this, we propose a layer-wise clothed human representation combined with a progressive optimization strategy, which produces clothing-disentangled 3D human models while providing control over the generation process. The basic idea is to progressively generate a minimally clothed human body and layer-wise clothes. During clothing generation, a novel stratified compositional rendering method is proposed to fuse multi-layer human models, and a new loss function is utilized to help decouple the clothing model from the human body. The proposed method achieves high-quality disentanglement, which thereby provides an effective way for 3D garment generation. Extensive experiments demonstrate that our approach achieves state-of-the-art 3D clothed human generation while also supporting cloth editing applications such as virtual try-on. Project page: this http URL
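The fusion of multi-layer human models can be pictured with a generic back-to-front "over" compositing operator, sketched below; the paper's stratified compositional rendering is likely more involved, so this is only a conceptual stand-in.

```python
import torch

def composite_layers(rgb_layers, alpha_layers):
    """Back-to-front alpha compositing of per-layer renderings
    (body first, then each clothing layer).

    rgb_layers:   list of (3, H, W) tensors, ordered body -> outermost cloth.
    alpha_layers: list of (1, H, W) tensors with values in [0, 1].
    """
    out = torch.zeros_like(rgb_layers[0])
    for rgb, alpha in zip(rgb_layers, alpha_layers):
        out = alpha * rgb + (1.0 - alpha) * out   # later layers occlude earlier ones
    return out

# Example: body plus two clothing layers at 64x64 resolution
img = composite_layers([torch.rand(3, 64, 64) for _ in range(3)],
                       [torch.rand(1, 64, 64) for _ in range(3)])
```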
https://arxiv.org/abs/2404.16748
Word error rate (WER) is a metric used to evaluate the quality of transcriptions produced by Automatic Speech Recognition (ASR) systems. In many applications, it is of interest to estimate WER given a pair of a speech utterance and a transcript. Previous work on WER estimation focused on building models that are trained with a specific ASR system in mind (referred to as ASR system-dependent). These are also domain-dependent and inflexible in real-world applications. In this paper, a hypothesis generation method for ASR System-Independent WER estimation (SIWE) is proposed. In contrast to prior work, the WER estimators are trained using data that simulates ASR system output. Hypotheses are generated using phonetically similar or linguistically more likely alternative words. In WER estimation experiments, the proposed method reaches a similar performance to ASR system-dependent WER estimators on in-domain data and achieves state-of-the-art performance on out-of-domain data. On the out-of-domain data, the SIWE model outperformed the baseline estimators in root mean square error and Pearson correlation coefficient by 17.58% and 18.21% relative, respectively, on Switchboard and CALLHOME. The performance was further improved when the WER of the training set was close to the WER of the evaluation dataset.
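For reference, the quantity being estimated is the standard word error rate: the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, divided by the reference length. A straightforward dynamic-programming implementation:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # ~0.333
```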
https://arxiv.org/abs/2404.16743
Cancelable biometrics is a challenging research field in which the security of an original biometric image is ensured by transforming the original biometric into another, irreversible domain. Several approaches have been suggested in the literature for generating cancelable biometric templates. In this paper, two novel and simple cancelable biometric template generation methods based on Random Walk (CBRW) are proposed. By employing a random walk and the other steps given in the two proposed algorithms, CBRW-BitXOR and CBRW-BitCMP, the original biometric is transformed into a cancelable template. The performance of the proposed methods is compared with other state-of-the-art methods. Experiments have been performed on eight publicly available gray and color datasets, i.e. CP (ear) (gray and color), UTIRIS (iris) (gray and color), ORL (face) (gray), IIT Delhi (iris) (gray and color), and AR (face) (color). Performance of the generated templates is measured in terms of Correlation Coefficient (Cr), Root Mean Square Error (RMSE), Peak Signal to Noise Ratio (PSNR), Structural Similarity (SSIM), Mean Absolute Error (MAE), Number of Pixel Change Rate (NPCR), and Unified Average Changing Intensity (UACI). The experimental results show that the proposed methods are superior to other state-of-the-art methods in both qualitative and quantitative analysis. Furthermore, CBRW performs better on both gray and color images.
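The abstract names the two algorithms but not their steps, so the following is only a loose sketch of how a seeded random walk could build a revocable key that is XOR-ed with the biometric image; the real CBRW-BitXOR procedure may differ substantially.

```python
import numpy as np

def cbrw_bitxor_sketch(biometric: np.ndarray, seed: int) -> np.ndarray:
    """Loose sketch of a random-walk + bit-XOR cancelable template.

    A seeded random walk over the image builds a key image, which is XOR-ed
    with the original; changing the seed revokes and reissues the template.
    """
    rng = np.random.default_rng(seed)
    h, w = biometric.shape
    key = np.zeros((h, w), dtype=np.uint8)
    r, c = rng.integers(h), rng.integers(w)
    for _ in range(h * w):
        key[r, c] ^= rng.integers(0, 256, dtype=np.uint8)
        r = (r + rng.integers(-1, 2)) % h     # random walk with wrap-around
        c = (c + rng.integers(-1, 2)) % w
    return np.bitwise_xor(biometric.astype(np.uint8), key)

# Example on a synthetic 64x64 grayscale "biometric" image
template = cbrw_bitxor_sketch(
    np.random.randint(0, 256, (64, 64), dtype=np.uint8), seed=42)
```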
https://arxiv.org/abs/2404.16739
This paper presents a novel learning approach for the Dubins Traveling Salesman Problem (DTSP) with Neighborhoods (DTSPN) to quickly produce a tour of a non-holonomic vehicle passing through neighborhoods of given task points. The method involves two learning phases: initially, a model-free reinforcement learning approach leverages privileged information to distill knowledge from expert trajectories generated by the Lin-Kernighan heuristic (LKH) algorithm. Subsequently, a supervised learning phase trains an adaptation network to solve problems independently of privileged information. Before the first learning phase, a parameter initialization technique using the demonstration data was also devised to enhance training efficiency. The proposed learning method produces a solution about 50 times faster than LKH and substantially outperforms other imitation learning and RL-with-demonstration schemes, most of which fail to sense all the task points.
https://arxiv.org/abs/2404.16721
Detection of malignant lesions on mammography images is extremely important for early breast cancer diagnosis. In clinical practice, images are acquired from two different angles, and radiologists can fully utilize information from both views, simultaneously locating the same lesion. However, for automatic detection approaches, such information fusion remains a challenge. In this paper, we propose a new model called MAMM-Net, which allows the processing of both mammography views simultaneously by sharing information not only on an object level, as seen in existing works, but also on a feature level. MAMM-Net's key component is the Fusion Layer, based on deformable attention and designed to increase detection precision while keeping high recall. Our experiments show superior performance on the public DDSM dataset compared to the previous state-of-the-art model, while introducing new helpful features such as pixel-level lesion annotation and classification of lesion malignancy.
https://arxiv.org/abs/2404.16718
Vision-language models enable open-world classification of objects without the need for any retraining. While this zero-shot paradigm marks a significant advance, even today's best models exhibit skewed performance when objects are dissimilar from their typical depiction. Real-world objects such as pears appear in a variety of forms -- from diced to whole, on a table or in a bowl -- yet standard VLM classifiers map all instances of a class to a single vector based on the class label. We argue that to represent this rich diversity within a class, zero-shot classification should move beyond a single vector. We propose a method to encode and account for diversity within a class using inferred attributes, still in the zero-shot setting without retraining. We find our method consistently outperforms standard zero-shot classification over a large suite of datasets encompassing hierarchies, diverse object states, and real-world geographic diversity, as well as finer-grained datasets where intra-class diversity may be less prevalent. Importantly, our method is inherently interpretable, offering faithful explanations for each inference to facilitate model debugging and enhance transparency. We also find our method scales efficiently to a large number of attributes to account for diversity -- leading to more accurate predictions for atypical instances. Finally, we characterize a principled trade-off between overall and worst class accuracy, which can be tuned via a hyperparameter of our method. We hope this work spurs further research into the promise of zero-shot classification beyond a single class vector for capturing diversity in the world, and building transparent AI systems without compromising performance.
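The move from one class vector to several attribute-conditioned vectors per class can be sketched as below, assuming precomputed VLM embeddings. The prompt set and the max-over-attributes aggregation are assumptions for illustration; the paper's exact inference rule is not given in the abstract.

```python
import torch
import torch.nn.functional as F

def classify_with_attribute_vectors(image_emb, class_attr_embs):
    """Score an image against several attribute-conditioned text vectors per
    class instead of a single class-label vector, then aggregate per class.

    image_emb: (D,) embedding from a VLM image encoder.
    class_attr_embs: dict mapping class name -> (K, D) tensor of text
        embeddings, e.g. for prompts like "a photo of a pear, diced" or
        "a photo of a pear in a bowl".
    """
    image_emb = F.normalize(image_emb, dim=0)
    scores = {}
    for name, embs in class_attr_embs.items():
        sims = F.normalize(embs, dim=1) @ image_emb   # (K,) cosine similarities
        scores[name] = sims.max().item()              # best-matching attribute
    return max(scores, key=scores.get), scores

# Example with random embeddings standing in for VLM outputs
pred, per_class = classify_with_attribute_vectors(
    torch.randn(512),
    {"pear": torch.randn(4, 512), "apple": torch.randn(3, 512)})
```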
https://arxiv.org/abs/2404.16717
We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with the remaining layers of the model. Our proposed self-speculative decoding approach has a smaller memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes and different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task. We implement our inference solution and show speedups of up to 2.16x on summarization of CNN/DM documents, 1.82x on coding, and 2.0x on the TOPv2 semantic parsing task.
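The first ingredient, depth-dependent layer dropout, can be sketched as follows; the linear ramp of dropout rates and the stochastic skipping loop are assumptions for illustration, since the abstract does not give LayerSkip's exact schedule.

```python
import torch

def layerwise_dropout_rates(num_layers: int, max_rate: float = 0.2):
    """Monotone schedule: earlier layers get low dropout, later layers higher.
    A linear ramp is assumed here purely for illustration."""
    return [max_rate * l / max(num_layers - 1, 1) for l in range(num_layers)]

def stochastic_depth_forward(hidden, layers, rates, training=True):
    """Skip each transformer layer independently with its scheduled rate;
    a shared head/exit would read `hidden` at any depth."""
    for layer, p in zip(layers, rates):
        if training and torch.rand(()) < p:
            continue                      # drop (skip) this layer
        hidden = layer(hidden)
    return hidden

# Example: 8 stand-in layers with a linearly increasing drop rate
rates = layerwise_dropout_rates(8)        # [0.0, 0.028..., ..., 0.2]
out = stochastic_depth_forward(torch.randn(2, 16),
                               [torch.nn.Identity()] * 8, rates)
```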
https://arxiv.org/abs/2404.16710
We propose a novel multi-stage trans-dimensional architecture for multi-view cardiac image segmentation. Our method exploits the relationship between long-axis (2D) and short-axis (3D) magnetic resonance (MR) images to perform a sequential 3D-to-2D-to-3D segmentation, segmenting the long-axis and short-axis images. In the first stage, 3D segmentation is performed using the short-axis image, and the prediction is transformed to the long-axis view and used as a segmentation prior in the next stage. In the second stage, the heart region is localized and cropped around the segmentation prior using a Heart Localization and Cropping (HLC) module, focusing the subsequent model on the heart region of the image, where a 2D segmentation is performed. Similarly, we transform the long-axis prediction to the short-axis view, localize and crop the heart region and again perform a 3D segmentation to refine the initial short-axis segmentation. We evaluate our proposed method on the Multi-Disease, Multi-View & Multi-Center Right Ventricular Segmentation in Cardiac MRI (M&Ms-2) dataset, where our method outperforms state-of-the-art methods in segmenting cardiac regions of interest in both short-axis and long-axis images. The pre-trained models, source code, and implementation details will be publicly available.
https://arxiv.org/abs/2404.16708
Navigating mobile robots in social environments remains a challenging task due to the intricacies of human-robot interactions. Most of the motion planners designed for crowded and dynamic environments focus on choosing the best velocity to reach the goal while avoiding collisions, but do not explicitly consider high-level navigation behavior (avoiding others on the left or right side, letting others pass or passing before them, etc.). In this work, we present a novel motion planner that incorporates topologically distinct paths representing diverse navigation strategies around humans. The planner selects the topology class that best imitates human behavior using a deep neural network model trained on real-world human motion data, ensuring socially intelligent and contextually aware navigation. Our system refines the chosen path through an optimization-based local planner in real time, ensuring seamless adherence to desired social behaviors. In this way, we decouple perception and local planning from the decision-making process. We evaluate the prediction accuracy of the network with real-world data. In addition, we assess the navigation capabilities in both simulation and a real-world platform, comparing it with other state-of-the-art planners. We demonstrate that our planner exhibits socially desirable behaviors and shows smooth and remarkable performance.
https://arxiv.org/abs/2404.16705
In the rapidly evolving field of artificial intelligence, ensuring safe decision-making by Large Language Models (LLMs) is a significant challenge. This paper introduces Governance of the Commons Simulation (GovSim), a simulation platform designed to study strategic interactions and cooperative decision-making in LLMs. Through this simulation environment, we explore the dynamics of resource sharing among AI agents, highlighting the importance of ethical considerations, strategic planning, and negotiation skills. GovSim is versatile and supports any text-based agent, including LLM agents. Using the Generative Agent framework, we create a standard agent that facilitates the integration of different LLMs. Our findings reveal that within GovSim, only two out of 15 tested LLMs managed to achieve a sustainable outcome, indicating a significant gap in the ability of models to manage shared resources. Furthermore, we find that by removing the agents' ability to communicate, they overuse the shared resource, highlighting the importance of communication for cooperation. Interestingly, most LLMs lack the ability to make universalized hypotheses, which highlights a significant weakness in their reasoning skills. We open-source the full suite of our research results, including the simulation environment, agent prompts, and a comprehensive web interface.
https://arxiv.org/abs/2404.16698
This report lists 13 functional conditions, cashed out in computational terms, that have been argued to be constituent of conscious valenced experience. These are extracted from existing empirical and theoretical literature on, among others, animal sentience, medical disorders, anaesthetics, philosophy, evolution, neuroscience, and artificial intelligence.
https://arxiv.org/abs/2404.16696
We explored the addition bias, a cognitive tendency to prefer adding elements over removing them to alter an initial state or structure, by conducting four preregistered experiments examining the problem-solving behavior of both humans and OpenAI's GPT-4 large language model. The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model. The problem-solving task was either to create symmetry within a grid (Experiments 1 and 3) or to edit a summary (Experiments 2 and 4). As hypothesized, we found that overall, the addition bias was present. Solution efficiency (Experiments 1 and 2) and valence of the instruction (Experiments 3 and 4) played important roles. Human participants were less likely to use additive strategies when subtraction was relatively more efficient than when addition and subtraction were equally efficient. GPT-4 exhibited the opposite behavior, with a strong addition bias when subtraction was more efficient. In terms of instruction valence, GPT-4 was more likely to add words when asked to "improve" compared to "edit", whereas humans did not show this effect. When we looked at the addition bias under different conditions, we found more biased responses for GPT-4 compared to humans. Our findings highlight the importance of considering comparable and sometimes superior subtractive alternatives, as well as reevaluating one's own, and particularly the language models', problem-solving behavior.
https://arxiv.org/abs/2404.16692