Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remains prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them. In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.
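As an illustration of the serving problem CoE poses, here is a minimal Python sketch of expert routing with an LRU-style fast-memory cache; the router, loader, and capacity numbers are hypothetical stand-ins, not the Samba-CoE or SN40L software interface. The three-tier idea is that evicted experts remain in a large DDR tier and are re-staged into HBM on demand.

```python
# Minimal sketch of Composition of Experts (CoE) serving, under assumed
# interfaces: `router` maps a prompt to an expert id, and `expert_loader`
# stages that expert from the capacity tier (DDR) into fast memory (HBM).
from collections import OrderedDict

class CoEServer:
    def __init__(self, router, expert_loader, hbm_capacity=4):
        self.router = router              # prompt -> expert id
        self.loader = expert_loader       # expert id -> callable model
        self.hbm_capacity = hbm_capacity  # experts resident in fast memory
        self.resident = OrderedDict()     # LRU cache of staged experts

    def generate(self, prompt):
        eid = self.router(prompt)
        if eid not in self.resident:
            if len(self.resident) >= self.hbm_capacity:
                self.resident.popitem(last=False)  # evict least-recently-used
            self.resident[eid] = self.loader(eid)  # dynamic model switch
        self.resident.move_to_end(eid)
        return self.resident[eid](prompt)
```

In this sketch, model-switching latency is dominated by `expert_loader`, which is exactly the cost the paper's three-tier memory system is designed to reduce.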
https://arxiv.org/abs/2405.07518
Many robotic systems, such as mobile manipulators or quadrotors, cannot be equipped with high-end GPUs due to space, weight, and power constraints. These constraints prevent these systems from leveraging recent developments in visuomotor policy architectures that require high-end GPUs to achieve fast policy inference. In this paper, we propose Consistency Policy, a faster and similarly powerful alternative to Diffusion Policy for learning visuomotor robot control. By virtue of its fast inference speed, Consistency Policy can enable low latency decision making in resource-constrained robotic setups. A Consistency Policy is distilled from a pretrained Diffusion Policy by enforcing self-consistency along the Diffusion Policy's learned trajectories. We compare Consistency Policy with Diffusion Policy and other related speed-up methods across 6 simulation tasks as well as two real-world tasks where we demonstrate inference on a laptop GPU. For all these tasks, Consistency Policy speeds up inference by an order of magnitude compared to the fastest alternative method and maintains competitive success rates. We also show that the Consistency Policy training procedure is robust to the pretrained Diffusion Policy's quality, a useful result that helps practitioners avoid extensive testing of the pretrained model. Key design decisions that enabled this performance are the choice of consistency objective, reduced initial sample variance, and the choice of preset chaining steps. Code and training details will be released publicly.
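The distillation step can be pictured with a short, hedged PyTorch sketch: two adjacent points on a teacher trajectory should map to the same action under the student. The function signatures, single-step ODE solver, and squared-error metric are assumptions for illustration, not the authors' exact objective (which also involves the consistency-objective and chaining choices noted above).

```python
# Hedged sketch of consistency distillation: enforce self-consistency of the
# student along the pretrained Diffusion Policy's probability-flow trajectory.
import torch

def consistency_distillation_loss(student, teacher_ode_step, x_t, t, t_next, obs):
    with torch.no_grad():
        x_next = teacher_ode_step(x_t, t, t_next, obs)  # one teacher ODE step
        target = student(x_next, t_next, obs)           # usually an EMA copy
    pred = student(x_t, t, obs)
    # Both points on the same trajectory should map to the same action.
    return torch.mean((pred - target) ** 2)
```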
https://arxiv.org/abs/2405.07503
The framework of Pearl's Causal Hierarchy (PCH) formalizes three types of reasoning (observational, interventional, and counterfactual) that reflect the progressive sophistication of human thought regarding causation. We investigate the computational complexity aspects of reasoning in this framework, focusing mainly on satisfiability problems expressed in probabilistic and causal languages across the PCH. That is, given a system of formulas in the standard probabilistic and causal languages, does there exist a model satisfying the formulas? The resulting complexity changes depending on the level of the hierarchy as well as the operators allowed in the formulas (addition, multiplication, or marginalization). We focus on formulas involving marginalization that are widely used in probabilistic and causal inference, but whose complexity issues are still little explored. Our main contributions are exact computational complexity results showing that linear languages (allowing addition and marginalization) yield NP^PP-, PSPACE-, and NEXP-complete satisfiability problems, depending on the level of the PCH. Moreover, we prove that the problem for the full language (additionally allowing multiplication) is complete for the class succ$\exists$R for languages on the highest, counterfactual level. Previous work has shown that the satisfiability problem is complete for succ$\exists$R on the lower levels, leaving the counterfactual case open. Finally, we consider constrained models that are restricted to a small polynomial size. The constraint on the size reduces the complexity of the interventional and counterfactual languages to NEXP-complete.
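To make the satisfiability question concrete, here is a small illustrative instance (our example, not one from the paper) in a linear language, where only addition and marginalization over distribution terms are allowed:

```latex
% Does any model over binary variables X, Y, Z satisfy all three constraints?
\begin{align*}
  &\sum_{z} P(x_1, z) = P(y_1)          && \text{(marginalization)}\\
  &P(x_1) + P(y_1) \le 1                && \text{(addition)}\\
  &P(y_1 \mid \mathrm{do}(x_1)) \ge 1/2 && \text{(interventional level)}
\end{align*}
```

Deciding systems like this is what the paper shows to be NP^PP-, PSPACE-, or NEXP-complete, depending on which PCH level the terms come from.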
https://arxiv.org/abs/2405.07373
Blackgrass (Alopecurus myosuroides) is a competitive weed that has wide-ranging impacts on food security by reducing crop yields and increasing cultivation costs. In addition to the financial burden on agriculture, the application of herbicides as a preventive measure against blackgrass can negatively affect access to clean water and sanitation. The WeedScout project introduces Real-Time Autonomous Black-Grass Classification and Mapping (RT-ABGCM), a cutting-edge solution tailored for real-time detection of blackgrass for precision weed management practices. Leveraging Artificial Intelligence (AI) algorithms, the system processes live image feeds, infers blackgrass density, and covers two stages of maturation. The research investigates the deployment of You Only Look Once (YOLO) models, specifically the streamlined YOLOv8 and YOLO-NAS, accelerated at the edge with the NVIDIA Jetson Nano (NJN). By optimising inference speed and model performance, the project advances the integration of AI into agricultural practices, offering potential solutions to challenges such as herbicide resistance and environmental impact. Additionally, two datasets and model weights are made available to the research community, facilitating further advancements in weed detection and precision farming technologies.
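A hedged sketch of what edge inference with the streamlined YOLOv8 variant might look like via the `ultralytics` package follows; the checkpoint name `blackgrass.pt` and the class-to-maturation mapping are hypothetical, since the released WeedScout weights are not named here.

```python
# Sketch of live blackgrass detection on a Jetson-class device with YOLOv8.
from ultralytics import YOLO

model = YOLO("blackgrass.pt")           # hypothetical fine-tuned checkpoint
results = model.predict(source=0,       # live camera feed
                        imgsz=640,
                        conf=0.25,
                        stream=True)    # generator: one result per frame
for r in results:
    # class ids would map to the two maturation stages; density can be
    # estimated from per-frame detection counts
    print(len(r.boxes), r.boxes.cls.tolist())
```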
https://arxiv.org/abs/2405.07349
This paper explores the seamless integration of Generative AI (GenAI) and Evolutionary Algorithms (EAs) within the domain of large-scale multi-objective optimization. Focusing on the transformative role of Large Language Models (LLMs), our study investigates the potential of LLM-Assisted Inference to automate and enhance decision-making processes. Specifically, we highlight its effectiveness in illuminating key decision variables in evolutionarily optimized solutions while articulating contextual trade-offs. Tailored to address the challenges inherent in inferring complex multi-objective optimization solutions at scale, our approach emphasizes the adaptive nature of LLMs, allowing them to provide nuanced explanations and align their language with diverse stakeholder expertise levels and domain preferences. Empirical studies underscore the practical applicability and impact of LLM-Assisted Inference in real-world decision-making scenarios.
https://arxiv.org/abs/2405.07212
Online Test-Time Adaptation (OTTA) has emerged as an effective strategy to handle distributional shifts, allowing on-the-fly adaptation of pre-trained models to new target domains during inference, without the need for source data. We uncovered that the widely studied entropy minimization (EM) method for OTTA suffers from noisy gradients due to ambiguity near decision boundaries and incorrect low-entropy predictions. To overcome these limitations, this paper introduces a novel cosine alignment optimization approach with a dual-objective loss function that refines the precision of class predictions and adaptability to novel domains. Specifically, our method optimizes the cosine similarity between feature vectors and class weight vectors, enhancing the precision of class predictions and the model's adaptability to novel domains. Our method outperforms state-of-the-art techniques and sets a new benchmark on multiple datasets, including CIFAR-10-C, CIFAR-100-C, ImageNet-C, Office-Home, and DomainNet, demonstrating high accuracy and robustness against diverse corruptions and domain shifts.
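A minimal PyTorch sketch of the cosine-alignment idea follows; the hard pseudo-labeling rule and the second objective's weight `lam` are our assumptions about how a dual-objective loss could be assembled, not the paper's exact formulation.

```python
# Sketch: align normalized features with their (pseudo-labeled) class weight
# vectors by maximizing cosine similarity, plus a weighted second objective.
import torch
import torch.nn.functional as F

def cosine_alignment_loss(features, class_weights, lam=1.0):
    # features: (B, D) penultimate-layer outputs; class_weights: (C, D)
    cos = F.normalize(features, dim=1) @ F.normalize(class_weights, dim=1).T
    pseudo = cos.argmax(dim=1)                            # hard pseudo-labels
    align = 1.0 - cos[torch.arange(cos.size(0)), pseudo]  # 1 - cosine similarity
    probs = cos.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
    return (align + lam * entropy).mean()
```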
https://arxiv.org/abs/2405.07171
Generative Artificial Intelligence (GAI) is taking the world by storm with its unparalleled content creation ability. Large Language Models (LLMs) are at the forefront of this movement. However, the significant resource demands of LLMs often require cloud hosting, which raises issues regarding privacy, latency, and usage limitations. Although edge intelligence has long been utilized to solve these challenges by enabling real-time AI computation on ubiquitous edge resources close to data sources, most research has focused on traditional AI models and has left a gap in addressing the unique characteristics of LLM inference, such as considerable model size, auto-regressive processes, and self-attention mechanisms. In this paper, we present an edge intelligence optimization problem tailored for LLM inference. Specifically, with the deployment of the batching technique and model quantization on resource-limited edge devices, we formulate an inference model for transformer decoder-based LLMs. Furthermore, our approach aims to maximize the inference throughput via batch scheduling and joint allocation of communication and computation resources, while also considering edge resource constraints and varying user requirements of latency and accuracy. To address this NP-hard problem, we develop an optimal Depth-First Tree-Searching algorithm with online tree-Pruning (DFTSP) that operates within a feasible time complexity. Simulation results indicate that DFTSP surpasses other batching benchmarks in throughput across diverse user settings and quantization techniques, and it reduces time complexity by over 45% compared to the brute-force searching method.
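The search itself can be pictured as a generic depth-first traversal with an optimistic upper bound used for online pruning; the sketch below fixes only that skeleton, while the bound, feasibility checks, and node encoding stand in for the paper's DFTSP specifics.

```python
# Generic depth-first tree search with online pruning for batch scheduling:
# branches whose optimistic throughput bound cannot beat the incumbent are cut.
def dfs_schedule(node, best, upper_bound, children, is_leaf, throughput):
    if upper_bound(node) <= best["value"]:
        return                              # prune: bound can't beat incumbent
    if is_leaf(node):
        v = throughput(node)
        if v > best["value"]:
            best["value"], best["plan"] = v, node
        return
    for child in children(node):            # expand feasible assignments
        dfs_schedule(child, best, upper_bound, children, is_leaf, throughput)

# usage: best = {"value": float("-inf"), "plan": None}
#        dfs_schedule(root, best, bound_fn, children_fn, leaf_fn, throughput_fn)
```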
https://arxiv.org/abs/2405.07140
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose to take advantage of existing pre-trained large-scale vision and language models to directly generate captions with test time adaptation. Specifically, we bridge video and text using three key models: a general video understanding model XCLIP, a general image understanding model CLIP, and a text generation model GPT-2, due to their source-code availability. The main challenge is how to enable the text generation model to be sufficiently aware of the content in a given video so as to generate corresponding captions. To address this problem, we propose using learnable tokens as a communication medium between frozen GPT-2 and frozen XCLIP as well as frozen CLIP. Differing from the conventional way to train these tokens with training data, we update these tokens with pseudo-targets of the inference data under several carefully crafted loss functions which enable the tokens to absorb video information catered for GPT-2. This procedure can be done in just a few iterations (we use 16 iterations in the experiments) and does not require ground truth data. Extensive experimental results on three widely used datasets, MSR-VTT, MSVD, and VATEX, show 4% to 20% improvements in terms of the main metric CIDEr compared to the existing state-of-the-art methods.
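A hedged sketch of the token-update loop follows: only a handful of learnable tokens receive gradients, while GPT-2, CLIP, and XCLIP stay frozen. The token shape, optimizer, learning rate, and the `caption_loss` callback are illustrative stand-ins for the paper's crafted objectives; the 16-iteration budget is from the abstract.

```python
# Test-time update of learnable communication tokens; all backbones frozen.
import torch

tokens = torch.randn(8, 768, requires_grad=True)   # learnable prefix tokens
opt = torch.optim.Adam([tokens], lr=1e-2)

def adapt_to_video(video_feat, caption_loss, iters=16):
    # caption_loss(tokens, video_feat): discrepancy between GPT-2 text
    # generated from the tokens and pseudo-targets from frozen XCLIP/CLIP.
    for _ in range(iters):
        opt.zero_grad()
        caption_loss(tokens, video_feat).backward()
        opt.step()
    return tokens.detach()
```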
https://arxiv.org/abs/2405.07046
Existing subject-driven text-to-image generation models suffer from tedious fine-tuning steps and struggle to maintain both text-image alignment and subject fidelity. When generating compositional subjects, they often encounter problems such as object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance to intervene in the generative process during inference time. This approach strengthens the attention map, allowing for precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation. Furthermore, we propose a novel metric, GroundingScore, to evaluate subject alignment thoroughly. The obtained quantitative results serve as compelling evidence showcasing the effectiveness of our proposed method. The code will be released soon.
https://arxiv.org/abs/2405.06948
Semi-supervised anomaly detection for guaranteeing the reliability of intelligent systems has received increasing attention. However, existing methods rely too much on data correlation and neglect causality, which can be misleading due to confounding factors and affect system reliability. Additionally, the current reinforcement learning anomaly detection methods can effectively identify known and unknown anomalies in environments with limited labeled samples. Despite their effectiveness, these methods still face several challenges, such as under-utilization of prior knowledge, lack of model flexibility, and insufficient reward feedback when interacting with the environment. To address the above problems, this paper innovatively constructs a counterfactual causal reinforcement learning model, termed Triple-Assisted Causal Reinforcement Learning Anomaly Detector (Tri-CRLAD). The model utilizes the causal inference mechanism to radically improve the performance of semi-supervised models and enhance the model's ability to uncover anomaly data in the face of unknown or rare data. In addition, Tri-CRLAD features a triple decision support mechanism, namely, a sampling strategy based on historical similarity, an adaptive threshold smoothing adjustment strategy, and an adaptive decision reward mechanism. These mechanisms further enhance the flexibility and generalization ability of the model, enabling it to effectively respond to various complex and dynamically changing environments. Finally, Tri-CRLAD matches or exceeds the performance of 9 baseline methods across 7 diverse intelligent system datasets, including satellite systems, medical systems, and health systems. Moreover, anomaly detection stability was significantly improved by up to 23% with an extremely small number of known anomaly samples. Our code is available at this https URL
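One of the three support mechanisms, adaptive threshold smoothing, is simple enough to sketch; the exponential-moving-average form and its constants below are our illustrative guess at such a strategy, not the published Tri-CRLAD rule.

```python
# Sketch of adaptive threshold smoothing: the anomaly-decision threshold
# drifts smoothly with the observed score stream instead of jumping.
class AdaptiveThreshold:
    def __init__(self, init=0.5, momentum=0.9):
        self.value, self.momentum = init, momentum

    def update(self, score):
        # blend the incumbent threshold with the newest anomaly score
        self.value = self.momentum * self.value + (1 - self.momentum) * score
        return self.value

    def is_anomaly(self, score):
        return score > self.value
```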
https://arxiv.org/abs/2405.06925
This work examines the reproducibility and benchmarking of state-of-the-art real-time object detection models. As object detection models are often used in real-world contexts, such as robotics, where inference time is paramount, simply measuring models' accuracy is not enough to compare them. We thus compare a large variety of object detection models' accuracy and inference speed on multiple graphics cards. In addition to this large benchmarking attempt, we also reproduce the following models from scratch using PyTorch on the MS COCO 2017 dataset: DETR, RTMDet, ViTDet and YOLOv7. More importantly, we propose a unified training and evaluation pipeline, based on MMDetection's features, to better compare models. Our implementation of DETR and ViTDet could not achieve accuracy or speed performances comparable to what is declared in the original papers. On the other hand, reproduced RTMDet and YOLOv7 could match such performances. The studied papers are also found to be generally lacking in reproducibility detail. As for MMDetection pretrained models, speed performances are severely reduced with limited computing resources (larger, more accurate models even more so). Moreover, results exhibit a strong trade-off between accuracy and speed, with anchor-free models prevailing - notably RTMDet or YOLOx models. The code used in this paper and for all the experiments is available in the repository at this https URL.
https://arxiv.org/abs/2405.06911
We introduce SAM3D, a new approach to semi-automatic zero-shot segmentation of 3D images building on the existing Segment Anything Model. We achieve fast and accurate segmentations in 3D images with a four-step strategy comprising: volume slicing along non-orthogonal axes, efficient prompting in 3D, slice-wise inference using the pretrained SAM, and recomposition and refinement in 3D. We evaluate SAM3D performance qualitatively on an array of imaging modalities and anatomical structures and quantify performance for specific organs in body CT and tumors in brain MRI. By enabling users to create 3D segmentations of unseen data quickly and with dramatically reduced manual input, these methods have the potential to aid surgical planning and education, diagnostic imaging, and scientific research.
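The four-step strategy maps naturally onto a short loop; in this hedged sketch, `sam_predict` (prompted 2D SAM inference on one slice) and `refine_3d` are assumed helpers, and the non-orthogonal slicing is reduced to a single `axis` choice for brevity.

```python
# Sketch of SAM3D's pipeline: slice, prompt, 2D SAM inference, recompose/refine.
import numpy as np

def sam3d_segment(volume, prompts, sam_predict, refine_3d, axis=0):
    slices = np.moveaxis(volume, axis, 0)              # step 1: volume slicing
    masks = [sam_predict(sl, prompts.get(i))           # steps 2-3: prompt + SAM
             for i, sl in enumerate(slices)]
    mask3d = np.stack(masks, axis=0)                   # step 4: recomposition
    return refine_3d(np.moveaxis(mask3d, 0, axis))     # ...and 3D refinement
```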
https://arxiv.org/abs/2405.06786
Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training large language models requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). We present a method to uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget. This allows us to leverage the strong pre-training data and performance of existing transformer LLMs, while requiring 5% of the training cost. We find that our linearization technique leads to competitive performance on standard benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest linear models. Our code and models can be found at this https URL.
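The fixed-size recurrent state that motivates this line of work can be seen in a short sketch of kernelized linear attention: replacing softmax(QK^T)V with phi(Q)(phi(K)^T V) lets decoding carry only running sums whose size is independent of context length. The feature map and normalization below are common choices, not necessarily SUPRA's.

```python
# Per-token recurrent step of linear attention with a fixed-size state (S, z).
import torch

def linear_attention_step(q_t, k_t, v_t, S, z):
    phi = lambda x: torch.nn.functional.elu(x) + 1   # positive feature map
    fk = phi(k_t)                                    # (D,)
    S = S + torch.outer(fk, v_t)                     # accumulate K-V products
    z = z + fk                                       # accumulate normalizer
    fq = phi(q_t)
    out = (fq @ S) / (fq @ z).clamp_min(1e-6)        # attention output
    return out, S, z                                 # O(1) state per layer
```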
https://arxiv.org/abs/2405.06640
Aligning Large Language Models (LLMs) to cater to different human preferences, learning new skills, and unlearning harmful behavior is an important problem. Search-based methods, such as Best-of-N or Monte-Carlo Tree Search, are performant, but impractical for LLM adaptation due to their high inference cost. On the other hand, using Reinforcement Learning (RL) for adaptation is computationally efficient, but performs worse due to the optimization challenges in co-training the value function and the policy. We present a new framework for reward optimization, Value Augmented Sampling (VAS), that can maximize different reward functions using data sampled from only the initial, frozen LLM. VAS solves for the optimal reward-maximizing policy without co-training the policy and the value function, making the optimization stable, outperforming established baselines, such as PPO and DPO, on standard benchmarks, and achieving comparable results to Best-of-128 with lower inference cost. Unlike existing RL methods that require changing the weights of the LLM, VAS does not require access to the weights of the pre-trained LLM. Thus, it can even adapt LLMs (e.g., ChatGPT), which are available only as APIs. In addition, our algorithm unlocks the new capability of composing several rewards and controlling the extent of each one during deployment time, paving the road ahead for the future of aligned, personalized LLMs.
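Decoding under VAS can be pictured as re-ranking the frozen model's top candidates by a separately trained value function; the top-k restriction, weight `beta`, and `value_fn` interface below are illustrative assumptions, not the paper's exact sampler.

```python
# Sketch of value-augmented next-token sampling over a frozen LLM's logits.
import torch

def vas_next_token(logits, state, value_fn, beta=1.0, top_k=20):
    topv, topi = logits.topk(top_k)               # only score top-k candidates
    values = torch.stack([value_fn(state, t) for t in topi])
    probs = torch.softmax(topv + beta * values, dim=0)  # reward-shifted logits
    return topi[torch.multinomial(probs, 1)]      # sample the augmented policy
```

Because only `value_fn` is trained, the base LLM can sit behind an API, and several value functions can in principle be summed with per-reward weights at deployment time, as the abstract describes.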
https://arxiv.org/abs/2405.06639
Emotions guide our decision making process and yet have been little explored in practical ethical decision making scenarios. In this challenge, we explore emotions and how they can influence ethical decision making in a home robot context: which fetch requests should a robot execute, and why or why not? We discuss, in particular, two aspects of emotion: (1) somatic markers: objects to be retrieved are tagged as negative (dangerous, e.g. knives or mind-altering, e.g. medicine with overdose potential), providing a quick heuristic for where to focus attention to avoid the classic Frame Problem of artificial intelligence, (2) emotion inference: users' valence and arousal levels are taken into account in defining how and when a robot should respond to a human's requests, e.g. to carefully consider giving dangerous items to users experiencing intense emotions. Our emotion-based approach builds a foundation for the primary consideration of Safety, and is complemented by policies that support overriding based on Context (e.g. age of user, allergies) and Privacy (e.g. administrator settings). Transparency is another key aspect of our solution. Our solution is defined using behaviour trees, towards an implementable design that can provide reasoning information in real-time.
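A toy sketch of the decision logic, flattened from a behaviour tree into one function for brevity, shows how a somatic marker gates the emotional and contextual checks; the thresholds, fields, and refusal messages are all hypothetical.

```python
# Toy fetch-request policy: somatic markers trigger emotion/context checks.
def fetch_decision(obj, user):
    if obj.get("marker") == "negative":            # somatic marker heuristic
        if user["arousal"] > 0.7 or user["valence"] < -0.5:
            return "refuse: intense emotional state with a flagged item"
        if obj["name"] in user.get("allergies", []):
            return "refuse: context override (allergy policy)"
    return "fetch"

print(fetch_decision({"name": "knife", "marker": "negative"},
                     {"arousal": 0.9, "valence": -0.6}))
```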
https://arxiv.org/abs/2405.06543
The technology of autonomous driving is currently attracting a great deal of interest in both research and industry. In this paper, we present a deep learning dual-model solution that uses two deep neural networks for combined braking and steering in autonomous vehicles. Steering control is achieved by applying the NVIDIA's PilotNet model to predict the steering wheel angle, while braking control relies on the use of MobileNet SSD. Both models rely on a single front-facing camera for image input. The MobileNet SSD model is suitable for devices with constrained resources, whereas PilotNet struggles to operate efficiently on smaller devices with limited resources. To make it suitable for such devices, we modified the PilotNet model using our own original network design and reduced the number of model parameters and its memory footprint by approximately 60%. The inference latency has also been reduced, making the model more suitable to operate on resource-constrained devices. The modified PilotNet model achieves similar loss and accuracy compared to the original PilotNet model. When evaluated in a simulated environment, both autonomous driving systems, one using the modified PilotNet model and the other using the original PilotNet model for steering, show similar levels of autonomous driving performance.
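The dual-model arrangement amounts to a per-frame control loop like the hedged sketch below; the model wrappers, class list, and braking threshold are assumptions, not the paper's configuration.

```python
# Sketch of the dual-model loop: steering regression plus detection-triggered braking.
def control_step(frame, steering_net, detector, brake_conf=0.6):
    angle = steering_net(frame)          # PilotNet-style steering-angle regression
    detections = detector(frame)         # MobileNet-SSD-style object detections
    brake = any(d["confidence"] > brake_conf and
                d["label"] in {"car", "person", "stop sign"}
                for d in detections)
    throttle = 0.0 if brake else 0.5     # fixed cruise throttle when clear
    return angle, throttle, brake
```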
https://arxiv.org/abs/2405.06473
Semantic mapping with Bayesian Kernel Inference (BKI) has shown promise in providing a richer understanding of environments by effectively leveraging local spatial information. However, existing methods face challenges in constructing accurate semantic maps or reliable uncertainty maps in perceptually challenging environments due to unreliable semantic predictions. To address this issue, we propose an evidential semantic mapping framework, which integrates the evidential reasoning of the Dempster-Shafer Theory of Evidence (DST) into the entire mapping pipeline by adopting Evidential Deep Learning (EDL) and Dempster's rule of combination. Additionally, an extended belief is devised to incorporate local spatial information based on its uncertainty during the mapping process. Comprehensive experiments across various off-road datasets demonstrate that our framework enhances the reliability of uncertainty maps, consistently outperforming existing methods in scenes with high perceptual uncertainties while showing semantic accuracy comparable to the best-performing semantic mapping techniques.
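The fusion step at the heart of the framework, Dempster's rule of combination, is standard and easy to sketch; the two-hypothesis example is ours, and per-cell evidence in the actual system comes from EDL outputs rather than hand-set masses.

```python
# Dempster's rule of combination for two mass functions over one frame of
# discernment; hypotheses are frozensets, masses sum to 1 per source.
def dempster_combine(m1, m2):
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb              # mass assigned to the empty set
    k = 1.0 - conflict                           # assumes sources not fully conflicting
    return {h: w / k for h, w in combined.items()}

# fusing two noisy semantic readings of one map cell:
fused = dempster_combine(
    {frozenset({"road"}): 0.7, frozenset({"road", "grass"}): 0.3},
    {frozenset({"road"}): 0.5, frozenset({"grass"}): 0.5})
```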
https://arxiv.org/abs/2405.06265
Geometry Problem Solving (GPS), which is a classic and challenging math problem, has attracted much attention in recent years. It requires a solver to comprehensively understand both text and diagram, master essential geometry knowledge, and appropriately apply it in reasoning. However, existing works follow a paradigm of neural machine translation and only focus on enhancing the capability of encoders, which neglects the essential characteristics of human geometry reasoning. In this paper, inspired by dual-process theory, we propose a Dual-Reasoning Geometry Solver (DualGeoSolver) to simulate the dual-reasoning process of humans for GPS. Specifically, we construct two systems in DualGeoSolver, namely Knowledge System and Inference System. Knowledge System controls an implicit reasoning process, which is responsible for providing diagram information and geometry knowledge according to a step-wise reasoning goal generated by Inference System. Inference System conducts an explicit reasoning process, which specifies the goal in each reasoning step and applies the knowledge to generate program tokens for resolving it. The two systems carry out the above process iteratively, which behaves more in line with human cognition. We conduct extensive experiments on two benchmark datasets, GeoQA and GeoQA+. The results demonstrate the superiority of DualGeoSolver in both solving accuracy and robustness from explicitly modeling human reasoning process and knowledge application.
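The two-system interaction can be summarized as a short iterative loop; every interface below (`next_goal`, `retrieve`, `emit_tokens`, the stop signal) is an illustrative stand-in for DualGeoSolver's components, not the published implementation.

```python
# Sketch of the iterative dual-reasoning loop between the two systems.
def dual_geo_solve(problem, inference_sys, knowledge_sys, max_steps=20):
    program, state = [], problem
    for _ in range(max_steps):
        goal = inference_sys.next_goal(state, program)  # explicit reasoning step
        if goal is None:                                # solver signals completion
            break
        facts = knowledge_sys.retrieve(goal)            # diagram + geometry knowledge
        program += inference_sys.emit_tokens(goal, facts)
        state = inference_sys.update(state, program)
    return program                                      # executable program tokens
```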
https://arxiv.org/abs/2405.06232
Achieving gender equality is a pivotal factor in realizing the UN's Global Goals for Sustainable Development. Gender bias studies work towards this and rely on name-based gender inference tools to assign individual gender labels when gender information is unavailable. However, these tools often inaccurately predict gender for Chinese Pinyin names, leading to potential bias in such studies. With the growing participation of Chinese in international activities, this situation is becoming more severe. Specifically, current tools focus on pronunciation (Pinyin) information, neglecting the fact that the latent connections between Pinyin and the underlying Chinese characters (Hanzi) convey critical information. As a first effort, we formulate the Pinyin name-gender guessing problem and design a Multi-Task Learning Network assisted by Knowledge Distillation that enables the Pinyin embeddings in the model to possess semantic features of Chinese characters and to learn gender information from Chinese character names. Our open-sourced method surpasses commercial name-gender guessing tools by a relative 9.70% to 20.08%, and also outperforms the state-of-the-art algorithms.
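A hedged sketch of the distillation component: a teacher trained on Hanzi names supplies embedding targets so the Pinyin embeddings absorb character semantics while the gender head trains jointly. The MSE form and weight `alpha` are illustrative assumptions, not the paper's loss.

```python
# Sketch of knowledge distillation into Pinyin embeddings plus the gender task.
import torch.nn.functional as F

def pinyin_gender_loss(pinyin_emb, hanzi_emb, gender_logits, gender_labels,
                       alpha=0.5):
    distill = F.mse_loss(pinyin_emb, hanzi_emb.detach())  # match Hanzi teacher
    task = F.cross_entropy(gender_logits, gender_labels)  # gender prediction
    return task + alpha * distill                         # multi-task objective
```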
https://arxiv.org/abs/2405.06221
Federated Learning (FL) is a decentralized machine learning method that enables participants to collaboratively train a model without sharing their private data. Despite its privacy and scalability benefits, FL is susceptible to backdoor attacks, where adversaries poison the local training data of a subset of clients using a backdoor trigger, aiming to make the aggregated model produce malicious results when the same backdoor condition is met by an inference-time input. Existing backdoor attacks in FL suffer from common deficiencies: fixed trigger patterns and reliance on the assistance of model poisoning. State-of-the-art defenses based on Byzantine-robust aggregation exhibit a good defense performance on these attacks because of the significant divergence between malicious and benign model updates. To effectively conceal malicious model updates among benign ones, we propose DPOT, a backdoor attack strategy in FL that dynamically constructs backdoor objectives by optimizing a backdoor trigger, making backdoor data have minimal effect on model updates. We provide theoretical justifications for DPOT's attacking principle and display experimental results showing that DPOT, via only a data-poisoning attack, effectively undermines state-of-the-art defenses and outperforms existing backdoor attack techniques on various datasets.
https://arxiv.org/abs/2405.06206