A unified and versatile LiDAR segmentation model with strong robustness and generalizability is desirable for safe autonomous driving perception. This work presents M3Net, a one-of-a-kind framework for fulfilling multi-task, multi-dataset, multi-modality LiDAR segmentation in a universal manner using just a single set of parameters. To better exploit data volume and diversity, we first combine large-scale driving datasets acquired by different types of sensors from diverse scenes and then conduct alignments in three spaces, namely data, feature, and label spaces, during the training. As a result, M3Net is capable of taming heterogeneous data for training state-of-the-art LiDAR segmentation models. Extensive experiments on twelve LiDAR segmentation datasets verify the effectiveness of our approach. Notably, using a shared set of parameters, M3Net achieves 75.1%, 83.1%, and 72.4% mIoU scores, respectively, on the official benchmarks of SemanticKITTI, nuScenes, and Waymo Open.
https://arxiv.org/abs/2405.01538
Art reinterpretation is the practice of creating a variation of a reference work, making a paired artwork that exhibits a distinct artistic style. We ask if such an image pair can be used to customize a generative model to capture the demonstrated stylistic difference. We propose Pair Customization, a new customization method that learns stylistic difference from a single image pair and then applies the acquired style to the generation process. Unlike existing methods that learn to mimic a single concept from a collection of images, our method captures the stylistic difference between paired images. This allows us to apply a stylistic change without overfitting to the specific image content in the examples. To address this new task, we employ a joint optimization method that explicitly separates the style and content into distinct LoRA weight spaces. We optimize these style and content weights to reproduce the style and content images while encouraging their orthogonality. During inference, we modify the diffusion process via a new style guidance based on our learned weights. Both qualitative and quantitative experiments show that our method can effectively learn style while avoiding overfitting to image content, highlighting the potential of modeling such stylistic differences from a single image pair.
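The joint optimization above separates style and content into distinct LoRA weight spaces while encouraging their orthogonality. A minimal numpy sketch of one way such an orthogonality penalty could look — the rank-1 factor shapes and the squared-Frobenius-inner-product form are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def lora_delta(A, B):
    """Low-rank weight update Delta W = B @ A, as in standard LoRA."""
    return B @ A

def orthogonality_penalty(delta_style, delta_content):
    """Penalize overlap between the style and content weight updates.

    The squared Frobenius inner product is zero iff the two deltas are
    orthogonal when viewed as vectors in weight space.
    """
    return float(np.sum(delta_style * delta_content) ** 2)

# Toy example: rank-1 LoRA factors on a 4x4 weight matrix.
rng = np.random.default_rng(0)
A_style, B_style = rng.normal(size=(1, 4)), rng.normal(size=(4, 1))
A_content, B_content = rng.normal(size=(1, 4)), rng.normal(size=(4, 1))

penalty = orthogonality_penalty(
    lora_delta(A_style, B_style), lora_delta(A_content, B_content)
)
```

In a full training loop this penalty would be added to the style and content reconstruction losses so the two LoRA subspaces stay disentangled.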
https://arxiv.org/abs/2405.01536
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgments. Moreover, it is capable of processing both direct assessment and pairwise ranking formats paired with user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at this https URL.
https://arxiv.org/abs/2405.01535
Large Language Models (LLMs) have been shown to be capable of performing high-level planning for long-horizon robotics tasks, yet existing methods require access to a pre-defined skill library (e.g. picking, placing, pulling, pushing, navigating). However, LLM planning does not address how to design or learn those behaviors, which remains challenging particularly in long-horizon settings. Furthermore, for many tasks of interest, the robot needs to be able to adjust its behavior in a fine-grained manner, requiring the agent to be capable of modifying low-level control actions. Can we instead use the internet-scale knowledge from LLMs for high-level policies, guiding reinforcement learning (RL) policies to efficiently solve robotic control tasks online without requiring a pre-determined set of skills? In this paper, we propose Plan-Seq-Learn (PSL): a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control for solving long-horizon robotics tasks from scratch. We demonstrate that PSL achieves state-of-the-art results on over 25 challenging robotics tasks with up to 10 stages. PSL solves long-horizon tasks from raw visual input spanning four benchmarks at success rates of over 85%, outperforming language-based, classical, and end-to-end approaches. Video results and code at this https URL.
https://arxiv.org/abs/2405.01534
The advances in multimodal large language models (MLLMs) have led to growing interest in LLM-based autonomous driving agents to leverage their strong reasoning capabilities. However, capitalizing on MLLMs' strong reasoning capabilities for improved planning behavior is challenging since planning requires full 3D situational awareness beyond 2D reasoning. To address this challenge, our work proposes a holistic framework for strong alignment between agent models and 3D driving tasks. Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D before feeding them into an LLM. This query-based representation allows us to jointly encode dynamic objects and static map elements (e.g., traffic lanes), providing a condensed world model for perception-action alignment in 3D. We further propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model with comprehensive visual question-answering (VQA) tasks, including scene description, traffic regulation, 3D grounding, counterfactual reasoning, decision making and planning. Extensive studies show the effectiveness of the proposed architecture as well as the importance of the VQA tasks for reasoning and planning in complex 3D scenes.
https://arxiv.org/abs/2405.01533
Concept Bottleneck Models (CBMs) ground image classification on human-understandable concepts to allow for interpretable model decisions. Crucially, the CBM design inherently allows for human interventions, in which expert users are given the ability to modify potentially misaligned concept choices to influence the decision behavior of the model in an interpretable fashion. However, existing approaches often require numerous human interventions per image to achieve strong performance, posing practical challenges in scenarios where obtaining human feedback is expensive. In this paper, we find that this is noticeably driven by an independent treatment of concepts during intervention, wherein a change of one concept does not influence the use of other ones in the model's final decision. To address this issue, we introduce a trainable concept intervention realignment module, which leverages concept relations to realign concept assignments post-intervention. Across standard, real-world benchmarks, we find that concept realignment can significantly improve intervention efficacy; significantly reducing the number of interventions needed to reach a target classification performance or concept prediction accuracy. In addition, it easily integrates into existing concept-based architectures without requiring changes to the models themselves. This reduced cost of human-model collaboration is crucial to enhancing the feasibility of CBMs in resource-constrained environments.
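The paper's realignment module is trainable; as a rough intuition for what post-intervention realignment does, here is a hypothetical numpy sketch in which a concept-relation matrix propagates a single expert intervention to correlated concepts. The linear update rule and the relation matrix are our illustrative assumptions, not the paper's learned module:

```python
import numpy as np

def realign_concepts(concepts, intervened_idx, intervened_val, relation):
    """Propagate one expert intervention to related concepts.

    concepts: (C,) current concept predictions in [0, 1].
    relation: (C, C) matrix of concept relations (hypothetical stand-in
    for the trainable realignment module described in the abstract).
    """
    updated = concepts.copy()
    delta = intervened_val - concepts[intervened_idx]
    updated += delta * relation[intervened_idx]   # shift correlated concepts
    updated[intervened_idx] = intervened_val      # keep the expert's value exactly
    return np.clip(updated, 0.0, 1.0)
```

The point of contrast with independent treatment: here, correcting one concept also moves the concepts it is related to, so fewer interventions are needed to reach a target accuracy.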
https://arxiv.org/abs/2405.01531
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation: interacting with unseen objects in novel scenes without test-time adaptation. While typical approaches rely on a large amount of demonstration data for such generalization, we propose an approach that leverages web videos to predict plausible interaction plans and learns a task-agnostic transformation to obtain robot actions in the real world. Our framework, Track2Act, predicts tracks of how points in an image should move in future time-steps based on a goal, and can be trained with diverse videos on the web including those of humans and robots manipulating everyday objects. We use these 2D track predictions to infer a sequence of rigid transforms of the object to be manipulated, and obtain robot end-effector poses that can be executed in an open-loop manner. We then refine this open-loop plan by predicting residual actions through a closed-loop policy trained with a few embodiment-specific demonstrations. We show that this approach of combining scalably learned track prediction with a residual policy requiring minimal in-domain robot-specific data enables zero-shot robot manipulation, and present a wide array of real-world robot manipulation results across unseen tasks, objects, and scenes. this https URL
https://arxiv.org/abs/2405.01527
Alignment is a standard procedure to fine-tune pre-trained large language models (LLMs) to follow natural language instructions and serve as helpful AI assistants. We have observed, however, that the conventional alignment process fails to enhance the factual accuracy of LLMs, and often leads to the generation of more false facts (i.e. hallucination). In this paper, we study how to make the LLM alignment process more factual, by first identifying factors that lead to hallucination in both alignment steps: supervised fine-tuning (SFT) and reinforcement learning (RL). In particular, we find that training the LLM on new knowledge or unfamiliar texts can encourage hallucination. This makes SFT less factual as it trains on human-labeled data that may be novel to the LLM. Furthermore, reward functions used in standard RL can also encourage hallucination, because they guide the LLM to provide more helpful responses on a diverse set of instructions, often preferring longer and more detailed responses. Based on these observations, we propose factuality-aware alignment, comprised of factuality-aware SFT and factuality-aware RL through direct preference optimization. Experiments show that our proposed factuality-aware alignment guides LLMs to output more factual responses while maintaining instruction-following capability.
https://arxiv.org/abs/2405.01525
Generalization to unseen data remains poorly understood for deep learning classification and foundation models. How can one assess the ability of networks to adapt to new or extended versions of their input space in the spirit of few-shot learning, out-of-distribution generalization, and domain adaptation? Which layers of a network are likely to generalize best? We provide a new method for evaluating the capacity of networks to represent a sampled domain, regardless of whether the network has been trained on all classes in the domain. Our approach is the following: after fine-tuning state-of-the-art pre-trained models for visual classification on a particular domain, we assess their performance on data from related but distinct variations in that domain. Generalization power is quantified as a function of the latent embeddings of unseen data from intermediate layers for both unsupervised and supervised settings. Working throughout all stages of the network, we find that (i) high classification accuracy does not imply high generalizability; and (ii) deeper layers in a model do not always generalize the best, which has implications for pruning. Since the trends observed across datasets are largely consistent, we conclude that our approach reveals (a function of) the intrinsic capacity of the different layers of a model to generalize.
https://arxiv.org/abs/2405.01524
The transformer structure employed in large language models (LLMs), as a specialized category of deep neural networks (DNNs) featuring attention mechanisms, stands out for its ability to identify and highlight the most relevant aspects of input data. Such a capability is particularly beneficial in addressing a variety of communication challenges, notably in the realm of semantic communication where proper encoding of the relevant data is critical especially in systems with limited bandwidth. In this work, we employ vision transformers specifically for the purpose of compression and compact representation of the input image, with the goal of preserving semantic information throughout the transmission process. Through the use of the attention mechanism inherent in transformers, we create an attention mask. This mask effectively prioritizes critical segments of images for transmission, ensuring that the reconstruction phase focuses on key objects highlighted by the mask. Our methodology significantly improves the quality of semantic communication and optimizes bandwidth usage by encoding different parts of the data in accordance with their semantic information content, thus enhancing overall efficiency. We evaluate the effectiveness of our proposed framework using the TinyImageNet dataset, focusing on both reconstruction quality and accuracy. Our evaluation results demonstrate that our framework successfully preserves semantic information, even when only a fraction of the encoded data is transmitted, according to the intended compression rates.
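A toy numpy sketch of the masking idea: rank image patches by their attention score and transmit only the top fraction, with the receiver zero-filling untransmitted patches. The function names, the top-k selection rule, and the zero-fill receiver are illustrative assumptions, not the paper's exact pipeline:

```python
import numpy as np

def attention_mask_topk(attn_scores, keep_ratio):
    """Binary transmission mask keeping the highest-attention patches.

    attn_scores: (H, W) grid of per-patch attention scores.
    keep_ratio: fraction of patches to transmit (the compression rate).
    """
    k = max(1, int(round(keep_ratio * attn_scores.size)))
    top = np.argsort(attn_scores.ravel())[::-1][:k]
    mask = np.zeros(attn_scores.size, dtype=bool)
    mask[top] = True
    return mask.reshape(attn_scores.shape)

def transmit(patches, mask):
    """Send only masked patches; untransmitted patches arrive as zeros."""
    return np.where(mask[..., None], patches, 0.0)
```

A learned decoder would then reconstruct the full image from the sparse, semantically prioritized patches.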
https://arxiv.org/abs/2405.01521
Various approaches for aligning language models have been proposed, including supervised fine-tuning, RLHF, and direct optimization methods such as DPO. Although DPO has rapidly gained popularity due to its straightforward training process and competitive results, there is an open question of whether there remain practical advantages of using a discriminator, like a reward model, to evaluate responses. We propose D2PO, discriminator-guided DPO, an approach for the online setting where preferences are being collected throughout learning. As we collect gold preferences, we use these not only to train our policy, but also to train a discriminative response evaluation model to silver-label even more synthetic data for policy training. Exploring this approach across a set of diverse tasks, including a realistic chat setting, we find that our approach leads to higher-quality outputs compared to DPO with the same data budget, and greater efficiency in terms of preference data requirements. Furthermore, we show conditions under which silver labeling is most helpful: it is most effective when training the policy with DPO, outperforming traditional PPO, and benefits from maintaining a discriminator separate from the policy model.
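The silver-labeling step described above can be sketched as a small loop: sample response pairs from the current policy and let the trained discriminator decide which side is "chosen". The function signatures below (`discriminator(prompt, response) -> score`, `sample_fn(prompt) -> response`) are assumed interfaces for illustration, not the paper's API:

```python
def silver_label(discriminator, prompts, sample_fn, n_pairs_per_prompt=1):
    """Silver-label synthetic response pairs with a learned discriminator.

    Assumed interfaces (illustrative):
      discriminator(prompt, response) -> scalar quality score
      sample_fn(prompt) -> one response sampled from the current policy
    Returns (prompt, chosen, rejected) preference triples usable for DPO.
    """
    triples = []
    for prompt in prompts:
        for _ in range(n_pairs_per_prompt):
            a, b = sample_fn(prompt), sample_fn(prompt)
            # The higher-scoring response becomes the "chosen" side.
            if discriminator(prompt, a) >= discriminator(prompt, b):
                triples.append((prompt, a, b))
            else:
                triples.append((prompt, b, a))
    return triples
```

These silver triples are then mixed with the gold human preferences when updating the policy with DPO.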
https://arxiv.org/abs/2405.01511
Adaptive Cruise Control (ACC) can automatically adjust the speed of the ego vehicle to maintain a safe distance from the vehicle ahead. The primary purpose of this research is to use cutting-edge computing approaches to locate and track vehicles in real time under various conditions to achieve a safe ACC. The paper examines the extension of ACC employing depth cameras and radar sensors within Autonomous Vehicles (AVs) to respond in real time to changing weather conditions, using the Car Learning to Act (CARLA) simulation platform at noon. The ego vehicle controller's decision to accelerate or decelerate depends on the speed of the leading vehicle and the safe distance from that vehicle. Simulation results show that Proportional-Integral-Derivative (PID) control of autonomous vehicles using a depth camera and radar sensors reduces the speed of the leading vehicle and the ego vehicle when it rains. In addition, longer travel time was observed for both vehicles in rainy conditions than in dry conditions. Also, PID control prevents rear-end collisions with the leading vehicle.
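The PID gap-keeping law underlying the controller above can be sketched in a few lines; the gains, timestep, and error convention (error = measured gap minus safe gap, positive error meaning it is safe to accelerate) are illustrative assumptions, not the paper's tuned values:

```python
class PID:
    """Minimal discrete-time PID controller for ACC gap-keeping."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        """One control step; returns a throttle/brake command."""
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Usage: error = measured_gap - safe_gap; a positive command accelerates,
# a negative command brakes toward the safe following distance.
```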
https://arxiv.org/abs/2405.01504
Computer-aided segmentation methods can assist medical personnel in improving diagnostic outcomes. While recent advancements like UNet and its variants have shown promise, they face a critical challenge: balancing accuracy with computational efficiency. Shallow encoder architectures in UNets often struggle to capture crucial spatial features, leading to inaccurate and sparse segmentation. To address this limitation, we propose a novel Progressive Attention based Mobile UNet (PAM-UNet) architecture. The inverted residual (IR) blocks in PAM-UNet help maintain a lightweight framework, while layerwise Progressive Luong Attention (PLA) promotes precise segmentation by directing attention toward regions of interest during synthesis. Our approach prioritizes both accuracy and speed, achieving a commendable balance with a mean IoU of 74.65 and a Dice score of 82.87, while requiring only 1.32 floating-point operations per second (FLOPS) on the Liver Tumor Segmentation Benchmark (LiTS) 2017 dataset. These results highlight the importance of developing efficient segmentation models to accelerate the adoption of AI in clinical practice.
https://arxiv.org/abs/2405.01503
Traditionally, natural language processing (NLP) models often use a rich set of features created by linguistic expertise, such as semantic representations. However, in the era of large language models (LLMs), more and more tasks are turned into generic, end-to-end sequence generation problems. In this paper, we investigate the question: what is the role of semantic representations in the era of LLMs? Specifically, we investigate the effect of Abstract Meaning Representation (AMR) across five diverse NLP tasks. We propose an AMR-driven chain-of-thought prompting method, which we call AMRCoT, and find that it generally hurts performance more than it helps. To investigate what AMR may have to offer on these tasks, we conduct a series of analysis experiments. We find that it is difficult to predict which input examples AMR may help or hurt on, but errors tend to arise with multi-word expressions, named entities, and in the final inference step where the LLM must connect its reasoning over the AMR to its prediction. We recommend focusing on these areas for future work in semantic representations for LLMs. Our code: this https URL.
https://arxiv.org/abs/2405.01502
Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Based on the T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing over unintentional regions that are beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we are compelling the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively. The code will be released at this https URL.
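The core refinement step, using a localization prior to concentrate a token's cross-attention map, can be illustrated with a minimal numpy sketch. The multiplicative masking and renormalization below are our simplified assumption of how a segmentation or box prior could bias an attention map, not the paper's exact update:

```python
import numpy as np

def refine_cross_attention(attn, loc_prior, eps=1e-8):
    """Bias one token's cross-attention map toward a localization prior.

    attn: (H, W) non-negative cross-attention map for a noun token.
    loc_prior: (H, W) binary mask from a segmentation map or bounding box.
    Returns a renormalized map concentrated on the prior region.
    """
    refined = attn * loc_prior
    total = refined.sum()
    if total < eps:  # prior misses all attention mass: fall back to the original
        return attn / max(attn.sum(), eps)
    return refined / total
```

During denoising, applying such a refinement per noun token keeps edits localized to the intended object instead of spilling into unrelated regions.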
https://arxiv.org/abs/2405.01496
Federated learning (FL) enables multiple clients to train models collectively while preserving data privacy. However, FL faces challenges in terms of communication cost and data heterogeneity. One-shot federated learning has emerged as a solution by reducing communication rounds, improving efficiency, and providing better security against eavesdropping attacks. Nevertheless, data heterogeneity remains a significant challenge, impacting performance. This work explores the effectiveness of diffusion models in one-shot FL, demonstrating their applicability in addressing data heterogeneity and improving FL performance. Additionally, we investigate the utility of our diffusion model approach, FedDiff, compared to other one-shot FL methods under differential privacy (DP). Furthermore, to improve generated sample quality under DP settings, we propose a pragmatic Fourier Magnitude Filtering (FMF) method, enhancing the effectiveness of generated data for global model training.
https://arxiv.org/abs/2405.01494
While most research on controllable text generation has focused on steering base Language Models, the emerging instruction-tuning and prompting paradigm offers an alternate approach to controllability. We compile and release ConGenBench, a testbed of 17 different controllable generation tasks, using a subset of it to benchmark the performance of 9 different baselines and methods on Instruction-tuned Language Models. To our surprise, we find that prompting-based approaches outperform controllable text generation methods on most datasets and tasks, highlighting a need for research on controllable text generation with Instruction-tuned Language Models specifically. Prompt-based approaches match human performance on most stylistic tasks while lagging on structural tasks, foregrounding a need to study more varied constraints and more challenging stylistic tasks. To facilitate such research, we provide an algorithm that uses only a task dataset and a Large Language Model with in-context capabilities to automatically generate a constraint dataset. This method eliminates the field's dependence on pre-curated constraint datasets, hence vastly expanding the range of constraints that can be studied in the future.
https://arxiv.org/abs/2405.01490
Recent years have witnessed a great array of large multimodal models (LMMs) that effectively solve single-image vision-language tasks. However, their ability to solve multi-image visual language tasks has yet to be improved. The existing multi-image LMMs (e.g., OpenFlamingo, Emu, Idefics) mostly gain their multi-image ability through pre-training on hundreds of millions of noisy interleaved image-text data from the web, which is neither efficient nor effective. In this paper, we aim at building strong multi-image LMMs via instruction tuning with academic-level resources. Therefore, we meticulously construct Mantis-Instruct, containing 721K instances from 14 multi-image datasets. We design Mantis-Instruct to cover different multi-image skills like co-reference, reasoning, comparing, and temporal understanding. We combine Mantis-Instruct with several single-image visual-language datasets to train our model Mantis to handle any interleaved image-text inputs. We evaluate the trained Mantis on five multi-image benchmarks and eight single-image benchmarks. Though only requiring academic-level resources (i.e., 36 hours on 16xA100-40G), Mantis-8B achieves state-of-the-art performance on all the multi-image benchmarks and beats the existing best multi-image LMM Idefics2-8B by an average of 9 absolute points. We observe that Mantis performs equivalently well on the held-in and held-out evaluation benchmarks. We further evaluate Mantis on single-image benchmarks and demonstrate that Mantis can maintain strong single-image performance on par with CogVLM and Emu2. Our results are particularly encouraging as they show that low-cost instruction tuning is indeed much more effective than intensive pre-training for building multi-image LMMs.
https://arxiv.org/abs/2405.01483
Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs which often contain tens or hundreds of billions of parameters. We create NeMo-Aligner, a toolkit for model alignment that can efficiently scale to using hundreds of GPUs for training. NeMo-Aligner comes with highly optimized and scalable implementations for major paradigms of model alignment such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), SteerLM, and Self-Play Fine-Tuning (SPIN). Additionally, our toolkit supports running most of the alignment techniques in a Parameter Efficient Fine-Tuning (PEFT) setting. NeMo-Aligner is designed for extensibility, allowing support for other alignment techniques with minimal effort. It is open-sourced with Apache 2.0 License and we invite community contributions at this https URL.
https://arxiv.org/abs/2405.01481
Large Vision-Language models (VLMs) have demonstrated strong reasoning capabilities in tasks requiring a fine-grained understanding of literal images and text, such as visual question-answering or visual entailment. However, there has been little exploration of these models' capabilities when presented with images and captions containing figurative phenomena such as metaphors or humor, the meaning of which is often implicit. To close this gap, we propose a new task and a high-quality dataset: Visual Figurative Language Understanding with Textual Explanations (V-FLUTE). We frame the visual figurative language understanding problem as an explainable visual entailment task, where the model has to predict whether the image (premise) entails a claim (hypothesis) and justify the predicted label with a textual explanation. Using a human-AI collaboration framework, we build a high-quality dataset, V-FLUTE, that contains 6,027 <image, claim, label, explanation> instances spanning five diverse multimodal figurative phenomena: metaphors, similes, idioms, sarcasm, and humor. The figurative phenomena can be present either in the image, the caption, or both. We further conduct both automatic and human evaluations to assess current VLMs' capabilities in understanding figurative phenomena.
https://arxiv.org/abs/2405.01474