In this work, we recover the underlying 3D structure of non-geometrically consistent scenes. We focus our analysis on hand-drawn images from cartoons and anime. Many cartoons are created by artists without a 3D rendering engine, which means that any new image of a scene is hand-drawn. The hand-drawn images are usually faithful representations of the world, but only in a qualitative sense, since it is difficult for humans to draw multiple perspectives of an object or scene in a 3D-consistent way. Nevertheless, people can easily perceive 3D scenes from inconsistent inputs! In this work, we correct for 2D drawing inconsistencies to recover a plausible 3D structure such that the newly warped drawings are consistent with each other. Our pipeline consists of a user-friendly annotation tool, camera pose estimation, and image deformation to recover a dense structure. Our method warps images to obey a perspective camera model, enabling our aligned results to be plugged into novel-view synthesis reconstruction methods to experience cartoons from viewpoints never drawn before. Our project page is https://toon3d.studio/.
https://arxiv.org/abs/2405.10320
Vector graphics are widely used in digital art and highly favored by designers due to their scalability and layer-wise properties. However, the process of creating and editing vector graphics requires creativity and design expertise, making it a time-consuming task. Recent advancements in text-to-vector (T2V) generation have aimed to make this process more accessible. However, existing T2V methods directly optimize control points of vector graphics paths, often resulting in intersecting or jagged paths due to the lack of geometry constraints. To overcome these limitations, we propose a novel neural path representation by designing a dual-branch Variational Autoencoder (VAE) that learns the path latent space from both sequence and image modalities. By optimizing the combination of neural paths, we can incorporate geometric constraints while preserving expressivity in generated SVGs. Furthermore, we introduce a two-stage path optimization method to improve the visual and topological quality of generated SVGs. In the first stage, a pre-trained text-to-image diffusion model guides the initial generation of complex vector graphics through the Variational Score Distillation (VSD) process. In the second stage, we refine the graphics using a layer-wise image vectorization strategy to achieve clearer elements and structure. We demonstrate the effectiveness of our method through extensive experiments and showcase various applications. The project page is this https URL.
https://arxiv.org/abs/2405.10317
Visual In-Context Learning (ICL) has emerged as a promising research area due to its capability to accomplish various tasks with limited example pairs through analogical reasoning. However, training-based visual ICL has limitations in its ability to generalize to unseen tasks and requires the collection of a diverse task dataset. On the other hand, existing methods in the inference-based visual ICL category solely rely on textual prompts, which fail to capture fine-grained contextual information from given examples and can be time-consuming when converting from images to text prompts. To address these challenges, we propose Analogist, a novel inference-based visual ICL approach that exploits both visual and textual prompting techniques using a text-to-image diffusion model pretrained for image inpainting. For visual prompting, we propose a self-attention cloning (SAC) method to guide the fine-grained structural-level analogy between image examples. For textual prompting, we leverage GPT-4V's visual reasoning capability to efficiently generate text prompts and introduce a cross-attention masking (CAM) operation to enhance the accuracy of semantic-level analogy guided by text prompts. Our method is out-of-the-box and does not require fine-tuning or optimization. It is also generic and flexible, enabling a wide range of visual tasks to be performed in an in-context manner. Extensive experiments demonstrate the superiority of our method over existing approaches, both qualitatively and quantitatively.
https://arxiv.org/abs/2405.10316
Learning in simulation and transferring the learned policy to the real world has the potential to enable generalist robots. The key challenge of this approach is to address simulation-to-reality (sim-to-real) gaps. Previous methods often require domain-specific knowledge a priori. We argue that a straightforward way to obtain such knowledge is by asking humans to observe and assist robot policy execution in the real world. The robots can then learn from humans to close various sim-to-real gaps. We propose TRANSIC, a data-driven approach to enable successful sim-to-real transfer based on a human-in-the-loop framework. TRANSIC allows humans to augment simulation policies to overcome various unmodeled sim-to-real gaps holistically through intervention and online correction. Residual policies can be learned from human corrections and integrated with simulation policies for autonomous execution. We show that our approach can achieve successful sim-to-real transfer in complex and contact-rich manipulation tasks such as furniture assembly. Through synergistic integration of policies learned in simulation and from humans, TRANSIC is effective as a holistic approach to addressing various, often coexisting sim-to-real gaps. It displays attractive properties such as scaling with human effort. Videos and code are available at this https URL
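The core idea of integrating a residual policy learned from human corrections with a base simulation policy can be sketched as follows. This is a minimal illustration under our own assumptions (function names, action format, and clipping range are hypothetical, not TRANSIC's actual interface):

```python
def combined_action(base_action, residual_action, clip=1.0):
    """Sum the simulation policy's action with a learned residual correction,
    clipping each dimension to an assumed actuator range [-clip, clip]."""
    return [max(-clip, min(clip, b + r)) for b, r in zip(base_action, residual_action)]

# Hypothetical 3-DoF action from the simulation policy...
base = [0.5, -0.2, 0.9]
# ...plus a residual correction learned from human interventions.
residual = [-0.1, 0.05, 0.3]
act = combined_action(base, residual)  # third dimension saturates at 1.0
```

In this additive scheme, the simulation policy supplies coarse behavior while the residual absorbs unmodeled sim-to-real gaps observed during human-in-the-loop correction.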
https://arxiv.org/abs/2405.10315
Advances in 3D reconstruction have enabled high-quality 3D capture, but require a user to collect hundreds to thousands of images to create a 3D scene. We present CAT3D, a method for creating anything in 3D by simulating this real-world capture process with a multi-view diffusion model. Given any number of input images and a set of target novel viewpoints, our model generates highly consistent novel views of a scene. These generated views can be used as input to robust 3D reconstruction techniques to produce 3D representations that can be rendered from any viewpoint in real-time. CAT3D can create entire 3D scenes in as little as one minute, and outperforms existing methods for single image and few-view 3D scene creation. See our project page for results and interactive demos at this https URL .
https://arxiv.org/abs/2405.10314
The evolution of artificial intelligence (AI) has profoundly impacted human society, driving significant advancements in multiple sectors. Yet, the escalating demands on AI have highlighted the limitations of AI's current offerings, catalyzing a movement towards Artificial General Intelligence (AGI). AGI, distinguished by its ability to execute diverse real-world tasks with efficiency and effectiveness comparable to human intelligence, reflects a paramount milestone in AI evolution. While existing works have summarized specific recent advancements of AI, they lack a comprehensive discussion of AGI's definitions, goals, and developmental trajectories. Different from existing survey papers, this paper delves into the pivotal questions of our proximity to AGI and the strategies necessary for its realization through extensive surveys, discussions, and original perspectives. We start by articulating the requisite capability frameworks for AGI, integrating the internal, interface, and system dimensions. As the realization of AGI requires more advanced capabilities and adherence to stringent constraints, we further discuss necessary AGI alignment technologies to harmonize these factors. Notably, we emphasize the importance of approaching AGI responsibly by first defining the key levels of AGI progression, followed by the evaluation framework that situates the status-quo, and finally giving our roadmap of how to reach the pinnacle of AGI. Moreover, to give tangible insights into the ubiquitous impact of the integration of AI, we outline existing challenges and potential pathways toward AGI in multiple domains. In sum, serving as a pioneering exploration into the current state and future trajectory of AGI, this paper aims to foster a collective comprehension and catalyze broader public discussions among researchers and practitioners on AGI.
https://arxiv.org/abs/2405.10313
In complex environments with large discrete action spaces, effective decision-making is critical in reinforcement learning (RL). Despite the widespread use of value-based RL approaches like Q-learning, they come with a computational burden, necessitating the maximization of a value function over all actions in each iteration. This burden becomes particularly challenging when addressing large-scale problems and using deep neural networks as function approximators. In this paper, we present stochastic value-based RL approaches which, in each iteration, as opposed to optimizing over the entire set of $n$ actions, only consider a variable stochastic set of a sublinear number of actions, possibly as small as $\mathcal{O}(\log(n))$. The presented stochastic value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this stochastic approach for both value-function updates and action selection. The theoretical convergence of Stochastic Q-learning is established, while an analysis of stochastic maximization is provided. Moreover, through empirical validation, we illustrate that the various proposed approaches outperform the baseline methods across diverse environments, including different control problems, achieving near-optimal average returns in significantly reduced time.
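The sublinear-subset maximization at the heart of these stochastic methods can be sketched in a few lines. This is a toy illustration under our own assumptions (the subset size ⌈log₂ n⌉ + 1 and function names are ours, not the paper's exact scheme):

```python
import math
import random

def stoch_max_q(q_values, rng, k=None):
    """Approximate argmax over n actions by evaluating only a random
    subset of roughly O(log n) actions instead of all n."""
    n = len(q_values)
    if k is None:
        k = max(1, math.ceil(math.log2(n)) + 1)  # assumed sublinear subset size
    subset = rng.sample(range(n), min(k, n))     # variable stochastic action set
    best = max(subset, key=lambda a: q_values[a])
    return best, q_values[best]

rng = random.Random(0)
q = [0.1 * a for a in range(1024)]   # 1024 actions; only ~11 are evaluated
action, value = stoch_max_q(q, rng)
```

The returned action is the best within the sampled subset, trading exactness of the max for a per-iteration cost that is logarithmic rather than linear in the action-space size.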
https://arxiv.org/abs/2405.10310
We are living in a three-dimensional space while moving forward through a fourth dimension: time. To allow artificial intelligence to develop a comprehensive understanding of such a 4D environment, we introduce 4D Panoptic Scene Graph (PSG-4D), a new representation that bridges the raw visual data perceived in a dynamic 4D world and high-level visual understanding. Specifically, PSG-4D abstracts rich 4D sensory data into nodes, which represent entities with precise location and status information, and edges, which capture the temporal relations. To facilitate research in this new area, we build a richly annotated PSG-4D dataset consisting of 3K RGB-D videos with a total of 1M frames, each of which is labeled with 4D panoptic segmentation masks as well as fine-grained, dynamic scene graphs. To solve PSG-4D, we propose PSG4DFormer, a Transformer-based model that can predict panoptic segmentation masks, track masks along the time axis, and generate the corresponding scene graphs via a relation component. Extensive experiments on the new dataset show that our method can serve as a strong baseline for future research on PSG-4D. In the end, we provide a real-world application example to demonstrate how we can achieve dynamic scene understanding by integrating a large language model into our PSG-4D system.
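The node-and-edge abstraction of PSG-4D can be sketched as a simple data structure: nodes carry an entity's category plus per-frame location, and edges record a relation over a frame interval. Field names and the example relation are our own hypothetical choices, not the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An entity in the 4D scene: a category plus a per-frame 3D trajectory."""
    node_id: int
    category: str
    trajectory: dict = field(default_factory=dict)  # frame index -> (x, y, z)

@dataclass
class Edge:
    """A temporal relation between two entities over a frame interval."""
    subject: int          # node_id of the subject entity
    obj: int              # node_id of the object entity
    relation: str
    span: tuple           # (start_frame, end_frame), inclusive

person = Node(0, "person", {0: (0.0, 0.0, 0.0), 1: (0.1, 0.0, 0.0)})
dog = Node(1, "dog", {0: (1.0, 0.0, 0.0)})
walks = Edge(subject=0, obj=1, relation="walking_with", span=(0, 1))
```

A dynamic scene graph is then just a collection of such nodes and edges, which is what a model like PSG4DFormer is asked to predict from 4D sensory input.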
https://arxiv.org/abs/2405.10305
Before deploying outputs from foundation models in high-stakes tasks, it is imperative to ensure that they align with human values. For instance, in radiology report generation, reports generated by a vision-language model must align with human evaluations before their use in medical decision-making. This paper presents Conformal Alignment, a general framework for identifying units whose outputs meet a user-specified alignment criterion. It is guaranteed that on average, a prescribed fraction of selected units indeed meet the alignment criterion, regardless of the foundation model or the data distribution. Given any pre-trained model and new units with model-generated outputs, Conformal Alignment leverages a set of reference data with ground-truth alignment status to train an alignment predictor. It then selects new units whose predicted alignment scores surpass a data-dependent threshold, certifying their corresponding outputs as trustworthy. Through applications to question answering and radiology report generation, we demonstrate that our method is able to accurately identify units with trustworthy outputs via lightweight training over a moderate amount of reference data. En route, we investigate the informativeness of various features in alignment prediction and combine them with standard models to construct the alignment predictor.
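The selection step, choosing a data-dependent threshold from reference data with known alignment status, can be sketched as follows. This is a deliberately simplified illustration (an empirical-precision rule with made-up scores), not the paper's actual conformal procedure or its formal guarantee:

```python
def selection_threshold(ref_scores, ref_aligned, alpha=0.1):
    """Pick the smallest score threshold such that, on reference data,
    at least (1 - alpha) of units scoring above it are truly aligned."""
    for t in sorted(set(ref_scores)):
        picked = [ok for s, ok in zip(ref_scores, ref_aligned) if s >= t]
        if picked and sum(picked) / len(picked) >= 1 - alpha:
            return t
    return float("inf")  # no threshold certifies; select nothing

# Hypothetical reference data: predicted alignment scores + ground truth.
ref_scores = [0.2, 0.4, 0.5, 0.7, 0.8, 0.9]
ref_aligned = [False, False, True, True, True, True]
t = selection_threshold(ref_scores, ref_aligned, alpha=0.1)

# New units are selected when their predicted score clears the threshold.
new_scores = [0.3, 0.75, 0.95]
selected = [i for i, s in enumerate(new_scores) if s >= t]
```

The full method replaces this empirical rule with a conformal construction that guarantees the stated coverage on average, regardless of the foundation model or the data distribution.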
https://arxiv.org/abs/2405.10301
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for faster speed demanded in many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at this https URL
https://arxiv.org/abs/2405.10300
The expanding size of language models has created the necessity for a comprehensive examination across dimensions that reflect the tradeoffs among hardware metrics such as latency, energy consumption, GPU memory usage, and task performance. There is a growing interest in establishing Pareto frontiers for different language model configurations to identify optimal models under specified hardware constraints. Notably, architectures that excel in latency on one device may not perform optimally on another. However, exhaustive training and evaluation of numerous architectures across diverse hardware configurations is computationally prohibitive. To this end, we propose HW-GPT-Bench, a hardware-aware language model surrogate benchmark, where we leverage weight-sharing techniques from Neural Architecture Search (NAS) to efficiently train a supernet proxy, encompassing language models of varying scales in a single model. We conduct profiling of these models across 13 devices, considering 5 hardware metrics and 3 distinct model scales. Finally, we showcase the usability of HW-GPT-Bench using 8 different multi-objective NAS algorithms and evaluate the quality of the resultant Pareto fronts. Through this benchmark, our objective is to propel and expedite research in the advancement of multi-objective methods for NAS and structural pruning in large language models.
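Extracting a Pareto front over two competing objectives, say latency and perplexity, is the basic operation such a benchmark evaluates. A brute-force sketch with hypothetical config measurements (the numbers and names are illustrative, not from the benchmark):

```python
def pareto_front(points):
    """Return indices of points on the Pareto front when minimizing
    both objectives. A point is kept unless some other point is at
    least as good in both objectives and strictly better in one."""
    front = []
    for i, (a, b) in enumerate(points):
        dominated = any(
            c <= a and d <= b and (c < a or d < b)
            for j, (c, d) in enumerate(points) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# (latency_ms, perplexity) for hypothetical model configurations.
configs = [(10, 30.0), (12, 25.0), (20, 24.0), (11, 31.0), (25, 23.5)]
front = pareto_front(configs)  # config 3 is dominated by config 0
```

Multi-objective NAS algorithms aim to recover exactly this set without evaluating every configuration, which is what the benchmark's surrogate profiling makes cheap to check.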
https://arxiv.org/abs/2405.10299
Existing strategies for managing risks from advanced AI systems often focus on affecting what AI systems are developed and how they diffuse. However, this approach becomes less feasible as the number of developers of advanced AI grows, and impedes beneficial use-cases as well as harmful ones. In response, we urge a complementary approach: increasing societal adaptation to advanced AI, that is, reducing the expected negative impacts from a given level of diffusion of a given AI capability. We introduce a conceptual framework which helps identify adaptive interventions that avoid, defend against and remedy potentially harmful uses of AI systems, illustrated with examples in election manipulation, cyberterrorism, and loss of control to AI decision-makers. We discuss a three-step cycle that society can implement to adapt to AI. Increasing society's ability to implement this cycle builds its resilience to advanced AI. We conclude with concrete recommendations for governments, industry, and third-parties.
https://arxiv.org/abs/2405.10295
Large vision-language models (VLMs) fine-tuned on specialized visual instruction-following data have exhibited impressive language reasoning capabilities across various scenarios. However, this fine-tuning paradigm may not be able to efficiently learn optimal decision-making agents in multi-step goal-directed tasks from interactive environments. To address this challenge, we propose an algorithmic framework that fine-tunes VLMs with reinforcement learning (RL). Specifically, our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning, enabling the VLM to efficiently explore intermediate reasoning steps that lead to the final text-based action. Next, the open-ended text output is parsed into an executable action to interact with the environment to obtain goal-directed task rewards. Finally, our framework uses these task rewards to fine-tune the entire VLM with RL. Empirically, we demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks, enabling 7B models to outperform commercial models such as GPT-4V or Gemini. Furthermore, we find that CoT reasoning is a crucial component for performance improvement, as removing the CoT reasoning results in a significant decrease in the overall performance of our method.
https://arxiv.org/abs/2405.10292
Facts extraction is pivotal for constructing knowledge graphs. Recently, the increasing demand for temporal facts in downstream tasks has led to the emergence of the task of temporal fact extraction. In this paper, we specifically address the extraction of temporal facts from natural language text. Previous studies fail to handle the challenge of establishing time-to-fact correspondences in complex sentences. To overcome this hurdle, we propose a timeline-based sentence decomposition strategy using large language models (LLMs) with in-context learning, ensuring a fine-grained understanding of the timeline associated with various facts. In addition, we evaluate the performance of LLMs for direct temporal fact extraction and get unsatisfactory results. To this end, we introduce TSDRE, a method that incorporates the decomposition capabilities of LLMs into the traditional fine-tuning of smaller pre-trained language models (PLMs). To support the evaluation, we construct ComplexTRED, a complex temporal fact extraction dataset. Our experiments show that TSDRE achieves state-of-the-art results on both HyperRED-Temporal and ComplexTRED datasets.
https://arxiv.org/abs/2405.10288
Despite noise and caption quality having been acknowledged as important factors impacting vision-language contrastive pre-training, in this paper, we show that the full potential of improving the training process by addressing such issues is yet to be realized. Specifically, we firstly study and analyze two issues affecting training: incorrect assignment of negative pairs, and low caption quality and diversity. Then, we devise effective solutions for addressing both problems, which essentially require training with multiple true positive pairs. Finally, we propose training with sigmoid loss to address such a requirement. We show very large gains over the current state-of-the-art for both image recognition ($\sim +6\%$ on average over 11 datasets) and image retrieval ($\sim +19\%$ on Flickr30k and $\sim +15\%$ on MSCOCO).
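Why a sigmoid loss accommodates multiple true positives is easy to see in code: each (image, text) pair becomes an independent binary classification, so a row of the similarity matrix may mark several entries positive, unlike a softmax over the row. A minimal sketch with made-up logits (this illustrates the pairwise sigmoid objective in general, not the paper's exact training setup):

```python
import math

def sigmoid_contrastive_loss(sim, pos_mask):
    """Pairwise sigmoid loss over a similarity matrix.
    sim[i][j]: logit for (image i, text j); pos_mask[i][j]: 1 if a true pair.
    Each term is -log sigmoid(z * sim), i.e. softplus(-z * sim), with z = +/-1."""
    loss, n = 0.0, 0
    for i in range(len(sim)):
        for j in range(len(sim[0])):
            z = 1.0 if pos_mask[i][j] else -1.0
            loss += math.log1p(math.exp(-z * sim[i][j]))
            n += 1
    return loss / n

sim = [[4.0, 3.5, -2.0],
       [-1.0, 5.0, 4.5]]
pos = [[1, 1, 0],   # the first image has two matching captions
       [0, 1, 1]]   # the second text matches both remaining captions
loss = sigmoid_contrastive_loss(sim, pos)
```

Because no row-wise normalization couples the pairs, adding extra positives per row needs no change to the loss, which is exactly the requirement the paper's solutions impose.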
https://arxiv.org/abs/2405.10286
Numerous recent works aim to enhance the efficacy of Large Language Models (LLMs) through strategic prompting. In particular, the Optimization by PROmpting (OPRO) approach provides state-of-the-art performance by leveraging LLMs as optimizers where the optimization task is to find instructions that maximize the task accuracy. In this paper, we revisit OPRO for automated prompting with relatively small-scale LLMs, such as LLaMa-2 family and Mistral 7B. Our investigation reveals that OPRO shows limited effectiveness in small-scale LLMs, with limited inference capabilities constraining optimization ability. We suggest future automatic prompting engineering to consider both model capabilities and computational costs. Additionally, for small-scale LLMs, we recommend direct instructions that clearly outline objectives and methodologies as robust prompt baselines, ensuring efficient and effective prompt engineering in ongoing research.
https://arxiv.org/abs/2405.10276
The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which is capable of high-quality motion code generation in an efficient way. Moreover, we introduce a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that can generalise to unseen identities.
https://arxiv.org/abs/2405.10272
Federated learning (FL) represents a pivotal shift in machine learning (ML) as it enables collaborative training of local ML models coordinated by a central aggregator, all without the need to exchange local data. However, its application on edge devices is hindered by limited computational capabilities and data communication challenges, compounded by the inherent complexity of Deep Learning (DL) models. Model pruning is identified as a key technique for compressing DL models on devices with limited resources. Nonetheless, conventional pruning techniques typically rely on manually crafted heuristics and demand human expertise to achieve a balance between model size, speed, and accuracy, often resulting in sub-optimal solutions. In this study, we introduce an automated federated learning approach utilizing informed pruning, called AutoFLIP, which dynamically prunes and compresses DL models within both the local clients and the global server. It leverages a federated loss exploration phase to investigate model gradient behavior across diverse datasets and losses, providing insights into parameter significance. Our experiments showcase notable enhancements in scenarios with strong non-IID data, underscoring AutoFLIP's capacity to tackle computational constraints and achieve superior global convergence.
https://arxiv.org/abs/2405.10271
In this work, our goals are two fold: large-vocabulary continuous sign language recognition (CSLR), and sign language retrieval. To this end, we introduce a multi-task Transformer model, CSLR2, that ingests a signing sequence and outputs embeddings in a joint space shared by signed language and spoken language text. To enable CSLR evaluation in the large-vocabulary setting, we introduce new dataset annotations that have been manually collected. These provide continuous sign-level annotations for six hours of test videos, and will be made publicly available. We demonstrate that by a careful choice of loss functions, training the model for both the CSLR and retrieval tasks is mutually beneficial in terms of performance -- retrieval improves CSLR performance by providing context, while CSLR improves retrieval with more fine-grained supervision. We further show the benefits of leveraging weak and noisy supervision from large-vocabulary datasets such as BOBSL, namely sign-level pseudo-labels, and English subtitles. Our model significantly outperforms the previous state of the art on both tasks.
https://arxiv.org/abs/2405.10266
This paper investigates the dynamics of a deep neural network (DNN) learning interactions. Previous studies have discovered and mathematically proven that given each input sample, a well-trained DNN usually only encodes a small number of interactions (non-linear relationships) between input variables in the sample. A series of theorems have been derived to prove that we can consider the DNN's inference equivalent to using these interactions as primitive patterns for inference. In this paper, we discover that the DNN learns interactions in two phases. The first phase mainly penalizes interactions of medium and high orders, and the second phase mainly learns interactions of gradually increasing orders. We can consider the two-phase phenomenon as the starting point of a DNN learning over-fitted features. This phenomenon is widely shared across DNNs with various architectures trained for different tasks. Therefore, the discovery of the two-phase dynamics provides a detailed mechanism for how a DNN gradually learns different inference patterns (interactions). In particular, we have also verified the claim that high-order interactions have weaker generalization power than low-order interactions. Thus, the discovered two-phase dynamics also explains how the generalization power of a DNN changes during the training process.
https://arxiv.org/abs/2405.10262