Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are, however, too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT, a novel, simple and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation, before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers like OmniMotion, and is on par with, or better than, the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code, data, and videos showcasing the capabilities of our approach are available on the project webpage: this https URL.
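In recent years, point tracking techniques have become able to recover the trajectory of any scene point through most of a video, even when occlusions are present. In practice, however, these methods are too slow to track every point observed in a single frame within a reasonable time. This paper presents DOT, a new, simple, and efficient method for this problem. DOT first uses an off-the-shelf point tracking algorithm to extract a small set of tracks from key regions at motion boundaries. Given source and target frames, DOT then computes rough initial estimates of a dense flow field and a visibility mask by nearest-neighbor interpolation, and refines them with a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques, outperforms sophisticated "universal" trackers such as OmniMotion, and matches or exceeds the best point tracking algorithms such as CoTracker, while being at least two orders of magnitude faster. Quantitative and qualitative experiments on synthetic and real videos validate the promise of the proposed method. Code, data, and videos showcasing its capabilities are available on the project webpage.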
https://arxiv.org/abs/2312.00786
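The initialization step in the DOT abstract above (dense flow and visibility obtained from sparse tracks by nearest-neighbor interpolation) can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function name `nearest_neighbor_init`, the track tuple layout `(x, y, dx, dy, visible)`, and the brute-force search are assumptions made for the example, and the learned refinement stage is not shown.

```python
def nearest_neighbor_init(tracks, height, width):
    """Build a dense flow field and visibility mask from sparse tracks.

    tracks: list of (x, y, dx, dy, visible) tuples (hypothetical layout), where
    (x, y) is a tracked point in the source frame, (dx, dy) is its displacement
    to the target frame, and visible says whether it is seen in the target.
    """
    flow = [[(0.0, 0.0)] * width for _ in range(height)]
    visible = [[False] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Brute-force search for the closest track (squared Euclidean distance).
            nearest = min(tracks, key=lambda t: (t[0] - col) ** 2 + (t[1] - row) ** 2)
            flow[row][col] = (nearest[2], nearest[3])
            visible[row][col] = nearest[4]
    return flow, visible


# Two tracks on a 1x4 image: each pixel copies the values of its closest track.
tracks = [(0, 0, 1.0, 0.0, True), (3, 0, -2.0, 0.5, False)]
flow, visible = nearest_neighbor_init(tracks, height=1, width=4)
```

The brute-force search costs one pass over all tracks per pixel; a spatial index would be the natural replacement at realistic image sizes.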
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this, we define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.
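We propose a novel sequential modeling approach that makes it possible to learn a Large Vision Model (LVM) without using any linguistic data. To this end, we define a common format, "visual sentences", in which raw images and videos, as well as annotated data sources such as semantic segmentations and depth reconstructions, can be represented without any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences, the model can be trained to minimize a cross-entropy loss for next-token prediction. By training across different scales of model architecture and data diversity, we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time.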
https://arxiv.org/abs/2312.00785
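The training objective named in the LVM abstract above, a cross-entropy loss for next-token prediction over a token sequence, can be written out directly. The sketch below is a generic pure-Python version of that loss under assumed inputs: `logits[t]` is a hypothetical list of vocabulary scores used to predict `tokens[t + 1]`. It says nothing about the paper's tokenizer or model.

```python
import math


def next_token_cross_entropy(logits, tokens):
    """Mean cross-entropy of predicting tokens[t + 1] from logits[t]."""
    total = 0.0
    steps = len(tokens) - 1
    for t in range(steps):
        row = logits[t]
        peak = max(row)
        # Numerically stable log-sum-exp over the vocabulary.
        log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_z - row[tokens[t + 1]]
    return total / steps


# With uniform scores over a 4-token vocabulary the loss equals log(4).
tokens = [0, 1, 2]
logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = next_token_cross_entropy(logits, tokens)
```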
While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.
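Although existing large vision-language multimodal models focus on whole-image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this, we introduce a novel multimodal model that can decode arbitrary visual prompts. This lets users mark images intuitively and interact with the model using natural cues such as a "red bounding box" or a "pointed arrow". Our simple design overlays visual markers directly onto the RGB image, which removes the need for complex region encodings, and it achieves state-of-the-art performance on region-understanding tasks such as Visual7W, PointQA, and the Visual Commonsense Reasoning benchmark. In addition, we present ViP-Bench, a comprehensive benchmark for assessing how well models understand visual prompts across multiple dimensions, enabling future research in this area. Code, data, and model are publicly available.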
https://arxiv.org/abs/2312.00784
Neural rendering has demonstrated remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations, prior works can accurately capture the motion and achieve high-fidelity reconstruction of the target object. Despite this, real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge, we introduce MorpheuS, a framework for dynamic 360° surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance, in conjunction with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360° surface reconstruction of a deformable object from a monocular RGB-D video.
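Neural rendering has achieved remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations, prior work can accurately capture motion and achieve high-fidelity reconstruction of the target object. Even so, real-world video scenarios often contain large unobserved regions where neural representations struggle to produce a realistic completion. To address this, we introduce MorpheuS, a framework for dynamic 360° surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance, together with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360° surface reconstruction of a deformable object from a monocular RGB-D video.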
https://arxiv.org/abs/2312.00778
Text-driven video generation witnesses rapid progress. However, merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents, especially for customized content creation. In this paper, we study the task of video generation with image prompts, which provide more accurate and direct content control beyond the text prompts. Specifically, we propose a feed-forward framework VideoBooth, with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and then it is propagated to the remaining frames, which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with feed-forward pass.
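Text-driven video generation has seen rapid progress. However, text prompts alone are not enough to depict a desired subject appearance that accurately matches the user's intent, especially for customized content creation. In this paper, we study video generation with image prompts, which provide more accurate and direct content control than text prompts. Specifically, we propose VideoBooth, a feed-forward framework with two dedicated designs: 1) We embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from an image encoder provide high-level encodings of the image prompts, while fine visual embeddings from the proposed attention injection module provide multi-scale, detailed encodings. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at the fine level, multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and is then propagated to the remaining frames, which maintains temporal consistency. Extensive experiments show that VideoBooth achieves state-of-the-art performance in generating customized, high-quality videos with subjects specified in image prompts. Notably, VideoBooth is a generalizable framework in which a single model handles a wide range of image prompts with a feed-forward pass.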
https://arxiv.org/abs/2312.00777
We pursue the goal of developing robots that can interact zero-shot with generic unseen objects via a diverse repertoire of manipulation skills and show how passive human videos can serve as a rich source of data for learning such generalist robots. Unlike typical robot learning approaches which directly learn how a robot should act from interaction data, we adopt a factorized approach that can leverage large-scale human videos to learn how a human would accomplish a desired task (a human plan), followed by translating this plan to the robot's embodiment. Specifically, we learn a human plan predictor that, given a current image of a scene and a goal image, predicts the future hand and object configurations. We combine this with a translation module that learns a plan-conditioned robot manipulation policy, and allows following human plans for generic manipulation tasks in a zero-shot manner with no deployment-time training. Importantly, while the plan predictor can leverage large-scale human videos for learning, the translation module only requires a small amount of in-domain data, and can generalize to tasks not seen during training. We show that our learned system can perform over 16 manipulation skills that generalize to 40 objects, encompassing 100 real-world tasks for table-top manipulation and diverse in-the-wild manipulation.
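We aim to develop robots that can interact zero-shot with generic unseen objects through a diverse repertoire of manipulation skills, and we show how passive human videos can serve as a rich data source for learning such generalist robots. Unlike typical robot learning approaches, which learn how a robot should act directly from interaction data, we adopt a factorized approach that uses large-scale human videos to learn how a human would accomplish a desired task (a human plan) and then translates this plan to the robot's embodiment. Specifically, we learn a human plan predictor that, given a current image of the scene and a goal image, predicts future hand and object configurations. We combine it with a translation module that learns a plan-conditioned robot manipulation policy, which allows the robot to follow human plans for generic manipulation tasks in a zero-shot manner with no deployment-time training. Importantly, while the plan predictor can learn from large-scale human videos, the translation module needs only a small amount of in-domain data and can generalize to tasks not seen during training. We show that the learned system can perform over 16 manipulation skills that generalize to 40 objects, covering 100 real-world tasks in table-top manipulation and diverse in-the-wild manipulation.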
https://arxiv.org/abs/2312.00775
Conversational agents leveraging AI, particularly deep learning, are emerging in both academic research and real-world applications. However, these applications still face challenges, including disrespecting knowledge and facts, not personalizing to user preferences, and enormous demand for computational resources during training and inference. Recent research efforts have been focused on addressing these challenges from various aspects, including supplementing various types of auxiliary information to the conversational agents. However, existing methods are still not able to effectively and efficiently exploit relevant information from these auxiliary supplements to further unleash the power of the conversational agents and the language models they use. In this paper, we present a novel method, PK-NCLI, that is able to accurately and efficiently identify relevant auxiliary information to improve the quality of conversational responses by learning the relevance among persona, chat history, and knowledge background through low-level normalized contextual latent interaction. Our experimental results indicate that PK-NCLI outperforms the state-of-the-art method, PK-FoCus, by 47.80%/30.61%/24.14% in terms of perplexity, knowledge grounding, and training efficiency, respectively, and maintained the same level of persona grounding performance. We also provide a detailed analysis of how different factors, including language model choices and trade-offs on training weights, would affect the performance of PK-NCLI.
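Conversational agents that leverage AI, particularly deep learning, are emerging in both academic research and real-world applications. These applications still face challenges, however, including disregard for knowledge and facts, lack of personalization to user preferences, and an enormous demand for computational resources during training and inference. Recent research has focused on addressing these challenges from various angles, including supplying conversational agents with various types of auxiliary information. Existing methods are nevertheless unable to exploit the relevant information in these auxiliary supplements effectively and efficiently, which limits the power of conversational agents and the language models they use. In this paper, we present a novel method, PK-NCLI, that accurately and efficiently identifies relevant auxiliary information to improve the quality of conversational responses by learning the relevance among persona, chat history, and knowledge background through low-level normalized contextual latent interaction. Our experimental results show that PK-NCLI outperforms the state-of-the-art method, PK-FoCus, by 47.80%/30.61%/24.14% in perplexity, knowledge grounding, and training efficiency, respectively, while maintaining the same level of persona grounding performance. We also provide a detailed analysis of how different factors, including language model choices and trade-offs in training weights, affect the performance of PK-NCLI.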
https://arxiv.org/abs/2312.00774
The multitude of makeup products available can make it challenging to find the ideal match for desired attributes. An intelligent approach for product discovery is required to enhance the makeup shopping experience to make it more convenient and satisfying. However, enabling accurate and efficient product discovery requires extracting detailed attributes like color and finish type. Our work introduces an automated pipeline that utilizes multiple customized machine learning models to extract essential material attributes from makeup product images. Our pipeline is versatile and capable of handling various makeup products. To showcase the efficacy of our pipeline, we conduct extensive experiments on eyeshadow products (both single and multi-shade ones), a challenging makeup product known for its diverse range of shapes, colors, and finish types. Furthermore, we demonstrate the applicability of our approach by successfully extending it to other makeup categories like lipstick and foundation, showcasing its adaptability and effectiveness across different beauty products. Additionally, we conduct ablation experiments to demonstrate the superiority of our machine learning pipeline over human labeling methods in terms of reliability. Our proposed method showcases its effectiveness in cross-category product discovery, specifically in recommending makeup products that perfectly match a specified outfit. Lastly, we also demonstrate the application of these material attributes in enabling virtual-try-on experiences which makes makeup shopping experience significantly more engaging.
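The sheer variety of makeup products makes it hard to find the ideal match for a set of desired attributes. An intelligent approach to product discovery is needed to make makeup shopping more convenient and satisfying. Accurate and efficient product discovery, however, requires extracting detailed attributes such as color and finish type. Our work introduces an automated pipeline that uses multiple customized machine learning models to extract essential material attributes from makeup product images. The pipeline is versatile and can handle various makeup products. To show its efficacy, we conduct extensive experiments on eyeshadow products (both single-shade and multi-shade), a challenging product category known for its diverse shapes, colors, and finish types. We further demonstrate the applicability of the approach by successfully extending it to other makeup categories such as lipstick and foundation, showing its adaptability and effectiveness across different beauty products. In addition, we conduct ablation experiments showing that our machine learning pipeline is more reliable than human labeling. The proposed method is also effective for cross-category product discovery, specifically for recommending makeup products that match a specified outfit. Finally, we show how these material attributes enable virtual try-on experiences, which make makeup shopping significantly more engaging.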
https://arxiv.org/abs/2312.00766
Large language model (LLM) powered chatbots are primarily text-based today, and impose a large interactional cognitive load, especially for exploratory or sensemaking tasks such as planning a trip or learning about a new city. Because the interaction is textual, users have little scaffolding in the way of structure, informational "scent", or ability to specify high-level preferences or goals. We introduce ExploreLLM that allows users to structure thoughts, help explore different options, navigate through the choices and recommendations, and to more easily steer models to generate more personalized responses. We conduct a user study and show that users find it helpful to use ExploreLLM for exploratory or planning tasks, because it provides a useful schema-like structure to the task, and guides users in planning. The study also suggests that users can more easily personalize responses with high-level preferences with ExploreLLM. Together, ExploreLLM points to a future where users interact with LLMs beyond the form of chatbots, and instead designed to support complex user tasks with a tighter integration between natural language and graphical user interfaces.
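Chatbots powered by large language models (LLMs) are primarily text-based today and impose a large interactional cognitive load, especially for exploratory or sensemaking tasks such as planning a trip or learning about a new city. Because the interaction is textual, users get little support in the form of structure, informational "scent", or the ability to specify high-level preferences or goals. We introduce ExploreLLM, which lets users structure their thoughts, explore different options, navigate through choices and recommendations, and more easily steer models toward more personalized responses. We conduct a user study and find that users consider ExploreLLM helpful for exploratory or planning tasks because it gives the task a useful schema-like structure and guides users through planning. The study also suggests that users can more easily personalize responses with high-level preferences when using ExploreLLM. Taken together, ExploreLLM points to a future in which interaction with LLMs is no longer limited to the chatbot form and is instead designed to support complex user tasks through a tighter integration of natural language and graphical user interfaces.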
https://arxiv.org/abs/2312.00763
Machine unlearning has emerged as a prominent and challenging area of interest, driven in large part by the rising regulatory demands for industries to delete user data upon request and the heightened awareness of privacy. Existing approaches either retrain models from scratch or use several finetuning steps for every deletion request, often constrained by computational resource limitations and restricted access to the original training data. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate an entire class or a group of classes from the learned model. To that end, our algorithm first estimates the Retain Space and the Forget Space, representing the feature or activation spaces for samples from classes to be retained and unlearned, respectively. To obtain these spaces, we propose a novel singular value decomposition-based technique that requires layer wise collection of network activations from a few forward passes through the network. We then compute the shared information between these spaces and remove it from the forget space to isolate class-discriminatory feature space for unlearning. Finally, we project the model weights in the orthogonal direction of the class-discriminatory space to obtain the unlearned model. We demonstrate our algorithm's efficacy on ImageNet using a Vision Transformer with only $\sim$1.5% drop in retain accuracy compared to the original model while maintaining under 1% accuracy on the unlearned class samples. Further, our algorithm consistently performs well when subject to Membership Inference Attacks showing 7.8% improvement on average across a variety of image classification datasets and network architectures, as compared to other baselines while being $\sim$6x more computationally efficient.
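Machine unlearning has become a prominent and challenging area of interest, driven largely by rising regulatory demands for industries to delete user data upon request and by heightened awareness of privacy. Existing approaches either retrain models from scratch or apply several finetuning steps for every deletion request, and they are often constrained by limited computational resources and restricted access to the original training data. In this work, we introduce a novel class unlearning algorithm designed to strategically eliminate an entire class or a group of classes from the learned model. To that end, the algorithm first estimates the Retain Space and the Forget Space, which represent the feature or activation spaces for samples from the classes to be retained and unlearned, respectively. To obtain these spaces, we propose a novel technique based on singular value decomposition that requires layer-wise collection of network activations from a few forward passes through the network. We then compute the information shared between these spaces and remove it from the forget space to isolate the class-discriminatory feature space for unlearning. Finally, we project the model weights in the direction orthogonal to the class-discriminatory space to obtain the unlearned model. We demonstrate the algorithm's efficacy on ImageNet using a Vision Transformer, with only a $\sim$1.5% drop in retain accuracy compared to the original model while keeping accuracy on the unlearned class samples under 1%. Furthermore, the algorithm consistently performs well under Membership Inference Attacks, showing a 7.8% average improvement over other baselines across a variety of image classification datasets and network architectures, while being $\sim$6x more computationally efficient.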
https://arxiv.org/abs/2312.00761
Foundation models, now powering most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs) have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5$\times$ higher throughput than Transformers) and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation.
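Foundation models, which now power most of the exciting applications in deep learning, are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures, such as linear attention, gated convolution and recurrent models, and structured state space models (SSMs), have been developed to address the computational inefficiency of Transformers on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and we make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities, allowing the model to selectively propagate or forget information along the sequence length dimension depending on the current token. Second, even though this change prevents the use of efficient convolutions, we design a hardware-aware parallel algorithm in recurrent mode. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba has fast inference (5$\times$ higher throughput than Transformers) and scales linearly in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modeling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and in downstream evaluation.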
https://arxiv.org/abs/2312.00752
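The first idea in the Mamba abstract above, making the SSM parameters functions of the input so that the recurrence can selectively keep or forget state, can be illustrated with a toy scalar recurrence. Everything in this sketch is an assumption for illustration: the scalar state, the softplus step size, and the parameter names `a`, `w_delta`, `w_b`, `w_c` are hypothetical, and the loop is a plain sequential scan, not the hardware-aware parallel algorithm the abstract refers to.

```python
import math


def softplus(z):
    """Smooth positive function used here for the input-dependent step size."""
    return math.log1p(math.exp(z))


def selective_scan(xs, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0):
    """Toy scalar selective state space recurrence (linear time in len(xs)).

    h_t = exp(delta_t * a) * h_{t-1} + delta_t * b_t * x_t,   y_t = c_t * h_t
    where delta_t, b_t and c_t all depend on the current input x_t.
    """
    h = 0.0
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size, always > 0
        decay = math.exp(delta * a)    # in (0, 1) when a < 0: larger delta forgets more
        b = w_b * x                    # input-dependent input projection
        c = w_c * x                    # input-dependent output projection
        h = decay * h + delta * b * x
        ys.append(c * h)
    return ys


ys = selective_scan([1.0, 0.5, -2.0, 0.0])
```

Because the decay factor depends on the current input, the amount of past state carried forward changes from token to token, which is the selectivity the abstract describes.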
Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer model is degraded due to the over-smoothing issue in which the token representations become identical when the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional which promotes smoothness, thereby causing token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens to preserve the fidelity of the tokens. Minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate the over-smoothing issue. We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
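Transformers have achieved remarkable success in a wide range of natural language processing and computer vision applications. However, the representation capacity of a deep transformer degrades because of the over-smoothing issue, in which token representations become identical as the model's depth grows. In this work, we show that self-attention layers in transformers minimize a functional that promotes smoothness, which leads to token uniformity. We then propose a novel regularizer that penalizes the norm of the difference between the smooth output tokens from self-attention and the input tokens, in order to preserve the fidelity of the tokens. By minimizing the resulting regularized energy functional, we derive the Neural Transformer with a Regularized Nonlocal Functional (NeuTRENO), a novel class of transformer models that can mitigate over-smoothing. We empirically demonstrate the advantages of NeuTRENO over baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.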
https://arxiv.org/abs/2312.00751
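The over-smoothing effect described in the NeuTRENO abstract above, and the effect of penalizing the distance between the attention output and the input tokens, can be seen in a small experiment. The sketch below is a simplified stand-in and not the NeuTRENO update itself: it assumes identity query/key/value maps, and `regularized_step` uses a hypothetical quadratic fidelity penalty whose minimizer is a blend of the smoothed output and the input tokens.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_step(tokens):
    """Self-attention with identity query/key/value maps (a simplifying assumption)."""
    scores = [[sum(a * b for a, b in zip(q, k)) for k in tokens] for q in tokens]
    weights = [softmax(r) for r in scores]
    dim = len(tokens[0])
    return [[sum(w * t[d] for w, t in zip(ws, tokens)) for d in range(dim)] for ws in weights]


def regularized_step(tokens, tokens0, lam):
    """Blend the smoothed output s with the input tokens x0, for 0 <= lam < 1.

    (1 - lam) * s + lam * x0 minimizes |u - s|^2 + lam / (1 - lam) * |u - x0|^2,
    a hypothetical fidelity penalty used only for this illustration.
    """
    smoothed = attention_step(tokens)
    return [[(1 - lam) * s + lam * x for s, x in zip(srow, xrow)]
            for srow, xrow in zip(smoothed, tokens0)]


def spread(tokens):
    """Largest per-feature gap between any two tokens (0 means identical tokens)."""
    dim = len(tokens[0])
    return max(max(t[d] for t in tokens) - min(t[d] for t in tokens) for d in range(dim))


def run(tokens0, layers, lam):
    tokens = tokens0
    for _ in range(layers):
        tokens = regularized_step(tokens, tokens0, lam)
    return tokens


x0 = [[1.0], [-1.0]]
plain = run(x0, layers=5, lam=0.0)        # pure attention: tokens collapse together
regularized = run(x0, layers=5, lam=0.5)  # fidelity term keeps the tokens apart
```

In this two-token example the plain stack drives the gap between the tokens toward zero within a few layers, while the blended update keeps it bounded away from zero.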
In this study, we explore the application of Large Language Models (LLMs) in "Jubensha" (Chinese murder mystery role-playing games), a novel area in AI-driven gaming. We introduce the first Chinese dataset specifically for Jubensha, including character scripts and game rules, to foster AI agent development in this complex narrative environment. Our work also presents a unique multi-agent interaction framework using LLMs, allowing AI agents to autonomously engage in the game, enhancing the dynamics of Jubensha gameplay. To evaluate these AI agents, we developed specialized methods targeting their mastery of case information and reasoning skills. Furthermore, we incorporated the latest advancements in in-context learning to improve the agents' performance in critical aspects like information gathering, murderer detection, and logical reasoning. The experimental results validate the effectiveness of our proposed methods. This work aims to offer a fresh perspective on understanding LLM capabilities and establish a new benchmark for evaluating large language model-based agents to researchers in the field.
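In this study, we explore the application of Large Language Models (LLMs) in "Jubensha" (Chinese murder mystery role-playing games), a new area in AI-driven gaming. We introduce the first Chinese dataset built specifically for Jubensha, including character scripts and game rules, to foster the development of AI agents in this complex narrative environment. Our work also presents a unique multi-agent interaction framework based on LLMs that allows AI agents to take part in the game autonomously, enriching the dynamics of Jubensha gameplay. To evaluate these agents, we developed specialized methods that target their mastery of case information and their reasoning skills. We also incorporated the latest advances in in-context learning to improve the agents' performance in critical aspects such as information gathering, murderer detection, and logical reasoning. The experimental results confirm the effectiveness of the proposed methods. This work aims to offer researchers a fresh perspective on LLM capabilities and to establish a new benchmark for evaluating agents based on large language models.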
https://arxiv.org/abs/2312.00746
Meta-learning is a powerful approach that exploits historical data to quickly solve new tasks from the same distribution. In the low-data regime, methods based on the closed-form posterior of Gaussian processes (GP) together with Bayesian optimization have achieved high performance. However, these methods are either computationally expensive or introduce assumptions that hinder a principled propagation of uncertainty between task models. This may disrupt the balance between exploration and exploitation during optimization. In this paper, we develop ScaML-GP, a modular GP model for meta-learning that is scalable in the number of tasks. Our core contribution is a carefully designed multi-task kernel that enables hierarchical training and task scalability. Conditioning ScaML-GP on the meta-data exposes its modular nature yielding a test-task prior that combines the posteriors of meta-task GPs. In synthetic and real-world meta-learning experiments, we demonstrate that ScaML-GP can learn efficiently both with few and many meta-tasks.
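Meta-learning is a powerful approach that exploits historical data to quickly solve new tasks from the same distribution. In the low-data regime, methods based on the closed-form posterior of Gaussian processes (GP) combined with Bayesian optimization have achieved high performance. However, these methods are either computationally expensive or introduce assumptions that hinder a principled propagation of uncertainty between task models, which can upset the balance between exploration and exploitation during optimization. In this paper, we develop ScaML-GP, a modular GP model for meta-learning that scales in the number of tasks. Our core contribution is a carefully designed multi-task kernel that enables hierarchical training and task scalability. Conditioning ScaML-GP on the meta-data exposes its modular nature and yields a test-task prior that combines the posteriors of the meta-task GPs. In synthetic and real-world meta-learning experiments, we show that ScaML-GP learns efficiently with both few and many meta-tasks.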
https://arxiv.org/abs/2312.00742
Existing score distillation methods are sensitive to classifier-free guidance (CFG) scale: manifested as over-smoothness or instability at small CFG scales, while over-saturation at large ones. To explain and analyze these issues, we revisit the derivation of Score Distillation Sampling (SDS) and decipher existing score distillation with the Wasserstein Generative Adversarial Network (WGAN) paradigm. With the WGAN paradigm, we find that existing score distillation either employs a fixed sub-optimal discriminator or conducts incomplete discriminator optimization, resulting in the scale-sensitive issue. We propose the Adversarial Score Distillation (ASD), which maintains an optimizable discriminator and updates it using the complete optimization objective. Experiments show that the proposed ASD performs favorably in 2D distillation and text-to-3D tasks against existing methods. Furthermore, to explore the generalization ability of our WGAN paradigm, we extend ASD to the image editing task, which achieves competitive results. The project page and code are at this https URL.
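Existing score distillation methods are sensitive to the classifier-free guidance (CFG) scale: they show over-smoothness or instability at small CFG scales and over-saturation at large ones. To explain and analyze these issues, we revisit the derivation of Score Distillation Sampling (SDS) and interpret existing score distillation through the Wasserstein Generative Adversarial Network (WGAN) paradigm. Under this paradigm, we find that existing score distillation either uses a fixed sub-optimal discriminator or performs incomplete discriminator optimization, which leads to the scale-sensitivity issue. We propose Adversarial Score Distillation (ASD), which maintains an optimizable discriminator and updates it using the complete optimization objective. Experiments show that ASD performs favorably against existing methods in 2D distillation and text-to-3D tasks. Furthermore, to explore the generalization ability of our WGAN paradigm, we extend ASD to image editing, where it achieves competitive results. The project page and code are linked from the paper.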
https://arxiv.org/abs/2312.00739
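For readers unfamiliar with the CFG scale that the ASD abstract above refers to, the standard classifier-free guidance combination of two noise predictions is shown below. This is general background on classifier-free guidance, not part of the ASD method; the function name `cfg_combine` and the list-based vectors are assumptions for the example.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: eps = eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]
unguided = cfg_combine(eps_uncond, eps_cond, scale=0.0)     # unconditional prediction
conditional = cfg_combine(eps_uncond, eps_cond, scale=1.0)  # conditional prediction
amplified = cfg_combine(eps_uncond, eps_cond, scale=2.0)    # extrapolates past the conditional one
```

A scale above 1 pushes the prediction beyond the conditional estimate, which is the regime where the abstract reports over-saturation.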
Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.
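Despite the remarkable achievements of large language models (LLMs) across various tasks, a linguistic bias remains that favors high-resource languages such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models focused specifically on Southeast Asian (SEA) languages. SeaLLMs are built on the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, followed by specialized instruction and alignment tuning, to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation shows that SeaLLM-13b models outperform comparable open-source models across a wide spectrum of linguistic tasks and in assistant-style instruction following. Moreover, they outperform ChatGPT-3.5 by large margins in non-Latin languages such as Thai, Khmer, Lao, and Burmese, while remaining lightweight and cost-effective to operate.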
https://arxiv.org/abs/2312.00738
The recent Gaussian Splatting achieves high-quality and real-time novel-view synthesis of 3D scenes. However, it is solely concentrated on the appearance and geometry modeling, while lacking in fine-grained object-level scene understanding. To address this issue, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes. We augment each Gaussian with a compact Identity Encoding, allowing the Gaussians to be grouped according to their object instance or stuff membership in the 3D scene. Instead of resorting to expensive 3D labels, we supervise the Identity Encodings during the differentiable rendering by leveraging the 2D mask predictions by SAM, along with an introduced 3D spatial consistency regularization. Compared to the implicit NeRF representation, we show that the discrete and grouped 3D Gaussians can reconstruct, segment and edit anything in 3D with high visual quality, fine granularity and efficiency. Based on Gaussian Grouping, we further propose a local Gaussian Editing scheme, which shows efficacy in versatile scene editing applications, including 3D object removal, inpainting, colorization and scene recomposition. Our code and models will be at this https URL.
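Recently, Gaussian Splatting has achieved high-quality, real-time novel-view synthesis of 3D scenes. However, it concentrates solely on appearance and geometry modeling and lacks fine-grained, object-level scene understanding. To address this, we propose Gaussian Grouping, which extends Gaussian Splatting to jointly reconstruct and segment anything in open-world 3D scenes. We augment each Gaussian with a compact Identity Encoding so that the Gaussians can be grouped according to their object instance or stuff membership in the 3D scene. Instead of relying on expensive 3D labels, we supervise the Identity Encodings during differentiable rendering by using the 2D mask predictions from SAM, together with an introduced 3D spatial consistency regularization. Compared with the implicit NeRF representation, we show that discrete, grouped 3D Gaussians can reconstruct, segment, and edit anything in 3D with high visual quality, fine granularity, and efficiency. Building on Gaussian Grouping, we further propose a local Gaussian Editing scheme that is effective in versatile scene editing applications, including 3D object removal, inpainting, colorization, and scene recomposition. Our code and models will be released at the project link.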
https://arxiv.org/abs/2312.00732
This paper delves into the problem of safe reinforcement learning (RL) in a partially observable environment with the aim of achieving safe-reachability objectives. In traditional partially observable Markov decision processes (POMDP), ensuring safety typically involves estimating the belief in latent states. However, accurately estimating an optimal Bayesian filter in POMDP to infer latent states from observations in a continuous state space poses a significant challenge, largely due to the intractable likelihood. To tackle this issue, we propose a stochastic model-based approach that guarantees RL safety almost surely in the face of unknown system dynamics and partial observation environments. We leveraged the Predictive State Representation (PSR) and Reproducing Kernel Hilbert Space (RKHS) to represent future multi-step observations analytically, and the results in this context are provable. Furthermore, we derived essential operators from the kernel Bayes' rule, enabling the recursive estimation of future observations using various operators. Under the assumption of undercompleteness, a polynomial sample complexity is established for the RL algorithm for the infinite size of observation and action spaces, ensuring an $\epsilon$-suboptimal safe policy guarantee.
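This paper examines the problem of safe reinforcement learning (RL) in a partially observable environment, with the aim of achieving safe-reachability objectives. In traditional partially observable Markov decision processes (POMDP), ensuring safety typically involves estimating the belief over latent states. However, accurately estimating an optimal Bayesian filter in a POMDP to infer latent states from observations in a continuous state space is a significant challenge, largely because the likelihood is intractable. To address this, we propose a stochastic model-based approach that guarantees RL safety almost surely under unknown system dynamics and partial observation. We use the Predictive State Representation (PSR) and a Reproducing Kernel Hilbert Space (RKHS) to represent future multi-step observations analytically, and the results in this setting are provable. In addition, we derive essential operators from the kernel Bayes' rule, which enables recursive estimation of future observations using these operators. Under the undercompleteness assumption, we establish a polynomial sample complexity for the RL algorithm with infinite observation and action spaces, which ensures an $\epsilon$-suboptimal safe policy guarantee.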
https://arxiv.org/abs/2312.00727
High-throughput drug screening -- using cell imaging or gene expression measurements as readouts of drug effect -- is a critical tool in biotechnology to assess and understand the relationship between the chemical structure and biological activity of a drug. Since large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations in the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to effectively deal with batch effects and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. It adaptively reweighs samples to equalize their implied batch distribution. Extensive experiments on drug screening data reveal InfoCORE's superior performance in a multitude of tasks including molecular property prediction and molecule-phenotype retrieval. Additionally, we show results for how InfoCORE offers a versatile framework and resolves general distribution shifts and issues of data fairness by minimizing correlation with spurious features or removing sensitive attributes. The code is available at this https URL.
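High-throughput drug screening, which uses cell imaging or gene expression measurements as readouts of drug effect, is a critical tool in biotechnology for assessing and understanding the relationship between the chemical structure and the biological activity of a drug. Because large-scale screens have to be divided into multiple experiments, a key difficulty is dealing with batch effects, which can introduce systematic errors and non-biological associations into the data. We propose InfoCORE, an Information maximization approach for COnfounder REmoval, to deal with batch effects effectively and obtain refined molecular representations. InfoCORE establishes a variational lower bound on the conditional mutual information of the latent representations given a batch identifier. It adaptively reweighs samples to equalize their implied batch distribution. Extensive experiments on drug screening data show InfoCORE's superior performance on many tasks, including molecular property prediction and molecule-phenotype retrieval. We also show that InfoCORE offers a versatile framework that resolves general distribution shifts and data fairness issues by minimizing correlation with spurious features or removing sensitive attributes. The code is linked from the paper.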
https://arxiv.org/abs/2312.00718
Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage, enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training, enabling focused computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for the effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is available at this https URL.
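Bird's-eye View (BeV) representations have become the de facto shared space in driving applications, offering a unified space for sensor data fusion and supporting various downstream tasks. However, conventional models use grids with fixed resolution and range, and they are computationally inefficient because resources are allocated uniformly across all cells. To address this, we propose PointBeV, a novel sparse BeV segmentation model that operates on sparse BeV cells instead of dense grids. This approach gives precise control over memory usage, which enables long temporal contexts and accommodates memory-constrained platforms. PointBeV uses an efficient two-pass training strategy that focuses computation on regions of interest. At inference time, it can be used with various memory/performance trade-offs and adapts flexibly to new, specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle, pedestrian, and lane segmentation, showing superior performance in static and temporal settings despite being trained solely with sparse signals. We will release our code along with two new efficient modules used in the architecture: Sparse Feature Pulling, designed for effective extraction of features from images to BeV, and Submanifold Attention, which enables efficient temporal modeling. Our code is linked from the paper.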
https://arxiv.org/abs/2312.00703