Contemporary 3D research, particularly in reconstruction and generation, heavily relies on 2D images for inputs or supervision. However, current designs for this 2D-3D mapping are memory-intensive, posing a significant bottleneck for existing methods and hindering new applications. In response, we propose a pair of highly scalable components for 3D neural fields: Lightplane Render and Splatter, which significantly reduce memory usage in 2D-3D mapping. These innovations enable the processing of vastly more and higher-resolution images with small memory and computational costs. We demonstrate their utility in various applications, from benefiting single-scene optimization with image-level losses to realizing a versatile pipeline for dramatically scaling 3D reconstruction and generation. Code: \url{this https URL}.
https://arxiv.org/abs/2404.19760
This work introduces MotionLCM, extending controllable motion generation to a real-time level. Existing methods for spatial control in text-conditioned motion generation suffer from significant runtime inefficiency. To address this issue, we first propose the motion latent consistency model (MotionLCM) for motion generation, building upon the latent diffusion model (MLD). By employing one-step (or few-step) inference, we further improve the runtime efficiency of the motion latent diffusion model for motion generation. To ensure effective controllability, we incorporate a motion ControlNet within the latent space of MotionLCM and enable explicit control signals (e.g., pelvis trajectory) in the vanilla motion space to control the generation process directly, similar to controlling other latent-free diffusion models for motion generation. By employing these techniques, our approach can generate human motions with text and control signals in real-time. Experimental results demonstrate the remarkable generation and controlling capabilities of MotionLCM while maintaining real-time runtime efficiency.
https://arxiv.org/abs/2404.19759
3D scene generation has quickly become a challenging new research direction, fueled by consistent improvements of 2D generative diffusion models. Most prior work in this area generates scenes by iteratively stitching newly generated frames with existing geometry. These works often depend on pre-trained monocular depth estimators to lift the generated images into 3D, fusing them with the existing scene representation. These approaches are then often evaluated via a text metric, measuring the similarity between the generated images and a given text prompt. In this work, we make two fundamental contributions to the field of 3D scene generation. First, we note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene. We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene. Second, we introduce a new benchmarking scheme for scene generation methods that is based on ground truth geometry, and thus measures the quality of the structure of the scene.
https://arxiv.org/abs/2404.19758
Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs.
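The edge-wise learnable functions are easy to sketch. Below is a minimal, illustrative KAN layer in plain Python, using piecewise-linear interpolation over a fixed knot grid as a simple stand-in for the paper's spline parametrization (the `Phi` class, grid, and toy target function are all assumptions for illustration, not the authors' implementation):

```python
import math

class Phi:
    """A learnable univariate function on one KAN edge.

    KANs parametrize these as splines; here a piecewise-linear
    interpolant over a fixed grid stands in for a spline.
    """
    def __init__(self, grid, values):
        self.grid = grid          # knot locations, ascending
        self.values = values      # learnable heights at the knots

    def __call__(self, x):
        g, v = self.grid, self.values
        if x <= g[0]:
            return v[0]
        if x >= g[-1]:
            return v[-1]
        for i in range(len(g) - 1):
            if g[i] <= x <= g[i + 1]:
                t = (x - g[i]) / (g[i + 1] - g[i])
                return (1 - t) * v[i] + t * v[i + 1]

def kan_layer(x, edges):
    """KAN layer: each output sums univariate functions of the inputs.

    edges[j][i] is the function on the edge from input i to output j,
    so y_j = sum_i edges[j][i](x_i) -- no linear weights at all.
    """
    return [sum(phi(xi) for phi, xi in zip(row, x)) for row in edges]

# Two inputs -> one output; the edge functions approximate
# f(x0, x1) = x0^2 + sin(x1) as a sum of univariate pieces.
grid = [-2 + 0.5 * k for k in range(9)]          # knots on [-2, 2]
phi0 = Phi(grid, [g * g for g in grid])          # ~ x^2
phi1 = Phi(grid, [math.sin(g) for g in grid])    # ~ sin(x)
y = kan_layer([1.0, 0.0], [[phi0, phi1]])
print(y)  # -> [1.0]
```

In a real KAN the knot heights (and spline coefficients) would be trained by gradient descent; the structural point is that all learnable capacity lives in these univariate edge functions.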
https://arxiv.org/abs/2404.19756
Vision-language datasets are vital for both text-to-image (T2I) and image-to-text (I2T) research. However, current datasets lack descriptions with fine-grained detail that would allow for richer associations to be learned by models. To fill the gap, we introduce Descriptions of Connected and Contrasting Images (DOCCI), a dataset with long, human-annotated English descriptions for 15k images that were taken, curated and donated by a single researcher intent on capturing key challenges such as spatial relations, counting, text rendering, world knowledge, and more. We instruct human annotators to create comprehensive descriptions for each image; these average 136 words in length and are crafted to clearly distinguish each image from those that are related or similar. Each description is highly compositional and typically encompasses multiple challenges. Through both quantitative and qualitative analyses, we demonstrate that DOCCI serves as an effective training resource for image-to-text generation -- a PaLI 5B model finetuned on DOCCI shows equal or superior results compared to highly-performant larger models like LLaVA-1.5 7B and InstructBLIP 7B. Furthermore, we show that DOCCI is a useful testbed for text-to-image generation, highlighting the limitations of current text-to-image models in capturing long descriptions and fine details.
https://arxiv.org/abs/2404.19753
Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact-check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original and the reconstructed image generated by a text-to-image model using the caption; 3) human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
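The three-step propose/verify/summarize structure can be sketched as a small pipeline skeleton. The stubs below stand in for the real captioning models, LLM, detector, and VQA tools (all toy assumptions, not VFC's actual components):

```python
def visual_fact_checker(image, propose, fact_check, summarize):
    """Skeleton of the VFC pipeline: proposal, verification, captioning."""
    proposals = propose(image)                           # 1) proposal
    checks = [fact_check(image, c) for c in proposals]   # 2) verification
    return summarize(proposals, checks)                  # 3) captioning

# Toy stand-ins so the pipeline runs end to end: the "fact checker"
# just tests whether a caption's key noun matches the "image".
propose = lambda img: ["a red cube", "a red ball"]
fact_check = lambda img, cap: ("cube" in cap) == ("cube" in img)
summarize = lambda caps, oks: next(c for c, ok in zip(caps, oks) if ok)

print(visual_fact_checker("photo of a red cube", propose, fact_check, summarize))
# -> a red cube
```

Because each stage is just a callable, swapping in real open-source models (a captioner, an LLM with tool access, a detector) preserves the same control flow.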
https://arxiv.org/abs/2404.19752
Every year, plant parasitic nematodes, one of the major groups of plant pathogens, cause a significant loss of crops worldwide. To mitigate crop yield losses caused by nematodes, an efficient nematode monitoring method is essential for plant and crop disease management. Efficient nematode detection also contributes to medical research and drug discovery, as nematodes are model organisms. With the rapid development of computer technology, computer vision techniques provide a feasible solution for quantifying nematodes or nematode infections. In this paper, we survey and categorise the studies and available datasets on nematode detection through deep-learning models. To stimulate progress in related research, this survey presents state-of-the-art object detection models, training techniques, optimisation techniques, and evaluation metrics for deep-learning beginners. Moreover, seven state-of-the-art object detection models are validated on three public datasets and the AgriNema dataset for plant parasitic nematodes to construct a baseline for nematode detection.
https://arxiv.org/abs/2404.19748
Data protection and privacy is becoming increasingly crucial in the digital era. Numerous companies depend on third-party vendors and service providers to carry out critical functions within their operations, encompassing tasks such as data handling and storage. However, this reliance introduces potential vulnerabilities, as these vendors' security measures and practices may not always align with the standards expected by regulatory bodies. Businesses are required, often under the penalty of law, to ensure compliance with the evolving regulatory rules. Interpreting and implementing these regulations pose challenges due to their complexity. Regulatory documents are extensive, demanding significant effort for interpretation, while vendor-drafted privacy policies often lack the detail required for full legal compliance, leading to ambiguity. To ensure a concise interpretation of the regulatory requirements and compliance of organizational privacy policy with said regulations, we propose a Large Language Model (LLM) and Semantic Web based approach for privacy compliance. In this paper, we develop the novel Privacy Policy Compliance Verification Knowledge Graph, PrivComp-KG. It is designed to efficiently store and retrieve comprehensive information concerning privacy policies, regulatory frameworks, and domain-specific knowledge pertaining to the legal landscape of privacy. Using Retrieval Augmented Generation, we identify the relevant sections in a privacy policy with corresponding regulatory rules. This information about individual privacy policies is populated into the PrivComp-KG. Combining this with the domain context and rules, the PrivComp-KG can be queried to check for compliance with privacy policies by each vendor against relevant policy regulations. We demonstrate the relevance of the PrivComp-KG, by verifying compliance of privacy policy documents for various organizations.
https://arxiv.org/abs/2404.19744
Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B-parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
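The "n independent heads on a shared trunk" architecture can be illustrated in a few lines. The sketch below uses toy-sized plain-Python matrices (the hidden size, vocabulary, and weights are illustrative assumptions, not the paper's 13B model):

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def multi_token_probs(hidden, heads):
    """n independent output heads over one shared trunk output.

    `hidden` is the trunk's representation at a position; heads[k] is
    the weight matrix of the head predicting token t+k+1. Each head
    maps the same hidden vector to its own vocabulary distribution.
    """
    return [softmax([sum(w * h for w, h in zip(row, hidden)) for row in head])
            for head in heads]

hidden = [1.0, -0.5]                    # shared trunk output (d=2)
heads = [                               # n=2 heads, vocab size 3
    [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]],
]
probs = multi_token_probs(hidden, heads)
preds = [p.index(max(p)) for p in probs]  # argmax token per future step
print(preds)  # -> [0, 1]
```

At training time each head gets its own cross-entropy loss against token t+k+1; at inference the extra heads can either be dropped (plain next-token decoding) or used for self-speculative decoding, which is where the reported inference speedup comes from.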
https://arxiv.org/abs/2404.19737
Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy for Llama-2-70B-Chat from 55.6% to 81.6% on GSM8K (and 88.7% with majority voting out of 32 samples), from 12.5% to 20.8% on MATH, and from 77.8% to 86.7% on ARC-Challenge, which outperforms other Llama-2-based models not relying on additionally sourced datasets.
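The modified objective is compact enough to write out. The sketch below implements the standard DPO term plus a negative log-likelihood term on the winning chain; the values of `beta` and `alpha` and the unnormalized NLL are illustrative simplifications, not the paper's exact settings:

```python
import math

def dpo_plus_nll(logp_w, logp_l, ref_logp_w, ref_logp_l,
                 beta=0.1, alpha=1.0):
    """DPO loss on a (winning, losing) CoT pair plus an extra NLL term.

    logp_* are sequence log-probs of the winning/losing chains under
    the policy; ref_logp_* under the frozen reference model. The NLL
    term keeps the winning chain likely under the policy, which the
    paper reports is crucial.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    dpo = math.log(1.0 + math.exp(-margin))   # = -log sigmoid(margin)
    nll = -logp_w
    return dpo + alpha * nll

# With alpha=0 this reduces to plain DPO; a larger policy margin for
# the winner lowers the loss.
plain = dpo_plus_nll(-1.0, -2.0, -1.5, -1.5, beta=1.0, alpha=0.0)
print(round(plain, 4))  # -> 0.3133
```

In the iterative scheme, each round generates fresh CoT candidates, labels winners by answer correctness, trains on this loss, and then repeats from the updated model.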
https://arxiv.org/abs/2404.19733
External knowledge graphs (KGs) can be used to augment large language models (LLMs), while simultaneously providing an explainable knowledge base of facts that can be inspected by a human. This approach may be particularly valuable in domains where explainability is critical, like human trafficking data analysis. However, creating KGs can pose challenges. KGs parsed from documents may comprise explicit connections (those directly stated by a document) but miss implicit connections (those obvious to a human although not directly stated). To address these challenges, this preliminary research introduces the GAME-KG framework, standing for "Gaming for Augmenting Metadata and Enhancing Knowledge Graphs." GAME-KG is a federated approach to modifying explicit as well as implicit connections in KGs by using crowdsourced feedback collected through video games. GAME-KG is shown through two demonstrations: a Unity test scenario from Dark Shadows, a video game that collects feedback on KGs parsed from US Department of Justice (DOJ) Press Releases on human trafficking, and a following experiment where OpenAI's GPT-4 is prompted to answer questions based on a modified and unmodified KG. Initial results suggest that GAME-KG can be an effective framework for enhancing KGs, while simultaneously providing an explainable set of structured facts verified by humans.
https://arxiv.org/abs/2404.19729
Federated learning (FL) enables collaborative model training while preserving data privacy, making it suitable for decentralized human-centered AI applications. However, a significant research gap remains in ensuring fairness in these systems. Current fairness strategies in FL require knowledge of bias-creating/sensitive attributes, clashing with FL's privacy principles. Moreover, in human-centered datasets, sensitive attributes may remain latent. To tackle these challenges, we present a novel bias mitigation approach inspired by "Fairness without Demographics" in machine learning. The presented approach achieves fairness without needing knowledge of sensitive attributes by minimizing the top eigenvalue of the Hessian matrix during training, ensuring equitable loss landscapes across FL participants. Notably, we introduce a novel FL aggregation scheme that promotes participating models based on error rates and loss landscape curvature attributes, fostering fairness across the FL system. This work represents the first approach to attaining "Fairness without Demographics" in human-centered FL. Through comprehensive evaluation, our approach demonstrates effectiveness in balancing fairness and efficacy across various real-world applications, FL setups, and scenarios involving single and multiple bias-inducing factors, representing a significant advancement in human-centered FL.
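Minimizing the top Hessian eigenvalue requires estimating it without forming the Hessian. A standard way is power iteration on Hessian-vector products; the sketch below demonstrates the estimator with an explicit toy matrix standing in for `hvp` (the matrix, dimension, and iteration count are illustrative assumptions):

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def top_eigenvalue(hvp, dim, iters=100):
    """Estimate the largest Hessian eigenvalue by power iteration on
    Hessian-vector products -- the quantity minimized to flatten each
    participant's loss landscape. `hvp` is any function v -> Hv; in
    practice it would come from autodiff, not an explicit matrix."""
    v = [1.0] * dim
    lam = 0.0
    for _ in range(iters):
        hv = hvp(v)
        # Rayleigh quotient v.Hv / v.v gives the current estimate.
        lam = sum(a * b for a, b in zip(hv, v)) / sum(b * b for b in v)
        norm = sum(x * x for x in hv) ** 0.5
        v = [x / norm for x in hv]
    return lam

H = [[3.0, 1.0], [1.0, 3.0]]   # toy "Hessian" with eigenvalues 2 and 4
lam = top_eigenvalue(lambda v: matvec(H, v), dim=2)
print(round(lam, 6))  # -> 4.0
```

Adding this estimate (or its gradient) as a regularizer during local training pushes each client toward flatter minima without ever touching sensitive attributes, which is the sense in which fairness is achieved "without demographics."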
https://arxiv.org/abs/2404.19725
Recent popular decoder-only text-to-speech models are known for their ability to generate natural-sounding speech. However, such models sometimes suffer from word skipping and repetition due to the lack of explicit monotonic alignment constraints. In this paper, we observe from the attention maps that some particular attention heads of the decoder-only model indicate the alignments between speech and text. We call the attention maps of those heads Alignment-Emerged Attention Maps (AEAMs). Based on this discovery, we propose a novel inference method that leaves the training process unaltered, named Attention-Constrained Inference (ACI), to facilitate monotonic synthesis. It first identifies AEAMs using the Attention Sweeping algorithm and then applies constraining masks on AEAMs. Our experimental results on the decoder-only TTS model VALL-E show that the WER of synthesized speech is reduced by up to 20.5% relative with ACI, while naturalness and speaker similarity remain comparable.
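The abstract does not spell out the mask construction, but the idea of constraining an AEAM toward monotonic alignment can be sketched as follows: zero out attention outside a small window ahead of the previously aligned text position, then renormalize. The window size and renormalization scheme here are illustrative assumptions, not ACI's exact masks:

```python
def constrain_monotonic(attn_row, prev_pos, window=2):
    """Apply a simplified monotonic constraining mask to one row of an
    alignment-emerged attention map: only positions in
    [prev_pos, prev_pos + window] may receive attention."""
    masked = [a if prev_pos <= j <= prev_pos + window else 0.0
              for j, a in enumerate(attn_row)]
    s = sum(masked) or 1.0
    normed = [a / s for a in masked]
    new_pos = max(range(len(normed)), key=normed.__getitem__)
    return normed, new_pos

# Raw attention over 4 text tokens wants to jump back to token 0
# (a repetition risk); the mask forbids it and alignment advances.
row = [0.4, 0.1, 0.3, 0.2]
row2, pos = constrain_monotonic(row, prev_pos=1, window=1)
print(row2, pos)  # masked row sums to 1; alignment moves to token 2
```

Applied frame by frame during decoding, such masks prevent the alignment head from skipping ahead or jumping backwards, which is what manifests as skipped or repeated words.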
https://arxiv.org/abs/2404.19723
We address the challenge of content diversity and controllability in pedestrian simulation for driving scenarios. Recent pedestrian animation frameworks have a significant limitation: they primarily focus on either following a given trajectory [46] or the content of a reference video [57], consequently overlooking the potential diversity of human motion within such scenarios. This limitation restricts the ability to generate pedestrian behaviors that exhibit a wider range of variations and realistic motions, and therefore restricts their use in providing rich motion content for other components of a driving simulation system, e.g., a sudden motion change to which the autonomous vehicle should respond. In our approach, we strive to surpass this limitation by showcasing diverse human motions obtained from various sources, such as generated human motions, in addition to following the given trajectory. The fundamental contribution of our framework lies in combining the motion tracking task with trajectory following, which enables a single policy to track specific motion parts (e.g., the upper body) while simultaneously following the given trajectory. This way, we significantly enhance both the diversity of simulated human motion within the given scenario and the controllability of the content, including language-based control. Our framework facilitates the generation of a wide range of human motions, contributing to greater realism and adaptability in pedestrian simulations for driving scenarios. More information is on our project page: this https URL.
https://arxiv.org/abs/2404.19722
This research introduces Procedural Artificial Narrative using Generative AI (PANGeA), a structured approach for leveraging large language models (LLMs), guided by a game designer's high-level criteria, to generate narrative content for turn-based role-playing video games (RPGs). Distinct from prior applications of LLMs used for video game design, PANGeA innovates by not only generating game level data (which includes, but is not limited to, setting, key items, and non-playable characters (NPCs)), but by also fostering dynamic, free-form interactions between the player and the environment that align with the procedural game narrative. The NPCs generated by PANGeA are personality-biased and express traits from the Big 5 Personality Model in their generated responses. PANGeA addresses challenges behind ingesting free-form text input, which can prompt LLM responses beyond the scope of the game narrative. A novel validation system that uses the LLM's intelligence evaluates text input and aligns generated responses with the unfolding narrative. Making these interactions possible, PANGeA is supported by a server that hosts a custom memory system that supplies context for augmenting generated responses thus aligning them with the procedural narrative. For its broad application, the server has a REST interface enabling any game engine to integrate directly with PANGeA, as well as an LLM interface adaptable with local or private LLMs. PANGeA's ability to foster dynamic narrative generation by aligning responses with the procedural narrative is demonstrated through an empirical study and ablation test of two versions of a demo game. These are, a custom, browser-based GPT and a Unity demo. As the results show, PANGeA holds potential to assist game designers in using LLMs to generate narrative-consistent content even when provided varied and unpredictable, free-form text input.
https://arxiv.org/abs/2404.19721
This paper describes our participation in Task 3 and Task 5 of the #SMM4H (Social Media Mining for Health) 2024 Workshop, explicitly targeting the classification challenges within tweet data. Task 3 is a multi-class classification task centered on tweets discussing the impact of outdoor environments on symptoms of social anxiety. Task 5 involves a binary classification task focusing on tweets reporting medical disorders in children. We applied transfer learning from pre-trained encoder-decoder models such as BART-base and T5-small to identify the labels of a set of given tweets. We also presented some data augmentation methods to see their impact on the model performance. Finally, the systems obtained the best F1 score of 0.627 in Task 3 and the best F1 score of 0.841 in Task 5.
https://arxiv.org/abs/2404.19714
This study introduces a transformative framework for medical education by integrating semi-structured data with Large Language Models (LLMs), primarily OpenAIs ChatGPT3.5, to automate the creation of medical simulation scenarios. Traditionally, developing these scenarios was a time-intensive process with limited flexibility to meet diverse educational needs. The proposed approach utilizes AI to efficiently generate detailed, clinically relevant scenarios that are tailored to specific educational objectives. This innovation has significantly reduced the time and resources required for scenario development, allowing for a broader variety of simulations. Preliminary feedback from educators and learners has shown enhanced engagement and improved knowledge acquisition, confirming the effectiveness of this AI-enhanced methodology in simulation-based learning. The integration of structured data with LLMs not only streamlines the creation process but also offers a scalable, dynamic solution that could revolutionize medical training, highlighting the critical role of AI in advancing educational outcomes and patient care standards.
https://arxiv.org/abs/2404.19713
We introduce an intuitive method to test the robustness (stability and explainability) of any black-box LLM in real-time, based upon the local deviation from harmonicity, denoted as $\gamma$. To the best of our knowledge this is the first completely model-agnostic and unsupervised method of measuring the robustness of any given response from an LLM, based upon the model itself conforming to a purely mathematical standard. We conduct human annotation experiments to show the positive correlation of $\gamma$ with false or misleading answers, and demonstrate that following the gradient of $\gamma$ in stochastic gradient ascent efficiently exposes adversarial prompts. Measuring $\gamma$ across thousands of queries in popular LLMs (GPT-4, ChatGPT, Claude-2.1, Mixtral-8x7B, Smaug-72B, Llama2-7B, and MPT-7B) allows us to estimate the likelihood of wrong or hallucinatory answers automatically and to quantitatively rank the reliability of these models in various objective domains (Web QA, TruthfulQA, and Programming QA). Across all models and domains tested, human ratings confirm that $\gamma \to 0$ indicates trustworthiness, and the low-$\gamma$ leaders among these models are GPT-4, ChatGPT, and Smaug-72B.
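The abstract does not give the exact estimator, but the mean-value characterization of harmonic functions suggests a simple numerical stand-in for $\gamma$: compare $f(x)$ with its average over small symmetric perturbations of $x$. The sampling scheme, radius, and test functions below are all assumptions for illustration, not the paper's construction:

```python
import random

def gamma(f, x, eps=0.1, n=200, seed=0):
    """Local deviation from harmonicity: |f(x) - sphere average of f|,
    estimated with antithetic pairs (x + d, x - d) so odd-order terms
    cancel exactly. A harmonic f yields gamma ~ 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        d = [rng.gauss(0.0, 1.0) for _ in x]
        norm = sum(v * v for v in d) ** 0.5
        d = [eps * v / norm for v in d]               # radius-eps direction
        total += 0.5 * (f([xi + di for xi, di in zip(x, d)])
                        + f([xi - di for xi, di in zip(x, d)]))
    return abs(f(x) - total / n)

harmonic = lambda p: p[0] * p[1]          # xy is harmonic in 2D
bumpy = lambda p: p[0] ** 2 + p[1] ** 2   # Laplacian = 4: not harmonic
print(gamma(harmonic, [1.0, 1.0]), gamma(bumpy, [1.0, 1.0]))
```

For an LLM, $f$ is not a closed-form scalar field but the model's output under perturbed prompts or embeddings, so the same comparison would be driven by queries rather than function calls; this toy version only illustrates why $\gamma \to 0$ characterizes "harmonic" (stable) behavior.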
https://arxiv.org/abs/2404.19708
We propose RTG-SLAM, a real-time 3D reconstruction system with an RGBD camera for large-scale environments using Gaussian splatting. RTG-SLAM features a compact Gaussian representation and a highly efficient on-the-fly Gaussian optimization scheme. We force each Gaussian to be either opaque or nearly transparent, with the opaque ones fitting the surface and dominant colors, and transparent ones fitting residual colors. By rendering depth in a different way from color rendering, we let a single opaque Gaussian well fit a local surface region without the need of multiple overlapping Gaussians, hence largely reducing the memory and computation cost. For on-the-fly Gaussian optimization, we explicitly add Gaussians for three types of pixels per frame: newly observed, with large color errors and with large depth errors. We also categorize all Gaussians into stable and unstable ones, where the stable Gaussians are expected to well fit previously observed RGBD images and otherwise unstable. We only optimize the unstable Gaussians and only render the pixels occupied by unstable Gaussians. In this way, both the number of Gaussians to be optimized and pixels to be rendered are largely reduced, and the optimization can be done in real time. We show real-time reconstructions of a variety of real large scenes. Compared with the state-of-the-art NeRF-based RGBD SLAM, our system achieves comparable high-quality reconstruction but with around twice the speed and half the memory cost, and shows superior performance in the realism of novel view synthesis and camera tracking accuracy.
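The per-frame rule for spawning new Gaussians (newly observed pixels, or pixels with large color or depth error) is simple enough to sketch directly. The thresholds and flat pixel lists below are illustrative assumptions, not RTG-SLAM's actual parameters:

```python
def pixels_needing_gaussians(observed_mask, color_err, depth_err,
                             tau_c=0.1, tau_d=0.05):
    """Select the pixels where new Gaussians would be added this frame:
    newly observed pixels, or pixels whose rendered color / depth
    error exceeds a threshold."""
    return [i for i, seen in enumerate(observed_mask)
            if not seen or color_err[i] > tau_c or depth_err[i] > tau_d]

# Four pixels: pixel 1 is newly observed, pixel 2 has a large color
# error, pixel 3 a large depth error; pixel 0 is well explained.
idx = pixels_needing_gaussians(
    observed_mask=[True, False, True, True],
    color_err=[0.02, 0.0, 0.3, 0.01],
    depth_err=[0.01, 0.0, 0.0, 0.2])
print(idx)  # -> [1, 2, 3]
```

The same error signals drive the stable/unstable split: Gaussians that keep rendering past frames accurately are frozen, so optimization and rendering touch only the small unstable set, which is what makes the on-the-fly optimization real-time.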
https://arxiv.org/abs/2404.19706
In this paper, we demonstrate how Large Language Models (LLMs) can effectively learn to use an off-the-shelf information retrieval (IR) system specifically when additional context is required to answer a given question. Given the performance of IR systems, the optimal strategy for question answering does not always entail external information retrieval; rather, it often involves leveraging the parametric memory of the LLM itself. Prior research has identified this phenomenon in the PopQA dataset, wherein the most popular questions are effectively addressed using the LLM's parametric memory, while less popular ones require IR system usage. Following this, we propose a tailored training approach for LLMs, leveraging existing open-domain question answering datasets. Here, LLMs are trained to generate a special token, <RET>, when they do not know the answer to a question. Our evaluation of the Adaptive Retrieval LLM (Adapt-LLM) on the PopQA dataset showcases improvements over the same LLM under three configurations: (i) retrieving information for all the questions, (ii) using always the parametric memory of the LLM, and (iii) using a popularity threshold to decide when to use a retriever. Through our analysis, we demonstrate that Adapt-LLM is able to generate the <RET> token when it determines that it does not know how to answer a question, indicating the need for IR, while it achieves notably high accuracy levels when it chooses to rely only on its parametric memory.
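The inference-time behavior around the <RET> token reduces to a short control flow: answer from parametric memory unless the model emits <RET>, in which case retrieve and re-ask with context. The `model` and `retriever` stubs below are toy assumptions standing in for the fine-tuned LLM and the off-the-shelf IR system:

```python
RET = "<RET>"

def answer_with_adaptive_retrieval(question, model, retriever):
    """Adapt-LLM-style inference loop: a first pass with no context;
    if the model signals <RET>, fetch a passage and ask again."""
    first = model(question, context=None)
    if first != RET:
        return first                      # parametric memory sufficed
    passage = retriever(question)
    return model(question, context=passage)

# Toy stand-ins: the "model" knows one popular fact from its
# parametric memory and defers to retrieval otherwise.
memory = {"capital of France?": "Paris"}
def model(q, context=None):
    if q in memory:
        return memory[q]
    return context if context is not None else RET
retriever = lambda q: "Canberra"

print(answer_with_adaptive_retrieval("capital of France?", model, retriever))
print(answer_with_adaptive_retrieval("capital of Australia?", model, retriever))
```

Training teaches the model *when* to emit <RET> (supervised from cases its parametric memory gets wrong), so popular questions skip the retriever entirely while tail questions trigger it.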
https://arxiv.org/abs/2404.19705