As large language models (LLMs) evolve, their integration with 3D spatial data (3D-LLMs) has seen rapid progress, offering unprecedented capabilities for understanding and interacting with physical spaces. This survey provides a comprehensive overview of the methodologies enabling LLMs to process, understand, and generate 3D data. Highlighting the unique advantages of LLMs, such as in-context learning, step-by-step reasoning, open-vocabulary capabilities, and extensive world knowledge, we underscore their potential to significantly advance spatial comprehension and interaction within embodied Artificial Intelligence (AI) systems. Our investigation spans various 3D data representations, from point clouds to Neural Radiance Fields (NeRFs). It examines their integration with LLMs for tasks such as 3D scene understanding, captioning, question-answering, and dialogue, as well as LLM-based agents for spatial reasoning, planning, and navigation. The paper also includes a brief review of other methods that integrate 3D and language. The meta-analysis presented in this paper reveals significant progress yet underscores the necessity for novel approaches to harness the full potential of 3D-LLMs. Hence, with this paper, we aim to chart a course for future research that explores and expands the capabilities of 3D-LLMs in understanding and interacting with the complex 3D world. To support this survey, we have established a project page where papers related to our topic are organized and listed: this https URL.
https://arxiv.org/abs/2405.10255
In the typical urban intersection scenario, both vehicles and infrastructure are equipped with visual and LiDAR sensors. By successfully integrating data from vehicle-side and road monitoring devices, more comprehensive and accurate environmental perception and information acquisition can be achieved. The calibration of sensors, as an essential component of autonomous driving technology, has consistently drawn significant attention. Particularly in scenarios where multiple sensors collaboratively perceive the environment and address localization challenges, inter-sensor calibration becomes crucial. Recent years have witnessed the emergence of the concept of multi-end cooperation, where infrastructure captures and transmits surrounding-environment information to vehicles, bolstering their perception capabilities while mitigating costs. However, this also poses technical complexities, underscoring the pressing need for calibration across diverse ends. Camera and LiDAR, the bedrock sensors of autonomous driving, have broad applicability. This paper comprehensively examines and analyzes the calibration of multi-end camera-LiDAR setups from the vehicle, roadside, and vehicle-road cooperation perspectives, outlining their relevant applications and profound significance. We conclude with a summary and present our future-oriented ideas and hypotheses.
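Concretely, the target-based variant of camera-LiDAR extrinsic calibration that this line of work builds on reduces to estimating a rigid transform between matched 3D points observed by the two sensors. A minimal sketch of that step (the standard Kabsch algorithm on hypothetical correspondences; not code from the paper):

```python
import numpy as np

def estimate_extrinsics(src, dst):
    """Rigid transform (R, t) minimizing ||R @ src_i + t - dst_i|| (Kabsch)."""
    c_src, c_dst = src.mean(axis=0), dst.mean(axis=0)
    H = (src - c_src).T @ (dst - c_dst)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

# Hypothetical matched points (e.g., calibration-target corners seen by both sensors)
rng = np.random.default_rng(0)
pts_lidar = rng.normal(size=(12, 3))
theta = 0.3                                    # hypothetical ground-truth extrinsics
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.5, -0.2, 1.0])
pts_cam = pts_lidar @ R_true.T + t_true
R_est, t_est = estimate_extrinsics(pts_lidar, pts_cam)
```

In practice the hard part the paper surveys is obtaining reliable cross-sensor correspondences at each end; the SVD fit above is only the final least-squares step.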
https://arxiv.org/abs/2405.10132
In this paper, we review the NTIRE 2024 challenge on Restore Any Image Model (RAIM) in the Wild. The RAIM challenge constructed a benchmark for image restoration in the wild, including real-world images with and without reference ground truth in various scenarios from real applications. The participants were required to restore real-captured images from complex and unknown degradation, where both generative perceptual quality and fidelity are desired in the restoration result. The challenge consisted of two tasks. Task one employed real referenced data pairs, where quantitative evaluation is available. Task two used unpaired images, and a comprehensive user study was conducted. The challenge attracted more than 200 registrations, 39 of which submitted results, with more than 400 submissions in total. Top-ranked methods improved the state-of-the-art restoration performance and obtained unanimous recognition from all 18 judges. The proposed datasets are available at this https URL and the homepage of this challenge is at this https URL.
https://arxiv.org/abs/2405.09923
Learning commonsense reasoning from visual contexts and scenes in the real world is a crucial step toward advanced artificial intelligence. However, existing video reasoning benchmarks are still inadequate, since they were mainly designed for factual or situated reasoning and rarely involve broader knowledge of the real world. Our work aims to delve deeper into reasoning evaluations, specifically within dynamic, open-world, and structured context knowledge. We propose a new benchmark (SOK-Bench), consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos. The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving. To create such a dataset, we propose an automatic and scalable generation method to generate question-answer pairs, knowledge graphs, and rationales by instructing combinations of LLMs and MLLMs. Concretely, we first extract observable situated entities, relations, and processes from videos for situated knowledge, and then extend to open-world knowledge beyond the visible content. The task generation is facilitated through multiple dialogues as iterations and subsequently corrected and refined by our designed self-promptings and demonstrations. With a corpus of both explicit situated facts and implicit commonsense, we generate associated question-answer pairs and reasoning processes, followed by manual reviews for quality assurance. We evaluated recent mainstream large vision-language models on the benchmark and reached several insightful conclusions. For more information, please refer to our benchmark at this http URL.
https://arxiv.org/abs/2405.09713
Objective: Federated Learning (FL) enables collaborative model training while keeping data local. Currently, most FL studies in radiology are conducted in simulated environments due to numerous hurdles impeding its translation into practice. The few existing real-world FL initiatives rarely communicate the specific measures taken to overcome these hurdles, leaving behind a significant knowledge gap. Despite efforts to implement real-world FL, there is a notable lack of comprehensive assessments comparing FL to less complex alternatives. Materials & Methods: We extensively reviewed the FL literature, categorized insights, along with our own findings, according to their nature and the phase of establishing an FL initiative, and summarized them into a comprehensive guide. We developed our own FL infrastructure within the German Radiological Cooperative Network (RACOON) and demonstrated its functionality by training FL models on lung pathology segmentation tasks across six university hospitals. We extensively evaluated FL against less complex alternatives in three distinct evaluation scenarios. Results: The proposed guide outlines essential steps, identified hurdles, and proposed solutions for establishing successful FL initiatives that conduct real-world experiments. Our experimental results show that FL outperforms less complex alternatives in all evaluation scenarios, justifying the effort required to translate FL into real-world applications. Discussion & Conclusion: Our proposed guide aims to aid future FL researchers in circumventing pitfalls and accelerating the translation of FL into radiological applications. Our results underscore the value of the effort needed to translate FL into real-world applications by demonstrating advantageous performance over alternatives, and emphasize the importance of strategic organization and robust management of distributed data and infrastructure in real-world settings.
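The federated training pattern described here can be sketched with a toy FedAvg loop, where six simulated "hospitals" each fit a shared linear model on private data and a central server averages the weights (an illustrative stand-in on synthetic data, not the RACOON infrastructure or a segmentation model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Six clients, each holding a private dataset for a shared regression task
# (hypothetical data standing in for per-hospital imaging data).
w_true = np.array([2.0, -1.0])
clients = []
for _ in range(6):
    X = rng.normal(size=(100, 2))
    y = X @ w_true + 0.1 * rng.normal(size=100)
    clients.append((X, y))                     # never leaves the client

def local_update(w, X, y, lr=0.05, epochs=5):
    # Plain gradient descent on the client's private data only.
    for _ in range(epochs):
        w = w - lr * 2.0 * X.T @ (X @ w - y) / len(y)
    return w

w = np.zeros(2)                                # global model
for _ in range(20):                            # communication rounds
    local_ws = [local_update(w.copy(), X, y) for X, y in clients]
    w = np.mean(local_ws, axis=0)              # FedAvg: average parameters
```

Only model parameters cross the client boundary; the per-client datasets stay put, which is the property that makes the approach attractive for radiology.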
https://arxiv.org/abs/2405.09409
Generative artificial intelligence (AI) technologies and large models are producing realistic outputs across various domains, such as images, text, speech, and music. Creating these advanced generative models requires significant resources, particularly large and high-quality datasets. To minimize training expenses, many algorithm developers use data created by the models themselves as a cost-effective training solution. However, not all synthetic data effectively improve model performance, necessitating a strategic balance in the use of real versus synthetic data to optimize outcomes. Currently, the previously well-controlled integration of real and synthetic data is becoming uncontrollable. The widespread and unregulated dissemination of synthetic data online leads to the contamination of datasets traditionally compiled through web scraping, now mixed with unlabeled synthetic data. This trend portends a future where generative AI systems may increasingly rely blindly on consuming self-generated data, raising concerns about model performance and ethical issues. What will happen if generative AI continuously consumes itself without discernment? What measures can we take to mitigate the potential adverse effects? There is a significant gap in the scientific literature regarding the impact of synthetic data use in generative AI, particularly in terms of the fusion of multimodal information. To address this research gap, this review investigates the consequences of integrating synthetic data blindly on training generative AI on both image and text modalities and explores strategies to mitigate these effects. The goal is to offer a comprehensive view of synthetic data's role, advocating for a balanced approach to its use and exploring practices that promote the sustainable development of generative AI technologies in the era of large models.
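The self-consumption loop this review studies can be illustrated with a toy one-dimensional "generative model": a Gaussian repeatedly refit on its own samples. This is purely illustrative, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
real_data = rng.normal(0.0, 1.0, size=30)      # the original "real" corpus

def fit(samples):
    # A trivial generative "model": a Gaussian estimated from its training set.
    return samples.mean(), samples.std()

# Loop 1: each generation trains purely on the previous generation's output.
mu, sigma = fit(real_data)
initial_sigma = sigma
for _ in range(2000):
    synthetic = rng.normal(mu, sigma, size=30)
    mu, sigma = fit(synthetic)

# Loop 2: each generation mixes the retained real data with synthetic data.
mu_m, sigma_m = fit(real_data)
for _ in range(2000):
    synthetic = rng.normal(mu_m, sigma_m, size=30)
    mu_m, sigma_m = fit(np.concatenate([real_data, synthetic]))
```

In the pure self-consumption loop the estimated spread collapses over generations, while keeping real data in every generation's training mix anchors it, mirroring the real-versus-synthetic balance the review advocates.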
https://arxiv.org/abs/2405.09597
This paper focuses on the integration of generative techniques into spatial-temporal data mining, considering the significant growth and diverse nature of spatial-temporal data. With the advancements in RNNs, CNNs, and other non-generative techniques, researchers have explored their application in capturing temporal and spatial dependencies within spatial-temporal data. However, the emergence of generative techniques such as LLMs, SSL, Seq2Seq and diffusion models has opened up new possibilities for enhancing spatial-temporal data mining further. The paper provides a comprehensive analysis of generative technique-based spatial-temporal methods and introduces a standardized framework specifically designed for the spatial-temporal data mining pipeline. By offering a detailed review and a novel taxonomy of spatial-temporal methodology utilizing generative techniques, the paper enables a deeper understanding of the various techniques employed in this field. Furthermore, the paper highlights promising future research directions, urging researchers to delve deeper into spatial-temporal data mining. It emphasizes the need to explore untapped opportunities and push the boundaries of knowledge to unlock new insights and improve the effectiveness and efficiency of spatial-temporal data mining. By integrating generative techniques and providing a standardized framework, the paper contributes to advancing the field and encourages researchers to explore the vast potential of generative techniques in spatial-temporal data mining.
https://arxiv.org/abs/2405.09592
The rapid advancement of foundation models (FMs) across language, image, audio, and video domains has shown remarkable capabilities in diverse tasks. However, the proliferation of FMs brings forth a critical challenge: the potential to generate hallucinated outputs, particularly in high-stakes applications. The tendency of foundation models to produce hallucinated content arguably represents the biggest hindrance to their widespread adoption in real-world scenarios, especially in domains where reliability and accuracy are paramount. This survey paper presents a comprehensive overview of recent developments that aim to identify and mitigate the problem of hallucination in FMs, spanning text, image, video, and audio modalities. By synthesizing recent advancements in detecting and mitigating hallucination across various modalities, the paper aims to provide valuable insights for researchers, developers, and practitioners. Essentially, it establishes a clear framework encompassing definition, taxonomy, and detection strategies for addressing hallucination in multimodal foundation models, laying the foundation for future research in this pivotal area.
https://arxiv.org/abs/2405.09589
In recent years, street view imagery has grown to become one of the most important sources of geospatial data collection and urban analytics, which facilitates generating meaningful insights and assisting in decision-making. Synthesizing a street-view image from its corresponding satellite image is a challenging task due to the significant differences in appearance and viewpoint between the two domains. In this study, we screened 20 recent research papers to provide a thorough review of the state-of-the-art of how street-view images are synthesized from their corresponding satellite counterparts. The main findings are: (i) novel deep learning techniques are required for synthesizing more realistic and accurate street-view images; (ii) more datasets need to be collected for public usage; and (iii) more specific evaluation metrics need to be investigated for evaluating the generated images appropriately. We conclude that, due to applying outdated deep learning techniques, the recent literature failed to generate detailed and diverse street-view images.
https://arxiv.org/abs/2405.08961
The increasing complexity of automated driving functions and their growing operational design domains imply more demanding requirements on their validation. Classical methods such as field tests or formal analyses are not sufficient anymore and need to be complemented by simulations. For simulations, the standard approach is scenario-based testing, as opposed to distance-based testing primarily performed in field tests. Currently, the time evolution of specific scenarios is mainly described using trajectories, which limit or at least hamper generalizations towards variations. As an alternative, maneuver-based approaches have been proposed. We shed light on the state of the art and available foundations for this new method through a literature review of early and recent works related to maneuver-based scenario description. It includes related modeling approaches originally developed for other applications. Current limitations and research gaps are identified.
https://arxiv.org/abs/2405.08626
Neural Radiance Field (NeRF) is a novel implicit method for achieving high-resolution 3D reconstruction and representation. Since the first NeRF paper was proposed, NeRF research has gained strong momentum and is booming in the 3D modeling, representation, and reconstruction areas. However, the first and most subsequent NeRF-based research projects are static, which limits their practical applications. Therefore, more researchers have become interested in and focused on dynamic NeRF, which is more feasible and useful in practical applications and situations. Compared with static NeRF, implementing dynamic NeRF is more difficult and complex, but it holds greater potential for the future and even forms the basis of editable NeRF. In this review, we give a detailed and thorough account of the development and key implementation principles of dynamic NeRF. Our analysis of the main principles and development of dynamic NeRF covers 2021 to 2023 and includes most dynamic NeRF projects. Moreover, using specially designed figures and tables, we present a detailed comparison and analysis of the different features of the various dynamic NeRF methods. We also analyze and discuss the key methods for implementing a dynamic NeRF. The body of referenced papers is large, and the statements and comparisons are multidimensional. By reading this review, the entire development history and most of the main design methods and principles of dynamic NeRF can be readily understood.
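For context, the mechanism that static and dynamic NeRF variants share is volume rendering along camera rays; a minimal numerical sketch of that quadrature (illustrative only, with hypothetical sample values):

```python
import numpy as np

def render_ray(sigmas, colors, deltas):
    """NeRF volume-rendering quadrature for one ray.

    alpha_i = 1 - exp(-sigma_i * delta_i)         (per-sample opacity)
    T_i     = prod_{j<i} (1 - alpha_j)            (transmittance)
    C       = sum_i T_i * alpha_i * c_i           (composited color)
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)
    T = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = T * alphas
    return (weights[:, None] * colors).sum(axis=0), weights
```

A dynamic NeRF changes how `sigmas` and `colors` are produced (conditioning the field on time or deforming sample positions), but this compositing step stays the same.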
https://arxiv.org/abs/2405.08609
Machine learning (ML)-based content moderation tools are essential to keep online spaces free from hateful communication. Yet, ML tools can only be as capable as the quality of the data they are trained on allows them to be. While there is increasing evidence that they underperform in detecting hateful communications directed towards specific identities and may discriminate against them, we know surprisingly little about the provenance of such bias. To fill this gap, we present a systematic review of the datasets for the automated detection of hateful communication introduced over the past decade, and unpack the quality of the datasets in terms of the identities that they embody: those of the targets of hateful communication that the data curators focused on, as well as those unintentionally included in the datasets. We find, overall, a skewed representation of selected target identities and mismatches between the targets that research conceptualizes and ultimately includes in datasets. Yet, by contextualizing these findings in the language and location of origin of the datasets, we highlight a positive trend towards the broadening and diversification of this research space.
https://arxiv.org/abs/2405.08562
As humans, we aspire to create media content that is both freely willed and readily controlled. Thanks to the prominent development of generative techniques, we now can easily utilize 2D diffusion methods to synthesize images controlled by raw sketch or designated human poses, and even progressively edit/regenerate local regions with masked inpainting. However, similar workflows in 3D modeling tasks are still unavailable due to the lack of controllability and efficiency in 3D generation. In this paper, we present a novel controllable and interactive 3D assets modeling framework, named Coin3D. Coin3D allows users to control the 3D generation using a coarse geometry proxy assembled from basic shapes, and introduces an interactive generation workflow to support seamless local part editing while delivering responsive 3D object previewing within a few seconds. To this end, we develop several techniques, including the 3D adapter that applies volumetric coarse shape control to the diffusion model, proxy-bounded editing strategy for precise part editing, progressive volume cache to support responsive preview, and volume-SDS to ensure consistent mesh reconstruction. Extensive experiments of interactive generation and editing on diverse shape proxies demonstrate that our method achieves superior controllability and flexibility in the 3D assets generation task.
https://arxiv.org/abs/2405.08054
Detecting an ingestion environment is an important aspect of monitoring dietary intake. It provides insightful information for dietary assessment. However, it is a challenging problem where human-based reviewing can be tedious, and algorithm-based review suffers from data imbalance and perceptual aliasing problems. To address these issues, we propose a neural network-based method with a two-stage training framework that tactfully combines fine-tuning and transfer learning techniques. Our method is evaluated on a newly collected dataset called ``UA Free Living Study", which uses an egocentric wearable camera, AIM-2 sensor, to simulate food consumption in free-living conditions. The proposed training framework is applied to common neural network backbones, combined with approaches in the general imbalanced classification field. Experimental results on the collected dataset show that our proposed method for automatic ingestion environment recognition successfully addresses the challenging data imbalance problem in the dataset and achieves a promising overall classification accuracy of 96.63%.
https://arxiv.org/abs/2405.07827
Object pose estimation is a fundamental computer vision problem with broad applications in augmented reality and robotics. Over the past decade, deep learning models, due to their superior accuracy and robustness, have increasingly supplanted conventional algorithms reliant on engineered point pair features. Nevertheless, several challenges persist in contemporary methods, including their dependency on labeled training data, model compactness, robustness under challenging conditions, and their ability to generalize to novel unseen objects. A recent survey discussing the progress made on different aspects of this area, outstanding challenges, and promising future directions is missing. To fill this gap, we discuss the recent advances in deep learning-based object pose estimation, covering all three formulations of the problem, i.e., instance-level, category-level, and unseen object pose estimation. Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks, providing readers with a holistic understanding of this field. Additionally, it discusses training paradigms of different domains, inference modes, application areas, evaluation metrics, and benchmark datasets, as well as reports the performance of current state-of-the-art methods on these benchmarks, thereby facilitating readers in selecting the most suitable method for their application. Finally, the survey identifies key challenges, reviews prevailing trends along with their pros and cons, and identifies promising directions for future research. We also keep tracing the latest works at this https URL.
https://arxiv.org/abs/2405.07801
"Human-aware" has become a popular keyword used to describe a particular class of AI systems that are designed to work and interact with humans. While there exists a surprising level of consistency among the works that use the label human-aware, the term itself mostly remains poorly understood. In this work, we retroactively try to provide an account of what constitutes a human-aware AI system. We see that human-aware AI is a design-oriented paradigm, one that focuses on the need for modeling the humans it may interact with. Additionally, we see that this paradigm offers us intuitive dimensions to understand and categorize the kinds of interactions these systems might have with humans. We show the pedagogical value of these dimensions by using them as a tool to understand and review the current landscape of work related to human-AI systems that purport some form of human modeling. To fit the scope of a workshop paper, we specifically narrowed our review to papers that deal with sequential decision-making and were published in a major AI conference in the last three years. Our analysis helps identify the space of potential research problems that are currently being overlooked. We perform additional analysis on the degree to which these works make explicit reference to results from social science and whether they actually perform user-studies to validate their systems. We also provide an accounting of the various AI methods used by these works.
"Human-aware"已成为用于描述一类旨在与人类互动和工作的AI系统的流行关键词。尽管在使用该标签的作品之间存在相当程度的一致性,但该术语本身仍然存在很大的误解。在这篇工作中,我们试图通过回顾来提供关于如何定义一个 human-aware AI 系统的说明。我们发现,human-aware AI 是一个设计导向的范式,该范式关注于可能与其互动的人类的需求建模。此外,我们还发现,这个范式提供了一个直观的维度来理解并分类这些系统与人类之间的互动。我们用这些维度作为工具来了解和审查声称某种形式的人类建模的现有工作格局。为了符合一场研讨会的论文范围,我们特别将分析缩小到在最近三年内在大型AI会议上发表的涉及序列决策的论文。我们的分析有助于识别当前被忽视的研究问题领域。我们进一步分析了这些作品明确提到社会科学结果以及实际上进行用户研究来验证其系统的程度。我们还提供了这些作品中使用的各种AI方法的详细说明。
https://arxiv.org/abs/2405.07773
Over the course of the recent decade, tremendous progress has been made in the areas of machine learning and natural language processing, which opened up vast areas of potential application use cases, including hiring and human resource management. We review the use cases for text analytics in the realm of human resources/personnel management, including actually realized as well as potential but not yet implemented ones, and we analyze the opportunities and risks of these.
https://arxiv.org/abs/2405.07766
The development of generative artificial intelligence for human motion generation has expanded rapidly, necessitating a unified evaluation framework. This paper presents a detailed review of eight evaluation metrics for human motion generation, highlighting their unique features and shortcomings. We propose standardized practices through a unified evaluation setup to facilitate consistent model comparisons. Additionally, we introduce a novel metric that assesses diversity in temporal distortion by analyzing warping diversity, thereby enhancing the evaluation of temporal data. We also conduct experimental analyses of three generative models using a publicly available dataset, offering insights into the interpretation of each metric in specific case scenarios. Our goal is to offer a clear, user-friendly evaluation framework for newcomers, complemented by publicly accessible code.
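The paper's exact metric is not reproduced here, but as a generic sketch of how diversity under temporal warping can be quantified, one could average pairwise dynamic-time-warping (DTW) distances over a set of generated sequences (a hypothetical illustration, not the authors' implementation):

```python
import numpy as np

def dtw_distance(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming DTW on 1-D sequences.
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def warping_diversity(sequences):
    # Average pairwise DTW distance: higher means more temporal variety
    # among the generated motions (a hypothetical diversity score).
    dists = [dtw_distance(s, t)
             for i, s in enumerate(sequences)
             for t in sequences[i + 1:]]
    return float(np.mean(dists))
```

A model that emits near-identical motions scores near zero, while one whose outputs differ in timing and shape scores higher, which is the kind of temporal-distortion diversity the proposed metric targets.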
https://arxiv.org/abs/2405.07680
High-quality images are crucial in remote sensing and UAV applications, but atmospheric haze can severely degrade image quality, making image dehazing a critical research area. Since the introduction of deep convolutional neural networks, numerous approaches have been proposed, and even more have emerged with the development of vision transformers and contrastive/few-shot learning. Simultaneously, papers describing dehazing architectures applicable to various Remote Sensing (RS) domains are also being published. This review goes beyond the traditional focus on benchmarked haze datasets, as we also explore the application of dehazing techniques to remote sensing and UAV datasets, providing a comprehensive overview of both deep learning and prior-based approaches in these domains. We identify key challenges, including the lack of large-scale RS datasets and the need for more robust evaluation metrics, and outline potential solutions and future research directions to address them. This review is the first, to our knowledge, to provide comprehensive discussions on both existing and very recent dehazing approaches (as of 2024) on benchmarked and RS datasets, including UAV-based imagery.
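Among the prior-based methods such a review covers, the dark channel prior is the classic example; a minimal sketch of its two core quantities (a simplified rendition of He et al.'s formulation, with hypothetical parameter values):

```python
import numpy as np

def dark_channel(img, patch=15):
    # Min over RGB, then a local min filter: haze-free regions tend to have
    # at least one very dark channel, so low values indicate little haze.
    h, w, _ = img.shape
    mins = img.min(axis=2)
    pad = patch // 2
    padded = np.pad(mins, pad, mode="edge")
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + patch, j:j + patch].min()
    return out

def transmission(img, A, omega=0.95, patch=15):
    # Dark channel prior estimate: t = 1 - omega * dark_channel(I / A),
    # where A is the (scalar, here) atmospheric light.
    return 1.0 - omega * dark_channel(img / A, patch)
```

Heavily hazy pixels get a transmission near zero and clear pixels near one; deep learning dehazers replace this hand-crafted estimate with a learned one, which is the trade-off the review examines across RS and UAV datasets.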
https://arxiv.org/abs/2405.07520
As the right to be forgotten has been legislated worldwide, many studies attempt to design unlearning mechanisms to protect users' privacy when they want to leave machine learning service platforms. Specifically, machine unlearning makes a trained model remove the contribution of an erased subset of its training dataset. This survey aims to systematically classify a wide range of machine unlearning methods and discuss their differences, connections, and open problems. We categorize current unlearning methods into four scenarios: centralized unlearning; distributed and irregular-data unlearning; unlearning verification; and privacy and security issues in unlearning. Since centralized unlearning is the primary domain, we introduce it in two parts: first, we classify centralized unlearning into exact unlearning and approximate unlearning; second, we offer a detailed introduction to the techniques behind these methods. Beyond centralized unlearning, we note several studies on distributed and irregular-data unlearning and introduce federated unlearning and graph unlearning as two representative directions. After introducing unlearning methods, we review studies on unlearning verification. Moreover, we consider the privacy and security issues essential to machine unlearning and organize the latest related literature. Finally, we discuss the challenges of the various unlearning scenarios and address potential research directions.
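The exact-versus-approximate split can be made concrete with ridge regression: exact unlearning retrains from scratch on the retained data, while approximate unlearning applies a Newton-style update to the already-trained weights (for a quadratic loss the two coincide; for deep models the update is only an approximation). A toy sketch, not drawn from any particular surveyed method:

```python
import numpy as np

LAM = 1e-3                                     # ridge regularization strength

def fit(X, y):
    """Ridge regression: argmin ||Xw - y||^2 + LAM * ||w||^2."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + LAM * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_full = fit(X, y)                             # model trained on everything
erase, keep = np.arange(20), np.arange(20, n)  # a user asks to forget 20 rows

# Exact unlearning: retrain from scratch without the erased subset.
w_exact = fit(X[keep], y[keep])

# Approximate unlearning: a single Newton-style update of the trained
# weights that removes the erased points' contribution. Because the
# gradient of the full objective vanishes at w_full, the step reduces to
# adding back H_keep^{-1} times the erased points' gradient.
H_keep = X[keep].T @ X[keep] + LAM * np.eye(d)
grad_erased = X[erase].T @ (X[erase] @ w_full - y[erase])
w_approx = w_full + np.linalg.solve(H_keep, grad_erased)
```

Exact unlearning gives the strongest guarantee but costs a full retrain; the approximate update is cheap, and verifying how close it lands to the retrained model is precisely what the unlearning-verification literature studies.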
https://arxiv.org/abs/2405.07406