User-friendly 3D object editing is a challenging task that has attracted significant attention recently. The limitations of direct 3D object editing without 2D prior knowledge have prompted increased interest in utilizing 2D generative models for 3D editing. While existing methods like Instruct-NeRF2NeRF offer a solution, they often lack user-friendliness, particularly because they rely on semantics-guided editing. In the realm of 3D representation, 3D Gaussian Splatting emerges as a promising approach thanks to its efficiency and naturally explicit structure, which facilitates precise editing. Building on these insights, we propose DragGaussian, a 3D object drag-editing framework based on 3D Gaussian Splatting that leverages diffusion models for interactive image editing with open-vocabulary input. This framework enables users to perform drag-based editing on pre-trained 3D Gaussian object models, producing modified 2D images through multi-view consistent editing. Our contributions include the introduction of a new task, the development of DragGaussian for interactive point-based 3D editing, and comprehensive validation of its effectiveness through qualitative and quantitative experiments.
https://arxiv.org/abs/2405.05800
Mapping is crucial for spatial reasoning, planning and robot navigation. Existing approaches range from metric, which require precise geometry-based optimization, to purely topological, where image-as-node based graphs lack explicit object-level reasoning and interconnectivity. In this paper, we propose a novel topological representation of an environment based on "image segments", which are semantically meaningful and open-vocabulary queryable, conferring several advantages over previous works based on pixel-level features. Unlike 3D scene graphs, we create a purely topological graph with segments as nodes, where edges are formed by a) associating segment-level descriptors between pairs of consecutive images and b) connecting neighboring segments within an image using their pixel centroids. This unveils a "continuous sense of a place", defined by inter-image persistence of segments along with their intra-image neighbours. It further enables us to represent and update segment-level descriptors through neighborhood aggregation using graph convolution layers, which improves robot localization based on segment-level retrieval. Using real-world data, we show how our proposed map representation can be used to i) generate navigation plans in the form of "hops over segments" and ii) search for target objects using natural language queries describing spatial relations of objects. Furthermore, we quantitatively analyze data association at the segment level, which underpins inter-image connectivity during mapping and segment-level localization when revisiting the same place. Finally, we show preliminary trials on segment-level `hopping' based zero-shot real-world navigation. Project page with supplementary details: this http URL
https://arxiv.org/abs/2405.05792
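The two edge-construction rules above (a: inter-image descriptor association, b: intra-image centroid adjacency) can be sketched in a few lines. The dictionary-based segment format, similarity threshold, and neighbour count below are illustrative assumptions, not the paper's actual data structures:

```python
import math

def build_segment_graph(frames, sim_thresh=0.9, k_intra=2):
    """Build a topological graph whose nodes are image segments.

    frames: list of lists; frames[t] holds the segments of image t, each a
    dict with a 'centroid' (x, y) and a unit-norm 'descriptor' vector.
    Edges (undirected, stored as frozensets of node ids) come from
      a) descriptor association between consecutive images, and
      b) linking each segment to its k nearest neighbours within an image.
    """
    nodes, edges = {}, set()
    for t, f in enumerate(frames):
        for i, seg in enumerate(f):
            nodes[(t, i)] = seg

    def cos(u, v):  # cosine similarity of two unit-norm descriptors
        return sum(a * b for a, b in zip(u, v))

    def dist(p, q):  # pixel distance between centroids
        return math.hypot(p[0] - q[0], p[1] - q[1])

    # a) inter-image edges: best descriptor match above a threshold
    for t in range(len(frames) - 1):
        for i, a in enumerate(frames[t]):
            best = max(range(len(frames[t + 1])),
                       key=lambda j: cos(a['descriptor'],
                                         frames[t + 1][j]['descriptor']))
            if cos(a['descriptor'],
                   frames[t + 1][best]['descriptor']) >= sim_thresh:
                edges.add(frozenset({(t, i), (t + 1, best)}))

    # b) intra-image edges: k nearest segments by pixel centroid
    for t, f in enumerate(frames):
        for i, a in enumerate(f):
            others = sorted((j for j in range(len(f)) if j != i),
                            key=lambda j: dist(a['centroid'], f[j]['centroid']))
            for j in others[:k_intra]:
                edges.add(frozenset({(t, i), (t, j)}))
    return nodes, edges
```

Segments that persist across consecutive frames become inter-image edges, which is what gives the "continuous sense of a place" described above.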
Artificial Intelligence Generative Content (AIGC) technologies have significantly influenced the remote sensing domain, particularly in the realm of image generation. However, remote sensing image editing, an equally vital research area, has not garnered sufficient attention. Unlike text-guided editing of natural images, which relies on extensive text-image paired data for semantic correlation, the application scenarios of remote sensing image editing are often extreme, such as a forest on fire, so it is difficult to obtain sufficient paired samples. At the same time, the lack of remote sensing semantics and the ambiguity of text also restrict the further application of image editing in the remote sensing field. To solve the above problems, this letter proposes a diffusion-based method for stable and controllable remote sensing image editing with text guidance. Our method avoids the use of a large number of paired images and can achieve good editing results using only a single image. A quantitative evaluation system including the CLIP score and subjective evaluation metrics shows that our method has a better editing effect on remote sensing images than existing image editing models.
https://arxiv.org/abs/2405.05769
Audio-driven talking head generation is advancing from 2D to 3D content. Notably, the Neural Radiance Field (NeRF) is in the spotlight as a means to synthesize high-quality 3D talking head outputs. Unfortunately, NeRF-based approaches typically require a large amount of paired audio-visual data for each identity, limiting their scalability. Although there have been attempts to generate audio-driven 3D talking head animations from a single image, the results are often unsatisfactory due to insufficient information about obscured regions in the image. In this paper, we focus on the overlooked aspect of 3D consistency in the one-shot, audio-driven domain, where facial animations are synthesized primarily in front-facing perspectives. We propose a novel method, NeRFFaceSpeech, which produces high-quality, 3D-aware talking heads. Using prior knowledge of generative models combined with NeRF, our method can craft a 3D-consistent facial feature space corresponding to a single image. Our spatial synchronization method employs audio-correlated vertex dynamics of a parametric face model to transform static image features into dynamic visuals through ray deformation, ensuring realistic 3D facial motion. Moreover, we introduce LipaintNet, which replenishes the missing information in the inner-mouth area that cannot be obtained from a given single image. The network is trained in a self-supervised manner by exploiting the generative capabilities without additional data. Comprehensive experiments demonstrate the superiority of our method in generating audio-driven talking heads from a single image with enhanced 3D consistency compared to previous approaches. In addition, we introduce, for the first time, a quantitative way of measuring the robustness of a model against pose changes, which has previously been possible only qualitatively.
https://arxiv.org/abs/2405.05749
Flamenco, recognized by UNESCO as part of the Intangible Cultural Heritage of Humanity, is a profound expression of cultural identity rooted in Andalusia, Spain. However, quantitative studies that help identify characteristic patterns in this long-lived music tradition are scarce. In this work, we present a computational analysis of Flamenco lyrics, employing natural language processing and machine learning to categorize over 2000 lyrics into their respective Flamenco genres, termed $\textit{palos}$. Using a Multinomial Naive Bayes classifier, we find that lexical variation across styles is enough to accurately identify distinct $\textit{palos}$. More importantly, from an automatic analysis of word usage, we obtain the semantic fields that characterize each style. Further, applying a metric that quantifies inter-genre distance, we perform a network analysis that sheds light on the relationships between Flamenco styles. Remarkably, our results suggest historical connections and $\textit{palo}$ evolutions. Overall, our work illuminates the intricate relationships and cultural significance embedded within Flamenco lyrics, complementing previous qualitative discussions with quantitative analyses and sparking new discussions on the origin and development of traditional music genres.
https://arxiv.org/abs/2405.05723
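A Multinomial Naive Bayes classifier of the kind used above fits in a few lines once lyrics are tokenized. The toy tokens and genre names below are illustrative only, not drawn from the corpus:

```python
import math
from collections import Counter

def train_mnb(docs, labels, alpha=1.0):
    """Fit a Multinomial Naive Bayes model on tokenized lyrics.

    docs: list of token lists; labels: parallel list of genre names.
    alpha is the Laplace smoothing constant.
    Returns (log_priors, log_likelihoods, vocabulary).
    """
    vocab = sorted({w for d in docs for w in d})
    classes = sorted(set(labels))
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    log_like = {}
    for c in classes:
        counts = Counter(w for d, y in zip(docs, labels) if y == c for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like, vocab

def predict_mnb(model, doc):
    """Return the genre maximizing log P(class) + sum log P(word | class)."""
    log_prior, log_like, vocab = model
    scores = {c: log_prior[c] + sum(log_like[c][w] for w in doc if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)
```

Because the decision depends only on per-class word log-probabilities, inspecting the highest-weight words per class recovers the kind of genre-characteristic vocabulary the study reports.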
Single-Photon Emission Computed Tomography (SPECT) left ventricular assessment protocols are important for detecting ischemia in high-risk patients. To quantitatively measure myocardial function, clinicians depend on commercially available solutions to segment and reorient the left ventricle (LV) for evaluation. Based on large normal datasets, the segmentation performance and the high cost of these solutions can hinder the availability of reliable and precise localization of the LV delineation. To overcome the aforementioned shortcomings, this paper aims to give a recipe for diagnostic centers as well as for clinics to automatically segment the myocardium based on small and low-quality labels on reconstructed SPECT, complete field-of-view (FOV) volumes. A combination of Continuous Max-Flow (CMF) with prior shape information is developed to augment the 3D U-Net self-supervised learning (SSL) approach on various geometries of SPECT apparatus. Experimental results on the acquired dataset have shown a 5-10\% increase in quantitative metrics based on the previous State-of-the-Art (SOTA) solutions, suggesting a good plausible way to tackle the few-shot SSL problem on high-noise SPECT cardiac datasets.
https://arxiv.org/abs/2405.05520
Online discussion forums provide crucial data for understanding the concerns of a wide range of real-world communities. However, the qualitative and quantitative methods typically used to analyze these data, such as thematic analysis and topic modeling, either are infeasible to scale or require significant human effort to translate outputs into human-readable forms. This study introduces QuaLLM, a novel LLM-based framework to analyze and extract quantitative insights from text data on online forums. The framework consists of a novel prompting methodology and evaluation strategy. We applied this framework to analyze over one million comments from two of Reddit's rideshare worker communities, marking the largest study of its type. We uncover significant worker concerns regarding AI and algorithmic platform decisions, responding to regulatory calls for worker insights. In short, our work sets a new precedent for AI-assisted quantitative data analysis to surface concerns from online forums.
https://arxiv.org/abs/2405.05345
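The overall pattern (prompt an LLM once per comment, then aggregate labels into quantitative prevalence figures) can be sketched minimally. The prompt wording, label set, and `llm` callable are placeholders, not QuaLLM's actual prompting methodology:

```python
from collections import Counter

def extract_concerns(comments, llm):
    """Label each forum comment with an LLM and aggregate the labels into
    prevalence fractions. `llm` is any callable mapping a prompt string to
    a label string; it stands in here for a real API client."""
    prompt_template = ("Label the main worker concern in this comment "
                       "(one of: pay, safety, algorithm, other): {}")
    labels = [llm(prompt_template.format(c)) for c in comments]
    counts = Counter(labels)
    return {label: n / len(comments) for label, n in counts.items()}
```

In practice the aggregation step is what turns free-text forum data into the quantitative insights the framework targets; the per-comment labeling is embarrassingly parallel.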
Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations responding to very specific question formats, typically a multiple-choice response regarding a particular object or attribute, which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models which are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations; rather, the two forms of hallucination are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that improvements in existing metrics do not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.
https://arxiv.org/abs/2405.05256
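The abstract does not spell out THRONE's exact metrics, but an object-based evaluation of free-form responses ultimately reduces to comparing the objects mentioned in a response against ground-truth annotations. A simplified stand-in (the object-extraction step, done by an LM in the paper, is assumed to have already happened):

```python
def object_precision_recall(response_objects, gt_objects):
    """Compare the objects extracted from a free-form LVLM response
    against the ground-truth objects present in the image.

    Precision penalizes hallucinated (Type I) mentions; recall measures
    coverage of what is actually there. Both default to 1.0 when the
    corresponding set is empty, so an empty response is not penalized
    for precision.
    """
    resp, gt = set(response_objects), set(gt_objects)
    tp = len(resp & gt)
    precision = tp / len(resp) if resp else 1.0
    recall = tp / len(gt) if gt else 1.0
    return precision, recall
```

The anti-correlation observed in the paper is visible through this lens: a model tuned to answer object-specific multiple-choice probes (Type II) can still freely mention absent objects in open-ended captions, tanking precision here.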
Diffusion models are a powerful generative framework, but come with expensive inference. Existing acceleration methods often compromise image quality or fail under complex conditioning when operating in an extremely low-step regime. In this work, we propose a novel distillation framework tailored to enable high-fidelity, diverse sample generation using just one to three steps. Our approach comprises three key components: (i) Backward Distillation, which mitigates training-inference discrepancies by calibrating the student on its own backward trajectory; (ii) Shifted Reconstruction Loss that dynamically adapts knowledge transfer based on the current time step; and (iii) Noise Correction, an inference-time technique that enhances sample quality by addressing singularities in noise prediction. Through extensive experiments, we demonstrate that our method outperforms existing competitors in quantitative metrics and human evaluations. Remarkably, it achieves performance comparable to the teacher model using only three denoising steps, enabling efficient high-quality generation.
https://arxiv.org/abs/2405.05224
Across a number of sign languages, temporal and spatial characteristics of dominant hand articulation are used to express semantic and grammatical features. In this study of Austrian Sign Language (Österreichische Gebärdensprache, or ÖGS), motion capture data of four Deaf signers is used to quantitatively characterize the kinematic parameters of sign production in verbs and adjectives. We investigate (1) the difference in production between verbs involving a natural endpoint (telic verbs; e.g. arrive) and verbs lacking an endpoint (atelic verbs; e.g. analyze), and (2) adjective signs in intensified vs. non-intensified (plain) forms. Motion capture data analysis using linear-mixed effects models (LME) indicates that both the endpoint marking in verbs, as well as marking of intensification in adjectives, are expressed by movement modulation in ÖGS. While the semantic distinction between verb types (telic/atelic) is marked by higher peak velocity and shorter duration for telic signs compared to atelic ones, the grammatical distinction (intensification) in adjectives is expressed by longer duration for intensified compared to non-intensified adjectives. The observed individual differences of signers might be interpreted as personal signing style.
https://arxiv.org/abs/2405.05161
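The kinematic parameters compared in the study (peak velocity and duration of a sign) follow directly from sampled hand positions. A minimal sketch with an assumed frame rate and trajectory format, not the authors' motion-capture pipeline:

```python
import numpy as np

def sign_kinematics(positions, fps=120.0):
    """Peak velocity and duration of one sign from dominant-hand positions.

    positions: (T, 3) array-like of 3D coordinates sampled at `fps` frames
    per second. Velocity is the frame-to-frame speed; duration spans the
    whole trajectory.
    """
    pos = np.asarray(positions, dtype=float)
    vel = np.linalg.norm(np.diff(pos, axis=0), axis=1) * fps  # units/s
    return float(vel.max()), (len(pos) - 1) / fps
```

Feeding such per-sign measurements into a linear mixed-effects model (with signer as a random effect, as in the study) is what separates the telic/atelic and intensification effects from individual signing style.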
This paper presents an analysis of the properties of two hybrid discretization methods for Gaussian derivatives, based on convolution with either the normalized sampled Gaussian kernel or the integrated Gaussian kernel, followed by central differences. The motivation for studying these discretization methods is that, when multiple spatial derivatives of different order are needed at the same scale level, they can be computed significantly more efficiently than with more direct derivative approximations based on explicit convolutions with either sampled Gaussian kernels or integrated Gaussian kernels. While these computational benefits also hold for the genuinely discrete approach of computing discrete analogues of Gaussian derivatives, based on convolution with the discrete analogue of the Gaussian kernel followed by central differences, the underlying mathematical primitives for the discrete analogue of the Gaussian kernel, expressed in terms of modified Bessel functions of integer order, may not be available in certain frameworks for image processing, such as when performing deep learning with scale-parameterized filters based on Gaussian derivatives, where the scale levels are learned. In this paper, we characterize the properties of these hybrid discretization methods in terms of quantitative performance measures: the amount of spatial smoothing that they imply, and the relative consistency of scale estimates obtained from scale-invariant feature detectors with automatic scale selection. We place particular emphasis on the behaviour for very small values of the scale parameter, where the results may differ significantly both from the fully continuous scale-space theory and between the different types of discretization methods.
https://arxiv.org/abs/2405.05095
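The hybrid scheme described above (smooth once with a normalized sampled Gaussian kernel, then obtain every derivative order by central differences) can be sketched directly; the shared smoothing convolution is the source of the efficiency argument. The truncation radius and test signal below are illustrative choices, not from the paper:

```python
import numpy as np

def sampled_gaussian(sigma, radius=None):
    """Normalized sampled Gaussian kernel (one of the two hybrid variants;
    the other would integrate the Gaussian over each pixel support)."""
    if radius is None:
        radius = max(1, int(np.ceil(4 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=float)
    g = np.exp(-x**2 / (2 * sigma**2))
    return g / g.sum()  # normalize so the kernel sums to one

def gaussian_derivatives(signal, sigma):
    """Smooth once, then get 1st and 2nd derivatives by central differences.
    One smoothing convolution is shared by all derivative orders, instead
    of one convolution per Gaussian-derivative kernel."""
    smoothed = np.convolve(signal, sampled_gaussian(sigma), mode='same')
    d1 = np.gradient(smoothed)   # central differences in the interior
    d2 = np.gradient(d1)
    return smoothed, d1, d2
```

On a linear ramp the interior first derivative is exactly 1 and the second exactly 0, since a symmetric normalized kernel preserves linear signals away from the boundaries; the interesting deviations the paper studies appear for very small sigma.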
Wound healing is a complex process involving changes in collagen fibers. Accurate monitoring of these changes is crucial for assessing the progress of wound healing and has significant implications for guiding clinical treatment strategies and drug screening. However, traditional quantitative analysis methods focus on spatial characteristics such as collagen fiber alignment and variance, lacking threshold standards to differentiate between different stages of wound healing. To address this issue, we propose an innovative approach based on deep learning to predict the progression of wound healing by analyzing collagen fiber features in histological images of wound tissue. Leveraging the unique learning capabilities of deep learning models, our approach captures the feature variations of collagen fibers in histological images from different categories and classifies them into various stages of wound healing. To overcome the limited availability of histological image data, we employ a transfer learning strategy. Specifically, we fine-tune a VGG16 model pretrained on the ImageNet dataset to adapt it to the classification task of histological images of wounds. Through this process, our model achieves 82% accuracy in classifying six stages of wound healing. Furthermore, to enhance the interpretability of the model, we employ a class activation mapping technique called LayerCAM. LayerCAM reveals the image regions on which the model relies when making predictions, providing transparency to the model's decision-making process. This visualization not only helps us understand how the model identifies and evaluates collagen fiber features but also enhances trust in the model's prediction results. To the best of our knowledge, our proposed model is the first deep learning-based classification model used for predicting wound healing stages.
https://arxiv.org/abs/2405.05297
Few-shot class-incremental learning (FSCIL) aims to acquire knowledge from novel classes with limited samples while retaining information about base classes. Existing methods address catastrophic forgetting and overfitting by freezing the feature extractor during novel-class learning. However, these methods often cause confusion between base and novel classes, i.e., classifying novel-class samples into base classes. In this paper, we delve into this phenomenon to study its cause and solution. We first interpret the confusion as a collision between the novel-class and base-class regions in the feature space. Then, we find that the collision is caused by label-irrelevant redundancies within the base-class feature and pixel space. Through qualitative and quantitative experiments, we identify this redundancy as a shortcut in base-class training that can be decoupled to alleviate the collision. Based on this analysis, we propose a method for FSCIL named Redundancy Decoupling and Integration (RDI). RDI first decouples redundancies from the base-class space to shrink the intra-base-class feature space. Then, it integrates the redundancies as a dummy class to enlarge the inter-base-class feature space. This process effectively compresses the base-class feature space, creating buffer space for novel classes and alleviating the model's confusion between base and novel classes. Extensive experiments across benchmark datasets, including CIFAR-100, miniImageNet, and CUB-200-2011, demonstrate that our method achieves state-of-the-art performance.
https://arxiv.org/abs/2405.04918
We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and quantifiable properties pertaining to them, EQA with situational queries (such as "Is the bathroom clean and dry?") is more challenging, as the agent needs to figure out not just which target objects pertain to the query, but also requires a consensus on their states for the query to be answerable. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM's output to create a dataset of unique situational queries, corresponding consensus object information, and predicted answers. PGE maintains uniqueness among the generated queries using multiple forms of semantic similarity. We validate the generated dataset via a large-scale user study conducted on M-Turk, and introduce it as S-EQA, the first dataset tackling EQA with situational queries. Our user study establishes the authenticity of S-EQA, with a high 97.26% of the generated queries deemed answerable given the consensus object data. Conversely, we observe a low correlation of 46.2% between the LLM-predicted answers and human-evaluated ones, indicating the LLM's poor capability in directly answering situational queries, while establishing S-EQA's usability in providing a human-validated consensus for an indirect solution. We evaluate S-EQA via Visual Question Answering (VQA) on VirtualHome, which, unlike other simulators, contains several objects with modifiable states that also appear visually different upon modification, enabling us to set a quantitative benchmark for S-EQA. To the best of our knowledge, this is the first work to introduce EQA with situational queries, and also the first to use a generative approach for query creation.
https://arxiv.org/abs/2405.04732
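Maintaining uniqueness among generated queries, as PGE does, can be sketched as a greedy similarity filter. The paper uses multiple forms of semantic similarity; plain token-set Jaccard overlap stands in for them in this sketch, and the threshold is an arbitrary choice:

```python
def dedupe_queries(queries, jaccard_max=0.6):
    """Greedily filter a stream of generated queries, keeping only those
    not too similar to any already-kept query. Token-set Jaccard overlap
    is a crude stand-in for the semantic-similarity measures in the paper."""
    def tokens(q):
        # lowercase and strip punctuation before tokenizing
        cleaned = ''.join(c if c.isalnum() or c.isspace() else ' '
                          for c in q.lower())
        return set(cleaned.split())

    def jaccard(a, b):
        sa, sb = tokens(a), tokens(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    kept = []
    for q in queries:
        if all(jaccard(q, k) < jaccard_max for k in kept):
            kept.append(q)
    return kept
```

A reordered paraphrase of an earlier query has an identical token set (Jaccard 1.0) and is dropped, while a query about a different room or object passes the filter.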
I explored adapting Stable Diffusion v1.5 for generating domain-specific satellite and aerial images in remote sensing. Recognizing the limitations of existing models like Midjourney and Stable Diffusion, trained primarily on natural RGB images and lacking context for remote sensing, I used the RSICD dataset to train a Stable Diffusion model with a loss of 0.2. I incorporated descriptive captions from the dataset for text-conditioning. Additionally, I created a synthetic dataset for a Land Use Land Classification (LULC) task, employing prompting techniques with RAG and ChatGPT and fine-tuning a specialized remote sensing LLM. However, I faced challenges with prompt quality and model performance. I trained a classification model (ResNet18) on the synthetic dataset achieving 49.48% test accuracy in TorchGeo to create a baseline. Quantitative evaluation through FID scores and qualitative feedback from domain experts assessed the realism and quality of the generated images and dataset. Despite extensive fine-tuning and dataset iterations, results indicated subpar image quality and realism, as indicated by high FID scores and domain-expert evaluation. These findings call attention to the potential of diffusion models in remote sensing while highlighting significant challenges related to insufficient pretraining data and computational resources.
https://arxiv.org/abs/2405.04717
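The FID evaluation mentioned above follows the standard formula $\mathrm{FID} = \lVert \mu_r - \mu_g \rVert^2 + \mathrm{Tr}\!\big(C_r + C_g - 2(C_r C_g)^{1/2}\big)$ over feature vectors of real vs. generated images. A minimal sketch, with the Inception feature extraction omitted and array shapes assumed:

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_gen):
    """Frechet Inception Distance between two sets of feature vectors,
    e.g. Inception activations of real vs. generated satellite images.
    Each input is an (N, D) array; models each set as a Gaussian."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):   # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2 * covmean))
```

Identical feature distributions give a score near zero, and the score grows as the generated distribution drifts from the real one, which is why a high FID here signals unrealistic generations.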
Existing diffusion-based video editing methods have achieved impressive results in motion editing. Most existing methods focus on motion alignment between the edited video and the reference video. However, these methods do not constrain the background and object content of the video to remain unchanged, so users may end up with unexpected videos. In this paper, we propose a one-shot video motion editing method called Edit-Your-Motion that requires only a single text-video pair for training. Specifically, we design a Detailed Prompt-Guided Learning Strategy (DPL) to decouple spatio-temporal features in space-time diffusion models. DPL separates the learning of object content and motion into two training stages. In the first training stage, we focus on learning the spatial features (the features of object content) and break down the temporal relationships among the video frames by shuffling them. We further propose Recurrent-Causal Attention (RC-Attn) to learn consistent content features of the object from unordered video frames. In the second training stage, we restore the temporal relationships among the video frames to learn the temporal features (the motion of the background and the object). We also adopt a Noise Constraint Loss to smooth out inter-frame differences. Finally, in the inference stage, we inject the content features of the source object into the editing branch through a two-branch structure (an editing branch and a reconstruction branch). With Edit-Your-Motion, users can edit the motion of objects in the source video to generate more exciting and diverse videos. Comprehensive qualitative experiments, quantitative experiments, and user preference studies demonstrate that Edit-Your-Motion performs better than other methods.
https://arxiv.org/abs/2405.04496
Neural Radiance Fields (NeRF) achieve extremely high quality in object-scale and indoor scene reconstruction; however, reconstructing large-scale scenes remains challenging. MLP-based NeRFs suffer from limited network capacity, while volume-based NeRFs become heavily memory-consuming as scene resolution increases. Recent approaches geographically partition the scene and learn each sub-region with an individual NeRF. Such partitioning strategies help volume-based NeRFs exceed the single-GPU memory limit and scale to larger scenes. However, this approach requires multiple background NeRFs to handle out-of-partition rays, which leads to redundant learning. Inspired by the fact that the background of the current partition is the foreground of an adjacent partition, we propose a scalable scene reconstruction method based on joint Multi-resolution Hash Grids, named DistGrid. In this method, the scene is divided into multiple closely packed yet non-overlapping Axis-Aligned Bounding Boxes, and a novel segmented volume rendering method is proposed to handle cross-boundary rays, thereby eliminating the need for background NeRFs. Experiments demonstrate that our method outperforms existing methods on all evaluated large-scale scenes and produces visually plausible reconstructions. The scalability of our method with respect to reconstruction quality is further evaluated qualitatively and quantitatively.
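The key idea behind segmented volume rendering can be illustrated with a minimal 1-D sketch (ours, not the paper's code): when a ray crosses several non-overlapping partitions, the accumulated transmittance is simply carried across the boundary into the next partition's samples, so no separate background model is needed. Each "segment" below is a hypothetical tuple `(t_near, t_far, density, color)` describing one partition the ray traverses.

```python
import math

def render_ray_segmented(segments, step=0.1):
    """Composite color along a ray that crosses consecutive partitions.
    Standard emission-absorption volume rendering; transmittance carries
    over from one segment to the next."""
    color_acc, transmittance = 0.0, 1.0
    for t0, t1, sigma, color in segments:
        t = t0
        while t < t1:
            dt = min(step, t1 - t)
            alpha = 1.0 - math.exp(-sigma * dt)   # opacity of this sample
            color_acc += transmittance * alpha * color
            transmittance *= 1.0 - alpha          # carried into next segment
            t += dt
    return color_acc, transmittance
```

With an opaque first partition, later partitions contribute almost nothing, exactly the behavior that lets each partition act as the "background" of its neighbor.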
https://arxiv.org/abs/2405.04416
3D occupancy, an advanced perception technology for driving scenarios, represents the entire scene without distinguishing between foreground and background by quantizing the physical space into a grid map. The widely adopted projection-first deformable attention, while efficient in transforming image features into 3D representations, encounters challenges in aggregating multi-view features due to sensor deployment constraints. To address this issue, we propose a learning-first view attention mechanism for effective multi-view feature aggregation. Moreover, we showcase the scalability of our view attention across diverse multi-view 3D tasks, such as map construction and 3D object detection. Leveraging the proposed view attention as well as an additional multi-frame streaming temporal attention, we introduce ViewFormer, a vision-centric transformer-based framework for spatiotemporal feature aggregation. To further explore occupancy-level flow representation, we present FlowOcc3D, a benchmark built on top of existing high-quality datasets. Qualitative and quantitative analyses on this benchmark reveal its potential to represent fine-grained dynamic scenes. Extensive experiments show that our approach significantly outperforms prior state-of-the-art methods. The codes and benchmark will be released soon.
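The contrast the abstract draws is between projecting a 3D query into each image and sampling only at that location (projection-first), versus letting each query attend over features from all views with learned scores (learning-first). The following dependency-free sketch of the latter idea is our own illustration under that reading; the function names and the per-view score bias are assumptions, not the paper's API.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def view_attention(query, view_feats, view_bias):
    """A query attends to features from *all* views via learned scores
    (content similarity plus a learned per-view bias), instead of sampling
    only at fixed projected locations."""
    scores = [dot(query, f) + b for f, b in zip(view_feats, view_bias)]
    attn = softmax(scores)
    dim = len(query)
    return [sum(a * f[i] for a, f in zip(attn, view_feats)) for i in range(dim)]
```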
https://arxiv.org/abs/2405.04299
Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance in various tasks. However, they are not without their own set of challenges, including issues such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning. Evaluating LLMs on miscellaneous reasoning tasks remains an active area of research. Prior to the breakthrough of LLMs, Transformers had already proven successful in the medical domain, effectively employed for various natural language understanding (NLU) tasks. Following this trend, LLMs have also been trained and utilized in the medical domain, raising concerns regarding factual accuracy, adherence to safety protocols, and inherent limitations. In this paper, we focus on evaluating the natural language inference capabilities of popular open-source and closed-source LLMs using clinical trial reports as the dataset. We present the performance results of each LLM and further analyze their performance on a development set, particularly focusing on challenging instances that involve medical abbreviations and require numerical-quantitative reasoning. Gemini, the best-performing LLM in our evaluation, achieved a test set F1-score of 0.748, securing the ninth position on the task scoreboard. Our work is the first of its kind, offering a thorough examination of the inference capabilities of LLMs within the medical domain.
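For concreteness, the F1-score used to rank the models can be computed as below. This is a generic binary F1 over entailment-style labels, a sketch of the kind of metric reported; the shared task's official scoring script may differ in details (label names, averaging).

```python
def f1_score(gold, pred, positive="Entailment"):
    """Binary F1 with respect to one positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```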
https://arxiv.org/abs/2405.04170
In this paper, we investigate the open research task of cross-modal retrieval between 3D shapes and textual descriptions. Previous approaches mainly rely on point cloud encoders for feature extraction, which may ignore key inherent properties of 3D shapes, including depth, spatial hierarchy, and geometric continuity. To address this issue, we propose COM3D, making the first attempt to exploit cross-view correspondence and cross-modal mining to enhance retrieval performance. Notably, we augment the 3D features through a scene representation transformer to generate cross-view correspondence features of 3D shapes, which enrich the inherent features and enhance their compatibility with text matching. Furthermore, we optimize the cross-modal matching process with semi-hard negative example mining to improve learning efficiency. Extensive quantitative and qualitative experiments demonstrate the superiority of our proposed COM3D, which achieves state-of-the-art results on the Text2Shape dataset.
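Semi-hard negative mining, as referenced above, is the standard selection rule: among negatives that are *less* similar to the anchor than the positive is, pick the hardest (most similar) one. The sketch below is our illustration of that standard rule over precomputed similarity scores; the paper's exact criterion may differ.

```python
def semi_hard_negative(anchor_pos_sim, neg_sims):
    """Return the index of the semi-hard negative: similarity below the
    anchor-positive similarity, but maximal among those candidates.
    Returns None if no negative qualifies."""
    candidates = [(s, i) for i, s in enumerate(neg_sims) if s < anchor_pos_sim]
    if not candidates:
        return None
    return max(candidates)[1]     # hardest qualifying negative
```

Training on such negatives avoids both trivially easy pairs (no gradient signal) and "too hard" negatives that are more similar than the positive, which tend to destabilize metric learning early on.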
https://arxiv.org/abs/2405.04103