We consider the problem of physically-based inverse rendering using 3D Gaussian Splatting (3DGS) representations. While recent 3DGS methods have achieved remarkable results in novel view synthesis (NVS), accurately capturing high-fidelity geometry and physically interpretable materials and lighting remains challenging, as it requires precise geometry modeling to provide accurate surface normals, along with physically-based rendering (PBR) techniques to ensure correct material and lighting disentanglement. Previous 3DGS methods resort to approximating surface normals, but often struggle with noisy local geometry, leading to inaccurate normal estimation and suboptimal material-lighting decomposition. In this paper, we introduce GeoSplatting, a novel hybrid representation that augments 3DGS with explicit geometric guidance and differentiable PBR equations. Specifically, we bridge isosurfaces and 3DGS: we first extract an isosurface mesh from a scalar field, then convert it into 3DGS points and formulate PBR equations for them in a fully differentiable manner. In GeoSplatting, 3DGS is grounded on the mesh geometry, enabling precise surface normal modeling, which facilitates the use of PBR frameworks for material decomposition. This approach maintains the efficiency and quality of NVS from 3DGS while ensuring accurate geometry from the isosurface. Comprehensive evaluations across diverse datasets demonstrate the superiority of GeoSplatting, consistently outperforming existing methods both quantitatively and qualitatively.
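As a rough illustration of the mesh-to-Gaussians idea (not GeoSplatting's differentiable isosurface extraction or full PBR model), the sketch below extracts an isosurface with marching cubes from a toy scalar field, anchors one Gaussian per face with the face normal, and shades each Gaussian with a single diffuse light; the grid resolution, albedo, and light direction are arbitrary assumptions.

```python
import numpy as np
from skimage import measure

# Toy scalar field: a sphere SDF sampled on a 64^3 grid (stand-in for the learned field).
grid = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.6

# Step 1: extract an isosurface mesh from the scalar field (plain marching cubes here;
# GeoSplatting uses a differentiable isosurface extraction, which this does not model).
verts, faces, _, _ = measure.marching_cubes(sdf, level=0.0)

# Step 2: convert the mesh into Gaussian points anchored on faces:
# centers at face barycenters, normals taken from the face geometry.
tri = verts[faces]                                   # (F, 3, 3)
centers = tri.mean(axis=1)                           # one Gaussian center per face
n = np.cross(tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0])
normals = n / (np.linalg.norm(n, axis=1, keepdims=True) + 1e-12)

# Step 3: a minimal diffuse PBR-style term per Gaussian from one directional light.
light_dir = np.array([0.0, 0.0, 1.0])
albedo = np.full((centers.shape[0], 3), 0.7)         # hypothetical constant material
radiance = albedo * np.clip(normals @ light_dir, 0.0, None)[:, None]
print(centers.shape, radiance.shape)
```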
https://arxiv.org/abs/2410.24204
Detecting Human-Object Interactions (HOI) in zero-shot settings, where models must handle unseen classes, poses significant challenges. Existing methods that align visual encoders with large Vision-Language Models (VLMs) to tap into the VLMs' extensive knowledge require large, computationally expensive models and encounter training difficulties. Adapting VLMs with prompt learning offers an alternative to direct alignment; however, fine-tuning on task-specific datasets often leads to overfitting to seen classes and suboptimal performance on unseen classes, due to the absence of unseen-class labels. To address these challenges, we introduce a novel prompt-learning-based framework for Efficient Zero-Shot HOI detection (EZ-HOI). First, we introduce Large Language Model (LLM) and VLM guidance for learnable prompts, integrating detailed HOI descriptions and visual semantics to adapt VLMs to HOI tasks. However, because training datasets contain seen-class labels alone, fine-tuning VLMs on such datasets tends to optimize learnable prompts for seen classes rather than unseen ones. Therefore, we design prompt learning for unseen classes using information from related seen classes, with LLMs utilized to highlight the differences between unseen and related seen classes. Quantitative evaluations on benchmark datasets demonstrate that our EZ-HOI achieves state-of-the-art performance across various zero-shot settings with only 10.35% to 33.95% of the trainable parameters of existing methods. Code is available at this https URL.
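For intuition about prompt learning on a frozen VLM text encoder, here is a generic CoOp-style sketch rather than EZ-HOI's actual modules; the context length, embedding width, class count, and the random stand-in class-name embeddings are all hypothetical.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """CoOp-style learnable context for a frozen VLM text encoder (generic sketch,
    not EZ-HOI's modules): n_ctx trainable tokens are prepended to each class-name
    embedding before the prompt is fed to the text encoder."""
    def __init__(self, n_ctx=8, dim=512, n_classes=24):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)              # shared context
        self.name_emb = nn.Parameter(torch.randn(n_classes, 4, dim) * 0.02)  # stand-in name tokens

    def forward(self):
        ctx = self.ctx.unsqueeze(0).expand(self.name_emb.size(0), -1, -1)
        return torch.cat([ctx, self.name_emb], dim=1)   # (n_classes, n_ctx + 4, dim)

prompts = LearnablePrompt()()
print(prompts.shape)   # torch.Size([24, 12, 512])
```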
https://arxiv.org/abs/2410.23904
Remote sensing change detection aims to perceive changes occurring on the Earth's surface from remote sensing data in different periods, and feed these changes back to humans. However, most existing methods only focus on detecting change regions, lacking the ability to interact with users to identify changes that the users expect. In this paper, we introduce a new task named Change Detection Question Answering and Grounding (CDQAG), which extends the traditional change detection task by providing interpretable textual answers and intuitive visual evidence. To this end, we construct the first CDQAG benchmark dataset, termed QAG-360K, comprising over 360K triplets of questions, textual answers, and corresponding high-quality visual masks. It encompasses 10 essential land-cover categories and 8 comprehensive question types, which provides a large-scale and diverse dataset for remote sensing applications. Based on this, we present VisTA, a simple yet effective baseline method that unifies the tasks of question answering and grounding by delivering both visual and textual answers. Our method achieves state-of-the-art results on both the classic CDVQA and the proposed CDQAG datasets. Extensive qualitative and quantitative experimental results provide useful insights for the development of better CDQAG models, and we hope that our work can inspire further research in this important yet underexplored direction. The proposed benchmark dataset and method are available at this https URL.
https://arxiv.org/abs/2410.23828
Ontologies are useful for automatic machine processing of domain knowledge as they represent it in a structured format. Yet, constructing ontologies requires substantial manual effort. To automate part of this process, large language models (LLMs) have been applied to solve various subtasks of ontology learning. However, this partial ontology learning does not capture the interactions between subtasks. We address this gap by introducing OLLM, a general and scalable method for building the taxonomic backbone of an ontology from scratch. Rather than focusing on subtasks, like individual relations between entities, we model entire subcomponents of the target ontology by finetuning an LLM with a custom regulariser that reduces overfitting on high-frequency concepts. We introduce a novel suite of metrics for evaluating the quality of the generated ontology by measuring its semantic and structural similarity to the ground truth. In contrast to standard metrics, our metrics use deep learning techniques to define more robust distance measures between graphs. Both our quantitative and qualitative results on Wikipedia show that OLLM outperforms subtask composition methods, producing more semantically accurate ontologies while maintaining structural integrity. We further demonstrate that our model can be effectively adapted to new domains, like arXiv, needing only a small number of training examples. Our source code and datasets are available at this https URL.
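As a hedged illustration of embedding-based (rather than exact-match) graph comparison, the snippet below scores how well predicted taxonomy concepts cover the ground-truth concepts via cosine similarity; the paper's metrics are richer and also compare edge structure, and `fake_embed` is a deterministic stand-in for a real sentence encoder.

```python
import numpy as np

def fake_embed(text):
    """Deterministic random stand-in for a real sentence encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=32)

def soft_node_overlap(pred_concepts, true_concepts, embed):
    """Match every ground-truth concept to its most similar predicted concept by
    cosine similarity of embeddings and average the matches."""
    P = np.stack([embed(c) for c in pred_concepts])
    T = np.stack([embed(c) for c in true_concepts])
    P /= np.linalg.norm(P, axis=1, keepdims=True)
    T /= np.linalg.norm(T, axis=1, keepdims=True)
    sims = T @ P.T                       # (|true|, |pred|) cosine similarities
    return float(sims.max(axis=1).mean())

print(soft_node_overlap(["science", "physics"], ["science", "biology"], fake_embed))
```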
https://arxiv.org/abs/2410.23584
Research into community content moderation often assumes that moderation teams govern with a single, unified voice. However, recent work has found that moderators disagree with one another at modest, but concerning rates. The problem is not the root disagreements themselves. Subjectivity in moderation is unavoidable, and there are clear benefits to including diverse perspectives within a moderation team. Instead, the crux of the issue is that, due to resource constraints, moderation decisions end up being made by individual decision-makers. The result is decision-making that is inconsistent, which is frustrating for community members. To address this, we develop Venire, an ML-backed system for panel review on Reddit. Venire uses a machine learning model trained on log data to identify the cases where moderators are most likely to disagree. Venire fast-tracks these cases for multi-person review. Ideally, Venire allows moderators to surface and resolve disagreements that would have otherwise gone unnoticed. We conduct three studies through which we design and evaluate Venire: a set of formative interviews with moderators, technical evaluations on two datasets, and a think-aloud study in which moderators used Venire to make decisions on real moderation cases. Quantitatively, we demonstrate that Venire is able to improve decision consistency and surface latent disagreements. Qualitatively, we find that Venire helps moderators resolve difficult moderation cases more confidently. Venire represents a novel paradigm for human-AI content moderation, and shifts the conversation from replacing human decision-making to supporting it.
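A minimal sketch of the routing idea, assuming synthetic log-derived features and labels (Venire's actual model and features are not reproduced here): train a classifier to predict moderator disagreement and fast-track high-probability cases to panel review.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical log-derived features per moderation case (e.g. report count, rule
# ambiguity score, prior reversal rate) with synthetic disagreement labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)

clf = LogisticRegression().fit(X, y)

# Route cases whose predicted disagreement probability exceeds a threshold to a panel.
new_cases = rng.normal(size=(10, 3))
p_disagree = clf.predict_proba(new_cases)[:, 1]
print("cases fast-tracked for multi-person review:", np.where(p_disagree > 0.7)[0])
```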
https://arxiv.org/abs/2410.23448
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs' reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles. We found that LLMs could interpolate the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet fail when those puzzles are slightly perturbed, suggesting that the models heavily rely on memorization to solve those training puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. In-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers suggest that the LLMs learn to reason on K&K puzzles despite training data memorization. This phenomenon indicates that LLMs exhibit a complex interplay between memorization and genuine reasoning abilities. Finally, our analysis with per-sample memorization score sheds light on how LLMs switch between reasoning and memorization in solving logical puzzles. Our code and data are available at this https URL.
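One way to read the perturbation test is as an accuracy gap between training puzzles and their perturbed variants; the schematic function below assumes placeholder callables for the fine-tuned model and the puzzle perturbation, and is not the paper's exact memorization score.

```python
def memorization_gap(model_answer, training_puzzles, perturb):
    """Accuracy drop between training puzzles and slightly perturbed variants.

    model_answer(puzzle) -> predicted solution, standing in for the fine-tuned LLM;
    perturb(puzzle) -> (perturbed_puzzle, new_solution), e.g. flipping one statement
    in a Knights-and-Knaves instance.  A large gap suggests the training puzzles are
    solved by recall rather than reasoning."""
    train_acc = sum(model_answer(p) == s for p, s in training_puzzles) / len(training_puzzles)
    perturbed = [perturb(p) for p, _ in training_puzzles]
    pert_acc = sum(model_answer(p) == s for p, s in perturbed) / len(perturbed)
    return train_acc - pert_acc
```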
https://arxiv.org/abs/2410.23123
Fire patterns, consisting of fire effects that offer insights into fire behavior and origin, are traditionally classified based on investigators' visual observations, leading to subjective interpretations. This study proposes a framework for quantitative fire pattern classification to support fire investigators, aiming for consistency and accuracy. The framework integrates four components. First, it leverages human-computer interaction to extract fire patterns from surfaces, combining investigator expertise with computational analysis. Second, it employs an aspect ratio-based random forest model to classify fire pattern shapes. Third, fire scene point cloud segmentation enables precise identification of fire-affected areas and the mapping of 2D fire patterns to 3D scenes. Lastly, spatial relationships between fire patterns and indoor elements support an interpretation of the fire scene. These components provide a method for fire pattern analysis that synthesizes qualitative and quantitative data. The framework's classification results achieve 93% precision on synthetic data and 83% on real fire patterns.
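A small sketch of the second component under synthetic data: a random forest classifying fire-pattern shapes from an aspect ratio plus one extra shape descriptor; the features, labels, and class definitions here are illustrative assumptions, not the study's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical features per extracted pattern: aspect ratio plus a fill-ratio
# descriptor; labels are synthetic stand-ins for pattern shape classes.
rng = np.random.default_rng(42)
aspect_ratio = rng.uniform(0.3, 3.0, size=400)        # height / width of the pattern
extent = rng.uniform(0.2, 1.0, size=400)               # filled area / bounding-box area
X = np.column_stack([aspect_ratio, extent])
y = (aspect_ratio > 1.2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```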
https://arxiv.org/abs/2410.23105
Interpreting the decisions of Convolutional Neural Networks (CNNs) is essential for understanding their behavior, yet explainability remains a significant challenge, particularly for self-supervised models. Most existing methods for generating saliency maps rely on ground-truth labels, restricting their use to supervised tasks. EigenCAM is the only notable label-independent alternative, leveraging Singular Value Decomposition to generate saliency maps applicable across CNN models, but it does not fully exploit the tensorial structure of feature maps. In this work, we introduce the Tucker Saliency Map (TSM) method, which applies Tucker tensor decomposition to better capture the inherent structure of feature maps, producing more accurate singular vectors and values. These are used to generate high-fidelity saliency maps, effectively highlighting objects of interest in the input. We further extend EigenCAM and TSM into multivector variants, Multivec-EigenCAM and Multivector Tucker Saliency Maps (MTSM), which utilize all singular vectors and values, further improving saliency map quality. Quantitative evaluations on supervised classification models demonstrate that TSM, Multivec-EigenCAM, and MTSM achieve competitive performance with label-dependent methods. Moreover, TSM enhances explainability by approximately 50% over EigenCAM for both supervised and self-supervised models. Multivec-EigenCAM and MTSM further advance state-of-the-art explainability performance on self-supervised models, with MTSM achieving the best results.
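For reference, a minimal NumPy version of the label-free EigenCAM baseline the paper builds on (TSM would replace the SVD on the flattened map with a Tucker decomposition of the full (C, H, W) tensor, e.g. via tensorly); the mean-centering and normalization choices here are assumptions.

```python
import numpy as np

def eigencam(feature_map):
    """Label-free saliency from a CNN feature map of shape (C, H, W) via SVD on the
    flattened map; TSM would instead apply a Tucker decomposition to the full tensor
    (e.g. tensorly.decomposition.tucker) before forming the map."""
    c, h, w = feature_map.shape
    A = feature_map.reshape(c, h * w).T              # (H*W, C): positions x channels
    A = A - A.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(A, full_matrices=False)
    saliency = (A @ vt[0]).reshape(h, w)             # project onto first right singular vector
    saliency -= saliency.min()
    return saliency / (saliency.max() + 1e-8)

print(eigencam(np.random.rand(64, 7, 7)).shape)      # toy activations stand in for real ones
```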
https://arxiv.org/abs/2410.23072
Vocal education in music is difficult to quantify due to individual differences in singers' voices and the varying quantitative criteria for singing techniques. Deep learning has great potential in music education thanks to its efficiency in handling complex data and performing quantitative analysis. However, accurately evaluating rare vocal types, such as the mezzo-soprano, from limited samples requires extensive, well-annotated data to support deep learning models. To this end, we perform transfer learning, employing deep learning models pre-trained on the ImageNet and Urbansound8k datasets to improve the precision of vocal technique evaluation. Furthermore, we tackle the lack of samples by constructing a dedicated dataset, the Mezzo-soprano Vocal Set (MVS), for vocal technique assessment. Our experimental results indicate that transfer learning increases the overall accuracy (OAcc) of all models by an average of 8.3%, with the highest accuracy reaching 94.2%. We not only provide a novel approach to evaluating mezzo-soprano vocal techniques but also introduce a new quantitative assessment method for music education.
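A hedged sketch of the transfer-learning recipe with torchvision: load an ImageNet-pretrained backbone, replace the classification head with the task's label count, and fine-tune on spectrogram-like inputs; `NUM_TECHNIQUES`, the backbone choice, and the dummy batch are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_TECHNIQUES = 6   # hypothetical number of vocal-technique classes in an MVS-like dataset

# Start from an ImageNet-pretrained backbone (downloads weights) and swap the head.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_TECHNIQUES)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# One illustrative fine-tuning step on a dummy batch of 3-channel spectrogram images.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, NUM_TECHNIQUES, (8,))
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
print(float(loss))
```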
https://arxiv.org/abs/2410.23325
This research explores the interdisciplinary interaction between psychoanalysis and computer science, suggesting a mutually beneficial exchange. Indeed, psychoanalytic concepts can enrich technological applications involving unconscious, elusive aspects of the human factor, such as social media and other interactive digital platforms. Conversely, computer science, especially Artificial Intelligence (AI), can contribute quantitative concepts and methods to psychoanalysis, identifying patterns and emotional cues in human expression. In particular, this research aims to apply computer science methods to establish fundamental relationships between emotions and Lacanian discourses. Such relations are discovered in our approach via empirical investigation and statistical analysis, and are eventually validated in a theoretical (psychoanalytic) way. It is worth noting that, although emotions have been sporadically studied in Lacanian theory, to the best of our knowledge a systematic, detailed investigation of their role is missing. Such a fine-grained understanding of the role of emotions can also make the identification of Lacanian discourses more effective and easier in practice. In particular, our methods indicate the emotions with the highest differentiation power with respect to the corresponding discourses; conversely, for each discourse we identify the most characteristic emotions it admits. In fact, we develop a method, which we call Lacanian Discourse Discovery (LDD), that simplifies (by systematizing) the identification of Lacanian discourses in texts. Although the main contribution of this paper is inherently theoretical (psychoanalytic), it can also facilitate major practical applications in the realm of interactive digital systems. Indeed, our approach can be automated through Artificial Intelligence methods that effectively identify emotions (and the corresponding discourses) in texts.
https://arxiv.org/abs/2410.22895
Facial-part swapping aims to selectively transfer regions of interest from a source image onto a target image while keeping the rest of the target image unchanged. Most face-swapping studies are designed specifically for full-face swapping and are either unable to swap individual facial parts or significantly limited in doing so, which hinders fine-grained and customized character design. Designing an approach specifically for facial-part swapping is challenging, however, because it requires a multiple-reference feature fusion that is both efficient and effective. To overcome this challenge, FuseAnyPart is proposed to facilitate seamless "fuse-any-part" customization of the face. In FuseAnyPart, facial parts from different people are assembled into a complete face in latent space within the Mask-based Fusion Module. The consolidated feature is then dispatched to the Addition-based Injection Module for fusion within the UNet of the diffusion model to create novel characters. Extensive experiments qualitatively and quantitatively validate the superiority and robustness of FuseAnyPart. Source code is available at this https URL.
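A toy version of mask-based latent fusion (the spirit of the Mask-based Fusion Module, not its implementation): part latents from different identities are composited with a binary mask before being passed on; the latent shape and mask region below are made up.

```python
import numpy as np

h = w = 32
latent_target = np.random.rand(4, h, w)        # latent of the target face
latent_eyes = np.random.rand(4, h, w)          # latent of the person providing the eyes
mask_eyes = np.zeros((1, h, w))
mask_eyes[:, 8:14, 6:26] = 1.0                 # made-up eye region

# Composite the part latent into the target latent under the binary mask.
fused = latent_target * (1.0 - mask_eyes) + latent_eyes * mask_eyes
print(fused.shape)                             # fused latent passed on for injection
```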
https://arxiv.org/abs/2410.22771
Inverse rendering pipelines are gaining prominence in realizing photo-realistic reconstruction of real-world objects for emulating them in virtual reality scenes. Apart from material reflectances, spectral rendering and the spectral power distributions (SPDs) of in-scene illuminants play important roles in producing photo-realistic images. We present a simple, low-cost technique to capture and reconstruct the SPD of uniform illuminants. Instead of requiring a costly spectrometer for such measurements, our method uses a diffractive compact disk (CD-ROM) and a machine learning approach for accurate estimation. We show that our method works well for spotlights in simulations and in a few real-world examples. The presented results clearly demonstrate the reliability of our approach through quantitative and qualitative evaluations, especially in the spectral rendering of iridescent materials.
https://arxiv.org/abs/2410.22679
Magnetic Resonance Fingerprinting (MRF) is a time-efficient approach to quantitative MRI, enabling the mapping of multiple tissue properties from a single, accelerated scan. However, achieving accurate reconstructions remains challenging, particularly in highly accelerated and undersampled acquisitions, which are crucial for reducing scan times. While deep learning techniques have advanced image reconstruction, the recent introduction of diffusion models offers new possibilities for imaging tasks, though their application in the medical field is still emerging. Notably, diffusion models have not yet been explored for the MRF problem. In this work, we propose for the first time a conditional diffusion probabilistic model for MRF image reconstruction. Qualitative and quantitative comparisons on in-vivo brain scan data demonstrate that the proposed approach can outperform established deep learning and compressed sensing algorithms for MRF reconstruction. Extensive ablation studies also explore strategies to improve computational efficiency of our approach.
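For readers unfamiliar with conditional diffusion, here is a generic DDPM-style training step in which a network predicts the noise added to tissue-parameter maps while being conditioned on an undersampled reconstruction by channel concatenation; the tiny CNN, noise schedule, and tensor shapes are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

# Tiny placeholder network: input is noisy parameter maps concatenated with the condition.
model = nn.Sequential(nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 2, 3, padding=1))

x0 = torch.randn(8, 2, 32, 32)       # toy tissue-parameter maps (e.g. T1/T2)
cond = torch.randn(8, 2, 32, 32)     # toy conditioning, e.g. an aliased undersampled recon
t = torch.randint(0, T, (8,))
noise = torch.randn_like(x0)
ab = alphas_bar[t].view(-1, 1, 1, 1)
x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * noise   # forward diffusion to step t

pred_noise = model(torch.cat([x_t, cond], dim=1))  # condition by channel concatenation
loss = ((pred_noise - noise) ** 2).mean()          # standard epsilon-prediction objective
loss.backward()
print(float(loss))
```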
https://arxiv.org/abs/2410.23318
We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at this https URL.
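The round-trip evaluation can be summarized as "render the prediction, then compare images"; the sketch below assumes a placeholder `render` callable and uses SSIM as a stand-in for one of the five metrics.

```python
import numpy as np
from skimage.metrics import structural_similarity

def round_trip_score(input_image, predicted_structure, render):
    """Render the predicted structure (e.g. LaTeX or HTML) back to an image and score
    it against the input; `render` is a placeholder for a real renderer returning a
    grayscale array of the same shape, and SSIM stands in for one of the five metrics."""
    output_image = render(predicted_structure)
    return structural_similarity(
        input_image, output_image,
        data_range=input_image.max() - input_image.min(),
    )

# Toy check with an identity "renderer" that returns a stored image (score = 1.0).
img = np.random.rand(64, 64)
print(round_trip_score(img, "x^2", render=lambda s: img))
```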
https://arxiv.org/abs/2410.22456
Modern deep learning models often make predictions by focusing on irrelevant areas, leading to biased performance and limited generalization. Existing methods aimed at rectifying model attention require explicit labels for irrelevant areas or complex pixel-wise ground truth attention maps. We present CRAYON (Correcting Reasoning with Annotations of Yes Or No), offering effective, scalable, and practical solutions to rectify model attention using simple yes-no annotations. CRAYON empowers classical and modern model interpretation techniques to identify and guide model reasoning: CRAYON-ATTENTION directs classic interpretations based on saliency maps to focus on relevant image regions, while CRAYON-PRUNING removes irrelevant neurons identified by modern concept-based methods to mitigate their influence. Through extensive experiments with both quantitative and human evaluation, we showcase CRAYON's effectiveness, scalability, and practicality in refining model attention. CRAYON achieves state-of-the-art performance, outperforming 12 methods across 3 benchmark datasets, surpassing approaches that require more complex annotations.
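As a schematic of the pruning idea only (not CRAYON-PRUNING's concept-based selection), the snippet below zeroes the convolution channels marked irrelevant so they no longer contribute downstream; the layer shape and channel indices are invented.

```python
import torch
import torch.nn as nn

layer = nn.Conv2d(16, 32, kernel_size=3, padding=1)
irrelevant_channels = [3, 7, 21]       # invented indices marked "irrelevant" via yes/no feedback

with torch.no_grad():
    layer.weight[irrelevant_channels] = 0.0   # silence the selected output channels
    layer.bias[irrelevant_channels] = 0.0

out = layer(torch.randn(1, 16, 8, 8))
print(out[0, irrelevant_channels].abs().max())   # ~0: pruned channels carry no signal
```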
https://arxiv.org/abs/2410.22312
Gradual semantics have demonstrated great potential in argumentation, in particular for deploying quantitative bipolar argumentation frameworks (QBAFs) in a number of real-world settings, from judgmental forecasting to explainable AI. In this paper, we provide a novel methodology for obtaining gradual semantics for structured argumentation frameworks, where the building blocks of arguments and the relations between them are known, unlike in QBAFs, where arguments are abstract entities. Differently from existing approaches, our methodology accommodates incomplete information about arguments' premises. We demonstrate the potential of our approach by introducing two different instantiations of the methodology, leveraging existing gradual semantics for QBAFs in these more complex frameworks. We also define a set of novel properties for gradual semantics in structured argumentation and discuss their suitability relative to a set of existing properties. Finally, we provide a comprehensive theoretical analysis assessing the instantiations, demonstrating their advantages over existing gradual semantics for QBAFs and structured argumentation.
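As one concrete example of an existing gradual semantics for QBAFs that such a methodology could lift to structured frameworks, here is a small implementation of DF-QuAD on an acyclic QBAF; the toy base scores and attack/support relations are assumptions.

```python
def dfquad_strength(arg, base, attackers, supporters, memo=None):
    """DF-QuAD gradual semantics on an acyclic QBAF: aggregate attacker and supporter
    strengths, then move the base score down or up accordingly."""
    if memo is None:
        memo = {}
    if arg in memo:
        return memo[arg]

    def aggregate(children):
        prod = 1.0
        for c in children:
            prod *= 1.0 - dfquad_strength(c, base, attackers, supporters, memo)
        return 1.0 - prod if children else 0.0

    va = aggregate(attackers.get(arg, []))
    vs = aggregate(supporters.get(arg, []))
    b = base[arg]
    strength = b - b * (va - vs) if va >= vs else b + (1.0 - b) * (vs - va)
    memo[arg] = strength
    return strength

# Toy QBAF: "a" (base 0.5) is attacked by "b" (0.8) and supported by "c" (0.6).
base = {"a": 0.5, "b": 0.8, "c": 0.6}
print(dfquad_strength("a", base, attackers={"a": ["b"]}, supporters={"a": ["c"]}))  # 0.4
```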
https://arxiv.org/abs/2410.22209
As more applications of large language models (LLMs) for 3D content for immersive environments emerge, it is crucial to study user behaviour to identify interaction patterns and potential barriers to guide the future design of immersive content creation and editing systems which involve LLMs. In an empirical user study with 12 participants, we combine quantitative usage data with post-experience questionnaire feedback to reveal common interaction patterns and key barriers in LLM-assisted 3D scene editing systems. We identify opportunities for improving natural language interfaces in 3D design tools and propose design recommendations for future LLM-integrated 3D content creation systems. Through an empirical study, we demonstrate that LLM-assisted interactive systems can be used productively in immersive environments.
https://arxiv.org/abs/2410.22177
Reconstructing controllable Gaussian splats from monocular video is a challenging task due to its inherently insufficient constraints. Widely adopted approaches supervise complex interactions with additional masks and control-signal annotations, limiting their real-world applicability. In this paper, we propose an annotation-guidance-free method, dubbed FreeGaussian, that mathematically derives dynamic Gaussian motion from optical flow and camera motion using novel dynamic Gaussian constraints. By establishing a connection between 2D flows and 3D Gaussian dynamic control, our method enables self-supervised optimization and continuity of dynamic Gaussian motions from flow priors. Furthermore, we introduce a 3D spherical vector controlling scheme, which represents the state with a 3D Gaussian trajectory, thereby eliminating the need for complex 1D control-signal calculations and simplifying controllable Gaussian modeling. Quantitative and qualitative evaluations in extensive experiments demonstrate the state-of-the-art visual performance and control capability of our method. Project page: this https URL.
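A schematic of the flow decomposition the method builds on: the portion of observed 2D flow not explained by camera motion is attributed to Gaussian dynamics. The intrinsics, poses, Gaussian center, and observed flow below are invented numbers, and the paper's actual constraints are richer than this subtraction.

```python
import numpy as np

def project(K, T_wc, p_world):
    """Pinhole projection of a 3D point into pixels, given a world-to-camera pose."""
    p_cam = T_wc[:3, :3] @ p_world + T_wc[:3, 3]
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T_t = np.eye(4)                          # camera pose at frame t (invented)
T_t1 = np.eye(4); T_t1[0, 3] = -0.02     # slightly translated pose at frame t+1
center = np.array([0.1, 0.0, 2.0])       # a Gaussian center in world coordinates

camera_flow = project(K, T_t1, center) - project(K, T_t, center)   # flow explained by the camera
observed_flow = np.array([8.0, 0.5])                               # e.g. from an optical-flow network
dynamic_flow = observed_flow - camera_flow                         # attributed to Gaussian motion
print(camera_flow, dynamic_flow)
```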
https://arxiv.org/abs/2410.22070
Ensuring the safety of large language model (LLM) applications is essential for developing trustworthy artificial intelligence. Current LLM safety benchmarks have two limitations. First, they focus solely on either discriminative or generative evaluation paradigms while ignoring their interconnection. Second, they rely on standardized inputs, overlooking the effects of widespread prompting techniques, such as system prompts, few-shot demonstrations, and chain-of-thought prompting. To overcome these issues, we developed SG-Bench, a novel benchmark to assess the generalization of LLM safety across various tasks and prompt types. This benchmark integrates both generative and discriminative evaluation tasks and includes extended data to examine the impact of prompt engineering and jailbreak on LLM safety. Our assessment of 3 advanced proprietary LLMs and 10 open-source LLMs with the benchmark reveals that most LLMs perform worse on discriminative tasks than generative ones, and are highly susceptible to prompts, indicating poor generalization in safety alignment. We also explain these findings quantitatively and qualitatively to provide insights for future research.
https://arxiv.org/abs/2410.21965
The modeling of industrial scenes is essential for simulations in industrial manufacturing. While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to their demand for precise measurements and positioning, requiring complex planning over spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experiment results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching up to 81.0% success rate in real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs to integrate into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and data are available at this https URL .
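A hypothetical example of the layout-verification step on a calculable layout format (the agent emits C# for the real scene; this Python check is only illustrative): reject placements that leave the floor bounds or collide with each other, and feed the errors back for iterative refinement.

```python
def overlaps(a, b):
    """Axis-aligned overlap test between two footprints given as (x, y, width, depth)."""
    ax, ay, aw, ad = a
    bx, by, bw, bd = b
    return ax < bx + bw and bx < ax + aw and ay < by + bd and by < ay + ad

def verify_layout(items, floor_w, floor_d):
    """Collect out-of-bounds placements and pairwise collisions in a calculable layout.
    `items` maps a name to a hypothetical (x, y, width, depth) footprint in metres."""
    errors = []
    for name, (x, y, w, d) in items.items():
        if x < 0 or y < 0 or x + w > floor_w or y + d > floor_d:
            errors.append(f"{name} exceeds the floor bounds")
    names = list(items)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if overlaps(items[a], items[b]):
                errors.append(f"{a} collides with {b}")
    return errors

layout = {"conveyor": (0.0, 0.0, 6.0, 1.2), "robot_arm": (5.5, 0.5, 1.5, 1.5)}
print(verify_layout(layout, floor_w=10.0, floor_d=8.0))   # reports the collision for refinement
```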
https://arxiv.org/abs/2410.21909