Attackers can deliberately perturb classifiers' input with subtle noise, altering final predictions. Among proposed countermeasures, adversarial purification employs generative networks to preprocess input images, filtering out adversarial noise. In this study, we propose to use a specific class of generators, termed Multiple Latent Variable Generative Models (MLVGMs), for adversarial purification. These models possess multiple latent variables that naturally disentangle coarse from fine features. Taking advantage of these properties, we autoencode images to maintain class-relevant information, while discarding and re-sampling any detail, including adversarial noise. The procedure is completely training-free, exploiting the generalization abilities of pre-trained MLVGMs on the adversarial purification downstream task. Even without large models trained on billions of samples, we show that smaller MLVGMs are already competitive with traditional methods and can be used as foundation models. Official code released at this https URL.
https://arxiv.org/abs/2412.03453
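A minimal sketch of the purification recipe described above: keep the class-relevant coarse latent inferred from the input and re-sample the fine latent from the prior before regenerating the image. The tiny encoder, generator, and latent split below are toy stand-ins, not the authors' pre-trained MLVGMs.

```python
# Toy sketch of MLVGM-style purification: keep the coarse latent inferred
# from the (possibly adversarial) input and re-sample the fine latent from
# the prior, so high-frequency perturbations are discarded on regeneration.
# The tiny linear modules stand in for a real pre-trained MLVGM.
import torch
import torch.nn as nn

IMG_DIM, COARSE_DIM, FINE_DIM = 64, 8, 16

class ToyMLVGM(nn.Module):
    def __init__(self):
        super().__init__()
        self.decode = nn.Linear(COARSE_DIM + FINE_DIM, IMG_DIM)

    def forward(self, z_coarse, z_fine):
        return torch.tanh(self.decode(torch.cat([z_coarse, z_fine], dim=-1)))

class ToyEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(IMG_DIM, COARSE_DIM + FINE_DIM)

    def forward(self, x):
        z = self.enc(x)
        return z[..., :COARSE_DIM], z[..., COARSE_DIM:]

def purify(x_adv, encoder, generator):
    """Autoencode the input but re-sample every fine latent from the prior."""
    with torch.no_grad():
        z_coarse, _ = encoder(x_adv)                     # keep class-relevant structure
        z_fine = torch.randn(x_adv.shape[0], FINE_DIM)   # discard & re-sample detail
        return generator(z_coarse, z_fine)

x_adv = torch.randn(4, IMG_DIM)                          # stand-in for attacked images
x_clean = purify(x_adv, ToyEncoder(), ToyMLVGM())
print(x_clean.shape)                                     # (4, 64)
```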
Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during next-token prediction. This results in significant contextual information loss for certain layers, leading to notable performance decline. To address this, we present PrefixKV. It reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance compared with prior approaches. It exhibits a superior trade-off between inference efficiency and generation quality, showing promising potential for practical applications. Code is available at \url{this https URL}.
https://arxiv.org/abs/2412.03409
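A minimal sketch of the layer-wise KV retention idea: given per-layer importance scores for cached positions, a binary search over a global importance threshold decides how many entries each layer keeps, so layers get different cache sizes while the total stays within a budget. The scoring function and search details below are illustrative assumptions, not the released PrefixKV implementation.

```python
# Sketch of adaptive layer-wise KV retention via binary search on a global
# importance threshold: each layer keeps the cached positions whose score
# exceeds the threshold, so layers end up with different cache sizes while
# the total stays within a budget. Scores here are random placeholders.
import numpy as np

def layer_budgets(importance_per_layer, total_budget, iters=40):
    """importance_per_layer: list of 1-D arrays of per-position scores."""
    lo = min(s.min() for s in importance_per_layer)
    hi = max(s.max() for s in importance_per_layer)
    for _ in range(iters):
        mid = (lo + hi) / 2
        kept = sum(int((s >= mid).sum()) for s in importance_per_layer)
        if kept > total_budget:
            lo = mid          # threshold too permissive -> raise it
        else:
            hi = mid
    return [int((s >= hi).sum()) for s in importance_per_layer]

rng = np.random.default_rng(0)
scores = [rng.random(1024) ** (l + 1) for l in range(32)]   # 32 layers, 1024 positions
budgets = layer_budgets(scores, total_budget=8192)
print(sum(budgets), budgets[:4])   # total close to 8192, uneven across layers
```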
A key stumbling block in effective supply chain risk management for companies and policymakers is a lack of visibility into interdependent supply network relationships. Relationship prediction, also called link prediction, is an emerging area of supply chain surveillance research that aims to increase the visibility of supply chains using data-driven techniques. Existing methods have been successful at predicting relationships but struggle to extract the context in which these relationships are embedded - such as the products being supplied or the locations they are supplied from. Lack of context prevents practitioners from distinguishing transactional relations from established supply chain relations, hindering accurate estimations of risk. In this work, we develop a new Generative Artificial Intelligence (Gen AI) enhanced machine learning framework that leverages pre-trained language models as embedding models, combined with machine learning models, to predict supply chain relationships within knowledge graphs. By integrating Generative AI techniques, our approach captures the nuanced semantic relationships between entities, thereby improving supply chain visibility and facilitating more precise risk management. Using data from a real case study, we show that GenAI-enhanced link prediction surpasses all benchmarks, and we demonstrate how GenAI models can be explored and effectively used in supply chain risk management.
https://arxiv.org/abs/2412.03390
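One plausible reading of the pipeline, language-model embeddings of entity descriptions fed to a lightweight classifier for link prediction, is sketched below. The hash-based embed() function and the toy company pairs are stand-ins; the paper's actual embedding models, features, and case-study data are not reproduced here.

```python
# Sketch: embed textual descriptions of two companies, concatenate the
# vectors, and train a classifier to predict whether a supply relationship
# exists. The hash-based embed() is only a stand-in for a pre-trained
# language model encoder; no real supply-chain data is used here.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(text, dim=64):
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=dim)            # placeholder for an LM embedding

pairs = [("steel supplier in Ohio", "car assembly plant", 1),
         ("bakery in Lyon", "semiconductor fab", 0),
         ("lithium mine", "battery cell maker", 1),
         ("toy retailer", "copper smelter", 0)]

X = np.stack([np.concatenate([embed(a), embed(b)]) for a, b, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = LogisticRegression(max_iter=1000).fit(X, y)
query = np.concatenate([embed("nickel refinery"), embed("stainless steel mill")])
print(clf.predict_proba(query[None])[0])    # probability of a supply link
```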
Human speech exhibits rich and flexible prosodic variations. To address the one-to-many mapping problem from text to prosody in a reasonable and flexible manner, we propose DiffStyleTTS, a multi-speaker acoustic model based on a conditional diffusion module and an improved classifier-free guidance, which hierarchically models speech prosodic features and controls different prosodic styles to guide prosody prediction. Experiments show that our method outperforms all baselines in naturalness and achieves superior synthesis speed compared to three diffusion-based baselines. Additionally, by adjusting the guidance scale, DiffStyleTTS effectively controls the guidance intensity of the synthesized prosody.
https://arxiv.org/abs/2412.03388
Vision-language models (VLMs) have shown remarkable success across various multi-modal tasks, yet large VLMs encounter significant efficiency challenges due to processing numerous visual tokens. A promising approach to accelerating large VLM inference is using partial information, such as attention maps from specific layers, to assess token importance and prune less essential tokens. However, our study reveals three key insights: (i) Partial attention information is insufficient for accurately identifying critical visual tokens, resulting in suboptimal performance, especially at low token retention ratios; (ii) Global attention information, such as the attention map aggregated across all layers, more effectively preserves essential tokens and maintains comparable performance under aggressive pruning. However, obtaining the attention maps from all layers requires a full inference pass, which increases computational load and is therefore impractical in existing methods; and (iii) The global attention map aggregated from a small VLM closely resembles that of a large VLM, suggesting an efficient alternative. Based on these findings, we introduce a \textbf{training-free} method, \underline{\textbf{S}}mall VLM \underline{\textbf{G}}uidance for accelerating \underline{\textbf{L}}arge VLMs (\textbf{SGL}). Specifically, we employ the attention map aggregated from a small VLM to guide visual token pruning in a large VLM. Additionally, an early exiting mechanism is developed to make full use of the small VLM's predictions, dynamically invoking the larger VLM only when necessary, yielding a superior trade-off between accuracy and computation. Extensive evaluations across 11 benchmarks demonstrate the effectiveness and generalizability of SGL, achieving up to a 91\% pruning ratio for visual tokens while retaining competitive performance.
https://arxiv.org/abs/2412.03324
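A rough sketch of the guidance step: attention from the small VLM, aggregated over layers and heads, ranks visual tokens by how much attention the text tokens pay them, and only the top fraction is kept for the large VLM. The aggregation rule, token layout, and early-exit criterion below are simplified assumptions, not the SGL implementation.

```python
# Sketch of small-VLM-guided visual token pruning: aggregate the small
# model's attention over layers/heads, rank visual tokens by how much
# attention the text tokens pay them, and keep only the top fraction for
# the large model. An early exit skips the large model when the small
# model is already confident. Tensors are random placeholders.
import torch

def select_visual_tokens(attn_maps, num_text, keep_ratio=0.09):
    """attn_maps: (layers, heads, seq, seq); visual tokens precede text tokens."""
    agg = attn_maps.mean(dim=(0, 1))                 # average over layers and heads
    num_visual = agg.shape[0] - num_text
    text_to_visual = agg[num_visual:, :num_visual]   # rows: text queries
    scores = text_to_visual.sum(dim=0)               # importance per visual token
    k = max(1, int(keep_ratio * num_visual))
    return scores.topk(k).indices.sort().values      # indices to feed the large VLM

def early_exit(small_logits, threshold=0.8):
    return small_logits.softmax(-1).max().item() >= threshold

attn = torch.rand(24, 16, 640, 640)                  # small VLM: 576 visual + 64 text tokens
kept = select_visual_tokens(attn, num_text=64)
print(kept.numel(), "visual tokens kept")            # ~9% of 576
```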
Demand for bike sharing is impacted by various factors, such as weather conditions, events, and the availability of other transportation modes. This impact remains elusive due to the complex interdependence of these factors and location-related variations in user behavior. It is also unclear which factors provide additional information that is not already contained in the historical demand. Intermodal dependencies between bike sharing and other modes are also underexplored, and the value of this information has not been studied in degraded situations. The proposed study analyzes the impact of adding contextual data, such as weather, time embedding, and road traffic flow, to predict bike-sharing Origin-Destination (OD) flows in atypical weather situations. Our study highlights a mild relationship between the prediction quality of bike-sharing demand and road traffic flow, while the introduced time embedding allows outperforming state-of-the-art results, particularly in the case of degraded weather conditions. Including weather data as an additional input further improves our model with respect to the basic ST-ED-RMGC prediction model, reducing the prediction error by more than 20% in degraded weather conditions.
https://arxiv.org/abs/2412.03307
In this paper, we present DiffusionVLA, a novel framework that seamlessly combines an autoregressive model with a diffusion model for learning visuomotor policies. Central to our approach is a next-token prediction objective, enabling the model to reason effectively over the user's query in the context of current observations. Subsequently, a diffusion model is attached to generate robust action outputs. To enhance policy learning through self-reasoning, we introduce a novel reasoning injection module that integrates reasoning phrases directly into the policy learning process. The whole framework is simple and flexible, making it easy to deploy and upgrade. We conduct extensive experiments using multiple real robots to validate the effectiveness of DiffusionVLA. Our tests include a challenging factory sorting task, where DiffusionVLA successfully categorizes objects, including those not seen during training. We observe that the reasoning module makes the model interpretable: it allows observers to understand the model's thought process and identify potential causes of policy failures. Additionally, we test DiffusionVLA on a zero-shot bin-picking task, achieving 63.7\% accuracy on 102 previously unseen objects. Our method demonstrates robustness to visual changes, such as distractors and new backgrounds, and easily adapts to new embodiments. Furthermore, DiffusionVLA can follow novel instructions and retain conversational ability. Notably, DiffusionVLA is data-efficient and fast at inference; our smallest DiffusionVLA-2B runs at 82 Hz on a single A6000 GPU and can be trained from scratch on fewer than 50 demonstrations for a complex task. Finally, we scale the model from 2B to 72B parameters, showcasing improved generalization capabilities with increased model size.
https://arxiv.org/abs/2412.03293
In smart mobility, large networks of geographically distributed sensors produce vast amounts of high-frequency spatio-temporal data that must be processed in real time to avoid major disruptions. Traditional centralized approaches are increasingly unsuitable for this task, as they struggle to scale with expanding sensor networks, and reliability issues in central components can easily affect the whole deployment. To address these challenges, we explore and adapt semi-decentralized training techniques for Spatio-Temporal Graph Neural Networks (ST-GNNs) in the smart mobility domain. We implement a simulation framework where sensors are grouped by proximity into multiple cloudlets, each handling a subgraph of the traffic graph, fetching node features from other cloudlets to train its own local ST-GNN model, and exchanging model updates with other cloudlets to ensure consistency, enhancing scalability and removing reliance on a centralized aggregator. We perform an extensive comparative evaluation of four different ST-GNN training setups -- centralized, traditional FL, server-free FL, and Gossip Learning -- on two large-scale traffic datasets, METR-LA and PeMS-BAY, for short-, mid-, and long-term vehicle speed prediction. Experimental results show that semi-decentralized setups are comparable to centralized approaches in performance metrics, while offering advantages in terms of scalability and fault tolerance. In addition, we highlight often overlooked issues in the existing literature on distributed ST-GNNs, such as the variation in model performance across different geographical areas due to region-specific traffic patterns, and the significant communication overhead and computational costs that arise from the large receptive field of GNNs, leading to substantial data transfers and increased computation of partial embeddings.
https://arxiv.org/abs/2412.03188
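The "exchanging model updates with other cloudlets" step can be illustrated with a simple gossip-style parameter-averaging round. The tiny local models and the ring topology below are toy placeholders, not the actual cloudlet simulation framework or its ST-GNN architecture.

```python
# Sketch of one gossip round between cloudlets: after local training on its
# subgraph, each cloudlet averages parameters with a randomly chosen
# neighbor, removing the need for a central aggregator. The tiny models and
# the ring topology are toy stand-ins.
import random
import torch
import torch.nn as nn

def gossip_round(models, neighbors):
    for i, model in enumerate(models):
        j = random.choice(neighbors[i])
        with torch.no_grad():
            for p_i, p_j in zip(model.parameters(), models[j].parameters()):
                avg = (p_i + p_j) / 2
                p_i.copy_(avg)
                p_j.copy_(avg)

num_cloudlets = 4
models = [nn.Linear(8, 1) for _ in range(num_cloudlets)]            # local ST-GNN stand-ins
neighbors = {i: [(i - 1) % num_cloudlets, (i + 1) % num_cloudlets]  # ring of cloudlets
             for i in range(num_cloudlets)}
gossip_round(models, neighbors)
print(models[0].weight[0, :3])   # parameters now mixed with a neighbor's
```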
Multi-Task Learning (MTL) involves the concurrent training of multiple tasks, offering notable advantages for dense prediction tasks in computer vision. MTL not only reduces training and inference time compared to having multiple single-task models, but also enhances task accuracy through the interaction of multiple tasks. However, existing methods face limitations. They often rely on suboptimal cross-task interactions, resulting in task-specific predictions with poor geometric and predictive coherence. In addition, many approaches use inadequate loss weighting strategies, which do not address the inherent variability in task evolution during training. To overcome these challenges, we propose an advanced MTL model specifically designed for dense vision tasks. Our model leverages state-of-the-art vision transformers with task-specific decoders. To enhance cross-task coherence, we introduce a trace-back method that improves both cross-task geometric and predictive features. Furthermore, we present a novel dynamic task balancing approach that projects task losses onto a common scale and prioritizes more challenging tasks during training. Extensive experiments demonstrate the superiority of our method, establishing new state-of-the-art performance across two benchmark datasets. The code is available at: this https URL
https://arxiv.org/abs/2412.03179
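The abstract does not spell out the exact balancing rule, so the sketch below shows one plausible realization: normalize each task loss by its initial value to put losses on a common scale, then up-weight the tasks whose normalized loss is currently largest (i.e., the harder tasks). This is an illustrative assumption, not the paper's method.

```python
# Sketch of dynamic task balancing: losses are projected onto a common
# scale by dividing by their initial values, and weights are a softmax over
# the normalized losses so harder (slower-improving) tasks get larger
# weights. One plausible scheme, not the paper's exact rule.
import torch

class DynamicTaskBalancer:
    def __init__(self, temperature=1.0):
        self.initial = None
        self.temperature = temperature

    def weights(self, losses):
        losses = torch.stack([l.detach() for l in losses])
        if self.initial is None:
            self.initial = losses.clone()
        normalized = losses / (self.initial + 1e-8)       # common scale
        w = torch.softmax(normalized / self.temperature, dim=0)
        return w * len(losses)                            # keep weights summing to num tasks

balancer = DynamicTaskBalancer()
seg_loss, depth_loss, normal_loss = (torch.tensor(v) for v in (1.2, 0.8, 0.5))
w = balancer.weights([seg_loss, depth_loss, normal_loss])
total = (w * torch.stack([seg_loss, depth_loss, normal_loss])).sum()
print(w, total)
```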
Deep Learning has shown outstanding results in computer vision tasks; healthcare is no exception. However, there is no straightforward way to expose the decision-making process of DL models. Good accuracy is not enough for skin cancer predictions. Understanding the model's behavior is crucial for clinical application and reliable outcomes. In this work, we identify desiderata for explanations in skin-lesion models. We analyzed seven methods, four based on pixel-attribution (Grad-CAM, Score-CAM, LIME, SHAP) and three on high-level concepts (ACE, ICE, CME), for a deep neural network trained on the International Skin Imaging Collaboration Archive. Our findings indicate that while these techniques reveal biases, there is room for improving the comprehensiveness of explanations to achieve transparency in skin-lesion models.
https://arxiv.org/abs/2412.03166
Text serves as the key control signal in video generation due to its narrative nature. To render text descriptions into video clips, current video diffusion models borrow features from text encoders yet struggle with limited text comprehension. The recent success of large language models (LLMs) showcases the power of decoder-only transformers, which offers three clear benefits for text-to-video (T2V) generation, namely, precise text understanding resulting from the superior scalability, imagination beyond the input text enabled by next token prediction, and flexibility to prioritize user interests through instruction tuning. Nevertheless, the feature distribution gap emerging from the two different text modeling paradigms hinders the direct use of LLMs in established T2V models. This work addresses this challenge with Mimir, an end-to-end training framework featuring a carefully tailored token fuser to harmonize the outputs from text encoders and LLMs. Such a design allows the T2V model to fully leverage learned video priors while capitalizing on the text-related capability of LLMs. Extensive quantitative and qualitative results demonstrate the effectiveness of Mimir in generating high-quality videos with excellent text comprehension, especially when processing short captions and managing shifting motions. Project page: this https URL
https://arxiv.org/abs/2412.03085
Strategy learning in multi-agent game environments is a challenging problem. Since each agent's reward is determined by the joint strategy, a greedy learning strategy that aims to maximize its own reward may fall into a local optimum. Recent studies have proposed opponent modeling and shaping methods for game environments. These methods enhance the efficiency of strategy learning by modeling the strategies and update processes of other agents. However, they often rely on simple predictions of opponent strategy changes. Because they do not model behavioral preferences such as cooperation and competition, they are usually applicable only to predefined scenarios and lack generalization capabilities. In this paper, we propose a novel Preference-based Opponent Shaping (PBOS) method to enhance the strategy learning process by shaping agents' preferences towards cooperation. We introduce a preference parameter, which is incorporated into the agent's loss function, thus allowing the agent to directly consider the opponent's loss function when updating its strategy. We update the preference parameters concurrently with strategy learning to ensure that agents can adapt to any cooperative or competitive game environment. Through a series of experiments, we verify the performance of the PBOS algorithm in a variety of differentiable games. The experimental results show that PBOS can guide the agent to learn appropriate preference parameters, so as to achieve better reward distribution in multiple game environments.
https://arxiv.org/abs/2412.03072
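A tiny illustration of the preference idea: each agent's shaped loss adds a learnable preference weight times the opponent's loss, and the preference is updated alongside the strategy. The bilinear-quadratic game and the one-step-lookahead preference update below are toy assumptions, not the paper's exact algorithm.

```python
# Toy sketch of preference-based shaping in a differentiable game: agent 1
# minimizes its own loss plus a learnable preference weight times agent 2's
# loss, and the preference is tuned with a one-step lookahead so that it
# ends up helping agent 1's own objective.
import torch

x = torch.tensor(0.5, requires_grad=True)    # agent 1 strategy
y = torch.tensor(-0.3, requires_grad=True)   # agent 2 strategy
w1 = torch.tensor(0.0, requires_grad=True)   # agent 1's preference toward agent 2
lr = 0.1

def losses(a, b):
    l1 = (a - b) ** 2 + 0.1 * a ** 2          # agent 1 wants to match agent 2
    l2 = (a + b) ** 2 + 0.1 * b ** 2          # agent 2 wants to mirror agent 1
    return l1, l2

for step in range(200):
    l1, l2 = losses(x, y)
    shaped = l1 + w1 * l2                                      # preference-shaped loss
    gx, = torch.autograd.grad(shaped, x, create_graph=True)
    gy, = torch.autograd.grad(l2, y, create_graph=True)
    x_next, y_next = x - lr * gx, y - lr * gy                  # virtual lookahead step
    l1_next, _ = losses(x_next, y_next)
    gw, = torch.autograd.grad(l1_next, w1)                     # does the preference help me later?
    with torch.no_grad():
        x -= lr * gx
        y -= lr * gy
        w1 -= lr * gw

print(float(x), float(y), float(w1))
```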
Transformer-based architectures have achieved unprecedented success in time series analysis. However, when facing the challenge of cross-domain modeling, existing approaches that utilize statistical priors as prompt engineering fail under the large distribution shifts among various domains. In this paper, a Unified Time Series Diffusion (UTSD) model is established for the first time to model the multi-domain probability distribution, utilizing the powerful distribution modeling ability of diffusion models. Unlike autoregressive models that capture the conditional probabilities of the prediction horizon given the historical sequence, we use a diffusion denoising process to model the mixture distribution of the cross-domain data and generate the prediction sequence for the target domain directly via conditional sampling. The proposed UTSD contains three pivotal designs: (1) a condition network that captures the multi-scale fluctuation patterns from the observation sequence, which are utilized as context representations to guide the denoising network in generating the prediction sequence; (2) an adapter-based fine-tuning strategy, in which the multi-domain universal representation learned in the pre-training stage is utilized for downstream tasks in target domains; and (3) a diffusion and denoising process in the actual sequence space, combined with an improved classifier-free guidance as the conditional generation strategy, which greatly improves the stability and accuracy of downstream tasks. We conduct extensive experiments on mainstream benchmarks, and the pre-trained UTSD outperforms existing foundation models on all data domains, exhibiting superior zero-shot generalization ability. After training from scratch, UTSD achieves comparable performance against domain-specific proprietary models. The empirical results validate the potential of UTSD as a time series foundation model.
https://arxiv.org/abs/2412.03068
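The conditional-generation step can be sketched as a standard classifier-free-guidance denoising loop, where a context embedding of the observed history guides the denoiser. The tiny denoiser, the crude Euler-style update, and all shapes below are placeholders, not the UTSD architecture or noise schedule.

```python
# Sketch of conditional sampling with classifier-free guidance for a
# time-series diffusion model: at each reverse step the denoiser is run
# with and without the context (history) embedding, and the two noise
# predictions are blended by a guidance scale.
import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    def __init__(self, horizon=24, ctx_dim=16):
        super().__init__()
        self.net = nn.Linear(horizon + ctx_dim + 1, horizon)

    def forward(self, x_t, t, context):
        t_feat = torch.full((x_t.shape[0], 1), float(t))
        return self.net(torch.cat([x_t, context, t_feat], dim=-1))  # predicts noise

@torch.no_grad()
def sample(denoiser, context, horizon=24, steps=50, guidance=3.0):
    x = torch.randn(context.shape[0], horizon)
    null_ctx = torch.zeros_like(context)              # "unconditional" context
    for t in reversed(range(steps)):
        eps_c = denoiser(x, t, context)
        eps_u = denoiser(x, t, null_ctx)
        eps = eps_u + guidance * (eps_c - eps_u)      # classifier-free guidance
        x = x - eps / steps                           # crude Euler-style update
    return x

ctx = torch.randn(2, 16)                              # context from the condition network
print(sample(ToyDenoiser(), ctx).shape)               # (2, 24)
```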
Accurate video prediction by deep neural networks, especially for dynamic regions, is a challenging task in computer vision for critical applications such as autonomous driving, remote working, and telemedicine. Due to inherent uncertainties, existing prediction models often struggle with the complexity of motion dynamics and occlusions. In this paper, we propose a novel stochastic long-term video prediction model that focuses on dynamic regions by employing a hybrid warping strategy. By integrating frames generated through forward and backward warpings, our approach effectively compensates for the weaknesses of each technique, improving the prediction accuracy and realism of moving regions in videos while also addressing uncertainty by making stochastic predictions that account for various motions. Furthermore, considering real-time predictions, we introduce a MobileNet-based lightweight architecture into our model. Our model, called SVPHW, achieves state-of-the-art performance on two benchmark datasets.
https://arxiv.org/abs/2412.03061
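The hybrid warping step, blending two warped frame candidates with a soft mask, might look roughly like the sketch below. True forward warping is a splatting operation; for brevity both candidates here are produced by grid_sample-based (backward) warping, and the flows and mask are random placeholders rather than network outputs.

```python
# Sketch of blending two warped frame candidates with a per-pixel mask, in
# the spirit of the hybrid warping strategy described above.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0).expand(b, -1, -1, -1)
    grid = grid + flow.permute(0, 2, 3, 1)
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1      # normalize to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(frame, grid, align_corners=True)

b, c, h, w = 1, 3, 64, 64
prev_frame, next_ref = torch.rand(b, c, h, w), torch.rand(b, c, h, w)
flow_a, flow_b = torch.randn(b, 2, h, w), torch.randn(b, 2, h, w)
mask = torch.sigmoid(torch.randn(b, 1, h, w))          # per-pixel blending weight

candidate_a = warp(prev_frame, flow_a)
candidate_b = warp(next_ref, flow_b)
prediction = mask * candidate_a + (1 - mask) * candidate_b
print(prediction.shape)                                 # (1, 3, 64, 64)
```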
Unsupervised 3D representation learning via mask-and-reconstruct training with differentiable rendering is promising for reducing the labeling burden of fusion 3D perception. However, previous works conduct pre-training for each modality separately because of the high GPU memory consumption. Consequently, the interaction between the two modalities (images and point clouds) is neglected during pre-training. In this paper, we explore joint unsupervised pre-training for fusion 3D perception via differentiable rendering and propose CLAP, short for Curvature sampLing and swApping Prototype assignment prediction. The contributions are three-fold. 1) To overcome the GPU memory consumption problem, we propose Curvature Sampling to sample the more informative points/pixels for pre-training. 2) We propose to use learnable prototypes to represent parts of the scenes in a common feature space and bring in the idea of swapping prototype assignment prediction to learn the interaction between the two modalities. 3) To further optimize the learnable prototypes, we propose an Expectation-Maximization training scheme to maximize the similarity between embeddings and prototypes, followed by a Gram Matrix Regularization Loss to avoid collapse. Experimental results on NuScenes show that CLAP achieves a 300% larger performance gain compared to the previous SOTA 3D pre-training method via differentiable rendering. Codes and models will be released.
https://arxiv.org/abs/2412.03059
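The swapped prototype assignment idea, predicting the point-cloud branch's prototype assignment from the image branch and vice versa over shared learnable prototypes, can be sketched as below. The plain softmax assignments (rather than any balanced assignment or EM scheme) and the random embeddings are simplifying assumptions.

```python
# Sketch of swapped prototype assignment prediction between the image and
# point-cloud branches: each branch's soft assignment over shared learnable
# prototypes is used as the target for the other branch.
import torch
import torch.nn.functional as F

num_prototypes, dim = 32, 128
prototypes = torch.nn.Parameter(torch.randn(num_prototypes, dim))

def assignments(features, temperature=0.1):
    feats = F.normalize(features, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    return feats @ protos.t() / temperature            # logits over prototypes

img_feats = torch.randn(256, dim)                       # sampled pixel embeddings
pts_feats = torch.randn(256, dim)                       # corresponding point embeddings

img_logits, pts_logits = assignments(img_feats), assignments(pts_feats)
img_targets = img_logits.softmax(dim=-1).detach()       # "codes" from the image branch
pts_targets = pts_logits.softmax(dim=-1).detach()

swap_loss = (F.cross_entropy(img_logits, pts_targets)   # image predicts point codes
             + F.cross_entropy(pts_logits, img_targets)) / 2
swap_loss.backward()
print(float(swap_loss), prototypes.grad.shape)
```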
Portfolio management is an important yet challenging task in AI for FinTech, which aims to allocate investors' budgets among different assets to balance the risk and return of an investment. In this study, we propose a general Multi-objectIve framework with controLLable rIsk for pOrtfolio maNagement (MILLION), which consists of two main phases, i.e., return-related maximization and risk control. Specifically, in the return-related maximization phase, we introduce two auxiliary objectives, i.e., return rate prediction and return rate ranking, combined with portfolio optimization to mitigate the overfitting problem and improve the generalization of the trained model to future markets. Subsequently, in the risk control phase, we propose two methods, i.e., portfolio interpolation and portfolio improvement, to achieve fine-grained risk control and fast risk adaptation to a user-specified risk level. For the portfolio interpolation method, we theoretically prove that the risk can be perfectly controlled if the to-be-set risk level is in a proper interval. In addition, we also show that the return rate of the adjusted portfolio after portfolio interpolation is no less than that of the min-variance optimization, as long as the model in the reward maximization phase is effective. Furthermore, the portfolio improvement method can achieve greater return rates while keeping the same risk level compared to portfolio interpolation. Extensive experiments are conducted on three real-world datasets. The results demonstrate the effectiveness and efficiency of the proposed framework.
https://arxiv.org/abs/2412.03038
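The portfolio interpolation step can be sketched as blending the learned portfolio with the minimum-variance portfolio and bisecting the blend coefficient until the portfolio variance matches a user-specified risk level. The covariance matrix, weights, and bisection routine below are toy illustrations, not the paper's data or closed-form solution.

```python
# Sketch of portfolio interpolation for risk control: interpolate between
# the learned portfolio and the minimum-variance portfolio, and bisect the
# interpolation coefficient until the portfolio variance hits the
# user-specified risk level.
import numpy as np

def min_variance_portfolio(cov):
    inv = np.linalg.inv(cov)
    ones = np.ones(cov.shape[0])
    w = inv @ ones
    return w / w.sum()

def risk(w, cov):
    return float(w @ cov @ w)

def interpolate_to_risk(w_model, cov, target_risk, iters=60):
    w_mv = min_variance_portfolio(cov)
    lo, hi = 0.0, 1.0                      # alpha = 0 -> pure min-variance portfolio
    for _ in range(iters):
        alpha = (lo + hi) / 2
        w = alpha * w_model + (1 - alpha) * w_mv
        if risk(w, cov) > target_risk:
            hi = alpha                     # too risky -> move toward min-variance
        else:
            lo = alpha
    return alpha * w_model + (1 - alpha) * w_mv

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
cov = A @ A.T / 5 + 0.1 * np.eye(5)        # toy positive-definite covariance
w_model = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
w = interpolate_to_risk(w_model, cov, target_risk=0.2)
print(w.round(3), risk(w, cov))
```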
Molecular Relational Learning (MRL) is a rapidly growing field that focuses on understanding the interaction dynamics between molecules, which is crucial for applications ranging from catalyst engineering to drug discovery. Despite recent progress, earlier MRL approaches are limited to using only the 2D topological structure of molecules, as obtaining the 3D interaction geometry remains prohibitively expensive. This paper introduces a novel 3D geometric pre-training strategy for MRL (3DMRL) that incorporates a 3D virtual interaction environment, overcoming the limitations of costly traditional quantum mechanical calculation methods. With the constructed 3D virtual interaction environment, 3DMRL trains a 2D MRL model to learn the overall 3D geometric information of molecular interactions through contrastive learning. Moreover, fine-grained interactions between molecules are learned through a force prediction loss, which is crucial for understanding the wide range of molecular interaction processes. Extensive experiments on various tasks using real-world datasets, including out-of-distribution and extrapolation scenarios, demonstrate the effectiveness of 3DMRL, showing up to a 24.93\% improvement in performance across 40 tasks.
https://arxiv.org/abs/2412.02957
Existing works typically treat spatial-temporal prediction as the task of learning a function $F$ to transform historical observations to future observations. We further decompose this cross-time transformation into three processes: (1) Encoding ($E$): learning the intrinsic representation of observations, (2) Cross-Time Mapping ($M$): transforming past representations into future representations, and (3) Decoding ($D$): reconstructing future observations from the future representations. From this perspective, spatial-temporal prediction can be viewed as learning $F = E \cdot M \cdot D$, which includes learning the space transformations $\left\{{E},{D}\right\}$ between the observation space and the hidden representation space, as well as the spatial-temporal mapping $M$ from past states to future states within the representation space. This leads to two key questions: \textbf{Q1: What kind of representation space allows for mapping the past to the future? Q2: How can the past be mapped to the future within the representation space?} To address Q1, we propose a Spatial-Temporal Backdoor Adjustment strategy, which learns a Spatial-Temporal De-Confounded (STDC) representation space and estimates the de-confounding causal effect of historical data on future data. The captured causal relationship serves as the foundation for subsequent spatial-temporal mapping. To address Q2, we design a Spatial-Temporal Embedding (STE) that fuses the information of temporal and spatial confounders, capturing the intrinsic spatial-temporal characteristics of the representations. Additionally, we introduce a Cross-Time Attention mechanism, which queries the attention between the future and the past to guide spatial-temporal mapping.
https://arxiv.org/abs/2412.02942
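The encode-map-decode decomposition with a cross-time attention mapping might be organized roughly as below. The module sizes, the learned future query tokens, the single attention layer, and the omitted spatial-temporal embedding and backdoor adjustment are simplifying assumptions, not the paper's architecture.

```python
# Sketch of the E / M / D decomposition: an encoder E maps past
# observations to representations, a cross-time attention module M lets
# learned future query tokens attend to the past representations, and a
# decoder D reconstructs future observations.
import torch
import torch.nn as nn

class CrossTimePredictor(nn.Module):
    def __init__(self, num_nodes, future_len, dim=64):
        super().__init__()
        self.encoder = nn.Linear(num_nodes, dim)                 # E
        self.future_queries = nn.Parameter(torch.randn(future_len, dim))
        self.cross_time = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.decoder = nn.Linear(dim, num_nodes)                 # D

    def forward(self, past):                                     # past: (B, T_past, N)
        h_past = self.encoder(past)                              # (B, T_past, dim)
        q = self.future_queries.expand(past.shape[0], -1, -1)    # (B, T_future, dim)
        h_future, _ = self.cross_time(q, h_past, h_past)         # M: future attends to past
        return self.decoder(h_future)                            # (B, T_future, N)

model = CrossTimePredictor(num_nodes=207, future_len=12)
print(model(torch.randn(8, 12, 207)).shape)                      # (8, 12, 207)
```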
For decades, corporations and governments have relied on scanned documents to record vast amounts of information. However, extracting this information is a slow and tedious process due to the overwhelming amount of documents. The rise of vision language models presents a way to efficiently and accurately extract the information out of these documents. The current automated workflow often requires a two-step approach involving the extraction of information using optical character recognition software, and subsequent usage of large language models for processing this information. Unfortunately, these methods encounter significant challenges when dealing with noisy scanned documents. The high information density of such documents often necessitates using computationally expensive language models to effectively reduce noise. In this study, we propose PatchFinder, an algorithm that builds upon Vision Language Models (VLMs) to address the information extraction task. First, we devise a confidence-based score, called Patch Confidence, based on the Maximum Softmax Probability of the VLMs' output to measure the model's confidence in its predictions. Then, PatchFinder utilizes that score to determine a suitable patch size, partition the input document into overlapping patches of that size, and generate confidence-based predictions for the target information. Our experimental results show that PatchFinder can leverage Phi-3v, a 4.2 billion parameter vision language model, to achieve an accuracy of 94% on our dataset of 190 noisy scanned documents, surpassing the performance of ChatGPT-4o by 18.5 percentage points.
https://arxiv.org/abs/2412.02886
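The patch-confidence idea, a maximum-softmax-probability-style score over a patch's predicted tokens used to compare predictions, can be sketched as follows. The abstract describes using the score both to pick a patch size and to rank predictions; the sketch below only shows ranking overlapping patches by confidence, and the patch splitting and the dummy scoring function stand in for real VLM calls.

```python
# Sketch of PatchFinder-style confidence-based selection: split a scanned
# page into overlapping patches, score each candidate answer by the mean
# max-softmax probability of its predicted tokens, and keep the most
# confident prediction. dummy_vlm() is a stand-in for real VLM inference.
import numpy as np

def overlapping_patches(image, patch, overlap=0.5):
    h, w = image.shape[:2]
    step = max(1, int(patch * (1 - overlap)))
    for y in range(0, max(1, h - patch + 1), step):
        for x in range(0, max(1, w - patch + 1), step):
            yield image[y:y + patch, x:x + patch]

def patch_confidence(token_logits):
    """Mean of per-token max softmax probabilities for the generated answer."""
    probs = np.exp(token_logits - token_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return float(probs.max(axis=-1).mean())

def dummy_vlm(patch_img):
    """Stand-in for a VLM call: returns (answer, per-token logits)."""
    rng = np.random.default_rng(int(patch_img.sum()) % 2**32)
    return "1957-03-14", rng.normal(size=(10, 5000))

page = np.random.rand(2048, 1536)
candidates = []
for p in overlapping_patches(page, patch=768):
    answer, logits = dummy_vlm(p)
    candidates.append((patch_confidence(logits), answer))

best_conf, best_answer = max(candidates)
print(best_answer, round(best_conf, 3))
```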
We introduce a causal modeling framework that captures the input-output behavior of predictive models (e.g., machine learning models) by representing it using causal graphs. The framework enables us to define and identify features that directly cause the predictions, which has broad implications for data collection and model evaluation. We show two assumptions under which the direct causes can be discovered from data, one of which further simplifies the discovery process. In addition to providing sound and complete algorithms, we propose an optimization technique based on an independence rule that can be integrated with the algorithms to speed up the discovery process both theoretically and empirically.
https://arxiv.org/abs/2412.02878