Decision-making, a complicated task requiring various types of abilities, presents an excellent framework for assessing Large Language Models (LLMs). Our research investigates LLMs' decision-making capabilities through the lens of a well-established field, Game Theory. We focus specifically on games that support the simultaneous participation of more than two agents. We then introduce our framework, GAMA-Bench, comprising eight classical multi-agent games, together with a scoring scheme to quantitatively assess a model's performance in these games. Through GAMA-Bench, we investigate LLMs' robustness, generalizability, and enhancement strategies. Results reveal that while GPT-3.5 shows satisfactory robustness, its generalizability is relatively limited; however, its performance can be improved through approaches such as Chain-of-Thought. Additionally, we evaluate various LLMs and find that GPT-4 outperforms the other models on GAMA-Bench, achieving a score of 72.5. Moreover, the steadily increasing scores across the three iterations of GPT-3.5 (0613, 1106, 0125) demonstrate marked advancements in the model's intelligence with each update. The code and experimental results are made publicly available via this https URL.
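As a concrete illustration, the sketch below (all names and constants hypothetical) scores a player in "Guess 2/3 of the Average," a classic multi-agent game of the kind GAMA-Bench covers; the benchmark's actual games, prompts, and scoring scheme are defined in the paper.

```python
import random

def play_round(llm_guess: float, n_opponents: int = 9) -> float:
    """One round: the LLM's guess joins simulated opponents; returns the
    distance from the target (2/3 of the group mean), lower is better."""
    guesses = [llm_guess] + [random.uniform(0, 100) for _ in range(n_opponents)]
    target = 2 / 3 * sum(guesses) / len(guesses)
    return abs(llm_guess - target)

def score(llm_guesses: list[float]) -> float:
    """Map the average per-round distance onto a 0-100 scale."""
    avg_dist = sum(play_round(g) for g in llm_guesses) / len(llm_guesses)
    return max(0.0, 100.0 * (1.0 - avg_dist / 100.0))

print(score([random.uniform(0, 50) for _ in range(20)]))
```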
https://arxiv.org/abs/2403.11807
We describe an approach for aligning an LLM-based dialogue agent based on global (i.e., dialogue-level) rewards, while also taking into account naturally occurring multimodal signals. At a high level, our approach (dubbed GELI) learns a local, turn-level reward model by decomposing the human-provided Global Explicit (GE) session-level reward, using Local Implicit (LI) multimodal reward signals to crossmodally shape the reward decomposition step. This decomposed reward model is then used as part of the standard RLHF pipeline to improve an LLM-based dialogue agent. We run quantitative and qualitative human studies to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
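A minimal sketch of the decomposition idea, assuming (hypothetically) that the Local Implicit signal is a per-turn listener-affect score: the Global Explicit session reward is spread across turns with weights shaped by that signal, so the turn-level rewards sum back to the session reward. GELI's actual decomposition model is learned, not a fixed rule like this.

```python
import numpy as np

def decompose_reward(global_reward: float, affect: np.ndarray) -> np.ndarray:
    """Distribute a session-level (GE) reward over turns, with weights shaped
    by a per-turn multimodal (LI) signal; softmax normalization guarantees the
    turn rewards sum to the session reward."""
    weights = np.exp(affect - affect.max())
    weights /= weights.sum()
    return global_reward * weights

# A 5-turn dialogue rated 4.0 overall, with hypothetical per-turn affect scores.
print(decompose_reward(4.0, np.array([0.1, 0.8, 0.3, 0.9, 0.5])))
```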
https://arxiv.org/abs/2403.11330
Large language models (LLMs) are pretrained to predict the next word; however, scaling them up requires significant computing resources. Numerous big tech companies and research institutes have developed multilingual LLMs (MLLMs) to meet current demands, while overlooking less-resourced languages (LRLs). This study proposes three strategies to enhance the performance of LRLs based on publicly available MLLMs. First, the MLLM vocabulary is expanded with LRL tokens to enhance expressiveness. Second, bilingual data are used for pretraining to align the high- and less-resourced languages. Third, a high-quality, small-scale instruction dataset is constructed and instruction tuning is performed to adapt the model to the LRL. The experiments employ the Llama2 model with Korean as the LRL, evaluated quantitatively against other developed LLMs across eight tasks. Furthermore, a qualitative assessment is performed based on human evaluation and GPT-4. Experimental results show that our proposed Bllossom model exhibits superior performance in qualitative analyses compared to previously proposed Korean monolingual models.
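For the first strategy, a common recipe with Hugging Face Transformers looks like the sketch below; the token list shown is a placeholder, and the paper's actual vocabulary-construction procedure is its own.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Add target-language (here Korean) tokens mined from corpora, then grow the
# embedding matrix so the new rows can be learned during continued pretraining.
new_tokens = ["안녕하세요", "감사합니다"]  # placeholder examples
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))
```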
https://arxiv.org/abs/2403.10882
Knowledge Measures (KMs) aim at quantifying the amount of knowledge/information that a knowledge base carries. Belief Change (BC), on the other hand, is the process of changing beliefs (in our case, via contraction, expansion, and revision) to take into account a new piece of knowledge, which may contradict the current belief. We propose a new quantitative BC framework based on KMs, defining belief change operators that try to minimise, from an information-theoretic point of view, the surprise that the changed belief carries. To this end, we introduce the principle of minimal surprise. In particular, our contributions are (i) a general information-theoretic approach to KMs, of which [1] is a special case; (ii) KM-based BC operators that satisfy the so-called AGM postulates; and (iii) a characterisation of any BC operator satisfying the AGM postulates as a KM-based BC operator, i.e., any such operator can be encoded within our quantitative BC framework. We also introduce quantitative measures that account for the information loss of contraction, the information gain of expansion, and the information change of revision. Finally, we take a succinct look at iterated revision, which deals with applying a sequence of revision operations within our framework, and illustrate how one may build from our KM-based contraction operator one that does not satisfy the (in)famous recovery postulate, using the so-called severe withdrawal model as an illustrative example.
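One illustrative way to formalize the principle of minimal surprise (an assumption for exposition, not the paper's exact definition): take the surprise of a knowledge base $K$ to be its information content $s(K) = -\log P(\mathrm{Mod}(K))$ under a probability distribution $P$ over interpretations, and let revision pick, among candidates entailing the new input $\varphi$, one whose surprise changes least:
\[
K * \varphi \;\in\; \operatorname*{arg\,min}_{K' \models \varphi} \bigl|\, s(K') - s(K) \,\bigr|.
\]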
https://arxiv.org/abs/2403.10502
The surge in black-box AI models has prompted the need to explain their internal mechanisms and justify their reliability, especially in high-stakes applications such as healthcare and autonomous driving. Due to the lack of a rigorous definition of explainable AI (XAI), a plethora of research on explainability, interpretability, and transparency has been developed to explain and analyze models from various perspectives. Consequently, with an exhaustive list of papers, it becomes challenging to gain a comprehensive overview of XAI research from all aspects. Considering the popularity of neural networks in AI research, we narrow our focus to a specific area of XAI research: gradient-based explanations, which can be directly adopted for neural network models. In this review, we systematically explore gradient-based explanation methods to date and introduce a novel taxonomy that categorizes them into four distinct classes. We then present the essence of the technical details in chronological order, underscoring the evolution of the algorithms. Next, we introduce both human and quantitative evaluations to measure algorithm performance. More importantly, we discuss the general challenges in XAI and the specific challenges of gradient-based explanations. We hope that this survey helps researchers understand state-of-the-art progress and its corresponding disadvantages, which could spark their interest in addressing these issues in future work.
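The simplest member of this family, vanilla gradient saliency, shows what "gradient-based" means in practice; the sketch below uses an untrained ResNet purely so the demo is self-contained.

```python
import torch
from torchvision import models

model = models.resnet18(weights=None).eval()  # untrained, for a self-contained demo
image = torch.rand(1, 3, 224, 224, requires_grad=True)

# Vanilla gradients: differentiate the top class score w.r.t. the input pixels,
# then reduce over channels to obtain a per-pixel attribution map.
logits = model(image)
logits[0, logits[0].argmax()].backward()
saliency = image.grad.abs().max(dim=1).values  # shape (1, 224, 224)
```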
https://arxiv.org/abs/2403.10415
In scientific research and its application, scientific literature analysis is crucial as it allows researchers to build on the work of others. However, the fast growth of scientific knowledge has led to a massive increase in scholarly articles, making in-depth literature analysis increasingly challenging and time-consuming. The emergence of Large Language Models (LLMs) has offered a new way to address this challenge. Known for their strong abilities in summarizing texts, LLMs are seen as a potential tool to improve the analysis of scientific literature. However, existing LLMs have their own limits. Scientific literature often includes a wide range of multimodal elements, such as molecular structure, tables, and charts, which are hard for text-focused LLMs to understand and analyze. This issue points to the urgent need for new solutions that can fully understand and analyze multimodal content in scientific literature. To answer this demand, we present Uni-SMART (Universal Science Multimodal Analysis and Research Transformer), an innovative model designed for in-depth understanding of multimodal scientific literature. Through rigorous quantitative evaluation across several domains, Uni-SMART demonstrates superior performance over leading text-focused LLMs. Furthermore, our exploration extends to practical applications, including patent infringement detection and nuanced analysis of charts. These applications not only highlight Uni-SMART's adaptability but also its potential to revolutionize how we interact with scientific literature.
https://arxiv.org/abs/2403.10301
Reconstructing detailed 3D objects from single-view images remains a challenging task due to the limited information available. In this paper, we introduce FDGaussian, a novel two-stage framework for single-image 3D reconstruction. Recent methods typically utilize pre-trained 2D diffusion models to generate plausible novel views from the input image, yet they encounter issues with either multi-view inconsistency or a lack of geometric fidelity. To overcome these challenges, we propose an orthogonal plane decomposition mechanism to extract 3D geometric features from the 2D input, enabling the generation of consistent multi-view images. Moreover, we further accelerate the state-of-the-art Gaussian Splatting by incorporating epipolar attention to fuse images from different viewpoints. We demonstrate, both qualitatively and quantitatively, that FDGaussian generates images with high consistency across different views and reconstructs high-quality 3D objects. More examples can be found at our website this https URL.
https://arxiv.org/abs/2403.10242
In recent research, significant attention has been devoted to the open-vocabulary object detection task, which aims to generalize beyond the limited number of classes labeled during training and detect objects described by arbitrary category names at inference. Compared with conventional object detection, open-vocabulary object detection largely extends the object detection categories. However, it relies on calculating the similarity between image regions and a set of arbitrary category names with a pretrained vision-and-language model. This implies that, despite its open-set nature, the task still needs predefined object categories during the inference stage. This raises the question: what if we do not have exact knowledge of object categories during inference? In this paper, we call this new setting generative open-ended object detection, which is a more general and practical problem. To address it, we formulate object detection as a generative problem and propose a simple framework named GenerateU, which can detect dense objects and generate their names in a free-form way. In particular, we employ Deformable DETR as a region proposal generator, with a language model translating visual regions into object names. To assess the free-form object detection task, we introduce an evaluation method designed to quantitatively measure the performance of generative outcomes. Extensive experiments demonstrate the strong zero-shot detection performance of our GenerateU. For example, on the LVIS dataset, GenerateU achieves results comparable to the open-vocabulary object detection method GLIP, even though the category names are not seen by GenerateU during inference. Code is available at this https URL.
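One plausible reading of such an evaluation (a sketch under our own assumptions, with a stand-in encoder): free-form generated names are embedded and mapped to the nearest category of a fixed taxonomy such as LVIS, after which standard detection metrics apply. The paper's exact protocol may differ.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in text encoder; in practice a CLIP-style text embedding."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def map_to_taxonomy(generated: str, categories: list[str]) -> str:
    """Assign a free-form name to its nearest category by cosine similarity."""
    g = embed(generated)
    return categories[int(np.argmax([g @ embed(c) for c in categories]))]

print(map_to_taxonomy("tabby cat", ["cat", "dog", "bird"]))
```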
https://arxiv.org/abs/2403.10191
Diffusion-based image editing is a composite process of preserving the source image content and generating new content or applying modifications. While current editing approaches have made improvements under text guidance, most of them have focused only on preserving the information of the input image, disregarding the importance of editability and alignment to the target prompt. In this paper, we prioritize editability by proposing a zero-shot image editing method, named \textbf{E}nhance \textbf{E}ditability for text-based image \textbf{E}diting via \textbf{E}fficient \textbf{C}LIP guidance (\textbf{E4C}), which requires only inference-stage optimization to explicitly enhance editability and text alignment. Specifically, we develop a unified dual-branch feature-sharing pipeline that enables the preservation of the structure or texture of the source image while allowing the other to be adapted based on the editing task. We further integrate CLIP guidance into our pipeline, utilizing our novel random-gateway optimization mechanism to efficiently enhance semantic alignment with the target prompt. Comprehensive quantitative and qualitative experiments demonstrate that our method effectively resolves the text alignment issues prevalent in existing methods while maintaining fidelity to the source image, and performs well across a wide range of editing tasks.
https://arxiv.org/abs/2403.10133
The ability to model the underlying dynamics of visual scenes and reason about the future is central to human intelligence. Many attempts have been made to empower intelligent systems with such physical understanding and prediction abilities. However, most existing methods focus on pixel-to-pixel prediction, which suffers from heavy computational costs while lacking a deep understanding of the physical dynamics behind videos. Recently, object-centric prediction methods have emerged and attracted increasing interest. Inspired by this, we propose an unsupervised object-centric prediction model that makes future predictions by learning the visual dynamics between objects. Our model consists of two modules: a perceptual module and a dynamic module. The perceptual module decomposes images into several objects and synthesizes images from a set of object-centric representations. The dynamic module fuses contextual information, takes environment-object and object-object interactions into account, and predicts the future trajectories of objects. Extensive experiments are conducted to validate the effectiveness of the proposed method. Both quantitative and qualitative results demonstrate that our model generates predictions of higher visual quality and physical reliability than state-of-the-art methods.
https://arxiv.org/abs/2403.10079
The pretraining-finetuning paradigm has gained widespread adoption in vision tasks and other fields, yet it faces the significant challenge of high sample annotation costs. To mitigate this, the concept of active finetuning has emerged, aiming to select the most appropriate samples for model finetuning within a limited budget. Traditional active learning methods often struggle in this setting due to their inherent bias in batch selection. Furthermore, the recent active finetuning approach has primarily concentrated on aligning the distribution of selected subsets with the overall data pool, focusing solely on diversity. In this paper, we propose a Bi-Level Active Finetuning framework to select the samples for annotation in one shot, which includes two stages: core sample selection for diversity, and boundary sample selection for uncertainty. The process begins with the identification of pseudo-class centers, followed by an innovative denoising method and an iterative strategy for boundary sample selection in the high-dimensional feature space, all without relying on ground-truth labels. Our comprehensive experiments provide both qualitative and quantitative evidence of our method's efficacy, outperforming all the existing baselines.
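A skeleton of the two-stage selection, assuming (hypothetically) that pseudo-class centers come from k-means in feature space; the paper's denoising method and iterative boundary strategy are omitted here.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_samples(feats: np.ndarray, n_core: int = 10, n_boundary: int = 40):
    """Stage 1: one core sample nearest each pseudo-class center (diversity).
    Stage 2: samples farthest from their center as boundary picks (uncertainty)."""
    km = KMeans(n_clusters=n_core, n_init=10).fit(feats)
    dist = np.linalg.norm(feats - km.cluster_centers_[km.labels_], axis=1)
    core = [np.where(km.labels_ == k)[0][np.argmin(dist[km.labels_ == k])]
            for k in range(n_core)]
    boundary = np.argsort(-dist)[:n_boundary]
    return np.unique(np.concatenate([core, boundary]))

picks = select_samples(np.random.randn(1000, 128))
```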
https://arxiv.org/abs/2403.10069
In this work, we propose a novel method to supervise 3D Gaussian Splatting (3DGS) scenes using optical tactile sensors. Optical tactile sensors have become widely used in robotics for manipulation and object representation; however, raw optical tactile sensor data is unsuitable for directly supervising a 3DGS scene. Our representation leverages a Gaussian Process Implicit Surface to implicitly represent the object, combining many touches into a unified representation with uncertainty. We merge this model with a monocular depth estimation network, which is aligned in a two-stage process: coarsely aligning with a depth camera and then finely adjusting to match our touch data. For every training image, our method produces a corresponding fused depth and uncertainty map. Utilizing this additional information, we propose a new loss function, the variance-weighted depth supervised loss, for training the 3DGS scene model. We leverage the DenseTact optical tactile sensor and a RealSense RGB-D camera to show that combining touch and vision in this manner leads to quantitatively and qualitatively better results than vision or touch alone in few-view scene synthesis, on opaque as well as reflective and transparent objects. Please see our project page at this http URL.
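The natural form of such a loss, assuming inverse-variance weighting (the paper's exact weighting may differ): depth errors are penalized strongly where the fused touch-and-vision estimate is confident and down-weighted where it is uncertain.

```python
import torch

def var_weighted_depth_loss(pred: torch.Tensor, fused: torch.Tensor,
                            var: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Inverse-variance-weighted depth supervision for the 3DGS scene model."""
    return torch.mean((pred - fused) ** 2 / (var + eps))
```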
https://arxiv.org/abs/2403.09875
In today's technologically driven world, the rapid spread of fake news, particularly during critical events like elections, poses a growing threat to the integrity of information. To tackle this challenge head-on, we introduce FakeWatch, a comprehensive framework carefully designed to detect fake news. Leveraging a newly curated dataset of North American election-related news articles, we construct robust classification models. Our framework integrates a model hub comprising both traditional machine learning (ML) techniques and cutting-edge Language Models (LMs) to discern fake news effectively. Our overarching objective is to provide the research community with adaptable and precise classification models adept at identifying the ever-evolving landscape of misinformation. Quantitative evaluations of fake news classifiers on our dataset reveal that, while state-of-the-art LMs exhibit a slight edge over traditional ML models, classical models remain competitive due to their balance of accuracy and computational efficiency. Additionally, qualitative analyses shed light on patterns within fake news articles. This research lays the groundwork for future efforts aimed at combating misinformation, particularly concerning electoral processes. We make our labeled data and models publicly available for use and reproducibility.
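A classical baseline of the sort such a model hub would contain, shown with toy examples purely for illustration; the paper's actual features, models, and data are its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
texts = ["Officials certified the results this morning.",    # toy "real"
         "BREAKING: ballots found in a lake, media silent!"]  # toy "fake"
clf.fit(texts, [0, 1])
print(clf.predict(["Candidate X conceded the race on Tuesday."]))
```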
https://arxiv.org/abs/2403.09858
At the core of portrait photography is the search for ideal lighting and viewpoint. The process often requires advanced knowledge in photography and an elaborate studio setup. In this work, we propose Holo-Relighting, a volumetric relighting method capable of synthesizing novel viewpoints and novel lighting from a single image. Holo-Relighting leverages the pretrained 3D GAN (EG3D) to reconstruct geometry and appearance from an input portrait as a set of 3D-aware features. We design a relighting module conditioned on a given lighting to process these features and predict a relit 3D representation in the form of a tri-plane, which can be rendered from an arbitrary viewpoint through volume rendering. Besides viewpoint and lighting control, Holo-Relighting also takes the head pose as a condition to enable head-pose-dependent lighting effects. With these novel designs, Holo-Relighting can generate complex non-Lambertian lighting effects (e.g., specular highlights and cast shadows) without using any explicit physical lighting priors. We train Holo-Relighting with data captured with a light stage, and propose two data-rendering techniques to improve the data quality for training the volumetric relighting system. Through quantitative and qualitative experiments, we demonstrate that Holo-Relighting achieves state-of-the-art relighting quality with better photorealism, 3D consistency, and controllability.
https://arxiv.org/abs/2403.09632
There is general agreement that some form of regulation is necessary both for AI creators to be incentivised to develop trustworthy systems, and for users to actually trust those systems. But there is much debate about what form these regulations should take and how they should be implemented. Most work in this area has been qualitative, and has not been able to make formal predictions. Here, we propose that evolutionary game theory can be used to quantitatively model the dilemmas faced by users, AI creators, and regulators, and provide insights into the possible effects of different regulatory regimes. We show that creating trustworthy AI and user trust requires regulators to be incentivised to regulate effectively. We demonstrate the effectiveness of two mechanisms that can achieve this. The first is where governments can recognise and reward regulators that do a good job. In that case, if the AI system is not too risky for users then some level of trustworthy development and user trust evolves. We then consider an alternative solution, where users can condition their trust decision on the effectiveness of the regulators. This leads to effective regulation, and consequently the development of trustworthy AI and user trust, provided that the cost of implementing regulations is not too high. Our findings highlight the importance of considering the effect of different regulatory regimes from an evolutionary game theoretic perspective.
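The workhorse of such models is the replicator equation: writing $x_i$ for the fraction of a population playing strategy $i$ (e.g., "trust" for users, "develop trustworthy AI" for creators, "regulate effectively" for regulators) and $f_i(\mathbf{x})$ for its payoff, strategies earning above-average payoff grow:
\[
\dot{x}_i = x_i \Bigl( f_i(\mathbf{x}) - \sum_j x_j f_j(\mathbf{x}) \Bigr).
\]
The paper's specific payoff structures and regulatory mechanisms are its own; this equation only fixes the dynamics within which they are studied.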
https://arxiv.org/abs/2403.09510
Reconstructing a sequence of sharp images from a blurry input is crucial for enhancing our insight into the captured scene, and it poses a significant challenge due to the limited temporal features embedded in a single image. Spike cameras, sampling at rates up to 40,000 Hz, have proven effective in capturing motion features and beneficial for solving this ill-posed problem. Nonetheless, existing methods fall into the supervised learning paradigm, which suffers from notable performance degradation when applied to real-world scenarios that diverge from the synthetic training data domain. Moreover, the quality of reconstructed images is capped by that of the training images generated via motion-analysis interpolation, which inherently differ from the actual scene, limiting the generalization ability of these methods in real high-speed scenarios. To address these challenges, we propose the first self-supervised framework for spike-guided motion deblurring. Our approach begins with the formulation of a spike-guided deblurring model that explores the theoretical relationships among spike streams, blurry images, and their corresponding sharp sequences. We subsequently develop a self-supervised cascaded framework to alleviate the issues of spike noise and spatial-resolution mismatch encountered in the deblurring model. With knowledge distillation and a re-blurring loss, we further design a lightweight deblurring network to generate high-quality sequences with brightness and texture consistent with the original input. Quantitative and qualitative experiments conducted on our real-world and synthetic datasets with spikes validate the superior generalization of the proposed framework. Our code, data and trained models will be available at \url{this https URL}.
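An illustrative reading of the re-blurring loss (an assumption on our part, not the paper's exact formulation): a blurry frame is approximately the temporal average of the latent sharp sequence over the exposure, so re-blurring the predicted sequence should reproduce the input, yielding supervision without sharp ground truth.

```python
import torch

def reblur_loss(sharp_seq: torch.Tensor, blurry: torch.Tensor) -> torch.Tensor:
    """Re-blur by temporal averaging; sharp_seq has shape (T, C, H, W)."""
    return torch.mean(torch.abs(sharp_seq.mean(dim=0) - blurry))
```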
https://arxiv.org/abs/2403.09486
Diffusion models have achieved remarkable success in the domain of text-guided image generation and, more recently, in text-guided image editing. A commonly adopted strategy for editing real images involves inverting the diffusion process to obtain a noisy representation of the original image, which is then denoised to achieve the desired edits. However, current methods for diffusion inversion often struggle to produce edits that are both faithful to the specified text prompt and closely resemble the source image. To overcome these limitations, we introduce a novel and adaptable diffusion inversion technique for real image editing, which is grounded in a theoretical analysis of the role of $\eta$ in the DDIM sampling equation for enhanced editability. By designing a universal diffusion inversion method with a time- and region-dependent $\eta$ function, we enable flexible control over the editing extent. Through a comprehensive series of quantitative and qualitative assessments, involving a comparison with a broad array of recent methods, we demonstrate the superiority of our approach. Our method not only sets a new benchmark in the field but also significantly outperforms existing strategies. Our code is available at this https URL
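For reference, the DDIM sampling step (Song et al.) with its $\eta$-controlled noise scale is
\[
x_{t-1} = \sqrt{\bar\alpha_{t-1}} \left( \frac{x_t - \sqrt{1-\bar\alpha_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}} \right) + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\; \epsilon_\theta(x_t, t) + \sigma_t z_t, \qquad \sigma_t = \eta \sqrt{\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}} \sqrt{1 - \frac{\bar\alpha_t}{\bar\alpha_{t-1}}},
\]
where $\eta = 0$ recovers deterministic DDIM and $\eta = 1$ DDPM-like stochastic sampling; the method above generalizes the constant $\eta$ to a time- and region-dependent function.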
https://arxiv.org/abs/2403.09468
Segmentation of brain structures on MRI is the primary step for further quantitative analysis of brain diseases. Manual segmentation is still considered the gold standard in terms of accuracy; however, such data is extremely time-consuming to generate. This paper presents a deep learning-based segmentation approach for 12 deep-brain structures, utilizing multiple region-based U-Nets. The brain is divided into three focal regions of interest that encompass the brainstem, the ventricular system, and the striatum. Next, three region-based U-Nets are run in parallel to parcellate these larger structures into their respective four substructures. This approach not only greatly reduces the training and processing times but also significantly enhances the segmentation accuracy compared to segmenting the entire MRI image at once. Our approach achieves remarkable accuracy, with an average Dice Similarity Coefficient (DSC) of 0.901 and a 95% Hausdorff Distance (HD95) of 1.155 mm. The method was compared with state-of-the-art segmentation approaches, demonstrating the high accuracy and robustness of the proposed method.
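For reference, the Dice Similarity Coefficient between a predicted segmentation $A$ and ground truth $B$ is
\[
\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|},
\]
and HD95 is the 95th percentile of the distances between the two segmentation boundaries, a variant of the Hausdorff distance that is robust to outlier points.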
https://arxiv.org/abs/2403.09414
3D Gaussian splatting (3DGS) has recently demonstrated impressive capabilities in real-time novel view synthesis and 3D reconstruction. However, 3DGS heavily depends on the accurate initialization derived from Structure-from-Motion (SfM) methods. When trained with randomly initialized point clouds, 3DGS fails to maintain its ability to produce high-quality images, undergoing large performance drops of 4-5 dB in PSNR. Through extensive analysis of SfM initialization in the frequency domain and analysis of a 1D regression task with multiple 1D Gaussians, we propose a novel optimization strategy dubbed RAIN-GS (Relaxing Accurate Initialization Constraint for 3D Gaussian Splatting), that successfully trains 3D Gaussians from random point clouds. We show the effectiveness of our strategy through quantitative and qualitative comparisons on multiple datasets, largely improving the performance in all settings. Our project page and code can be found at this https URL.
https://arxiv.org/abs/2403.09413
Large-scale generative models have demonstrated an impressive capacity for producing visually compelling images, with increasing applications in medical imaging. However, they continue to grapple with the challenges of image hallucination and the generation of anatomically inaccurate outputs. These limitations are mainly due to the sole reliance on textual inputs and the lack of spatial control over the generated images, hindering the potential usefulness of such models in real-life settings. We present XReal, a novel controllable diffusion model for generating realistic chest X-ray images through precise anatomy and pathology location control. Our lightweight method can seamlessly integrate spatial control into a pre-trained text-to-image diffusion model without fine-tuning, retaining its existing knowledge while enhancing its generation capabilities. XReal outperforms state-of-the-art X-ray diffusion models in quantitative and qualitative metrics, showing gains of 13% and 10% in anatomy and pathology realism, respectively, based on expert radiologist evaluation. Our model holds promise for advancing generative models in medical imaging, offering greater precision and adaptability while inviting further exploration in this evolving field. A large synthetically generated dataset with annotations, along with our code, is publicly available at this https URL.
https://arxiv.org/abs/2403.09240