The rapid development of generative AI is a double-edged sword, which not only facilitates content creation but also makes image manipulation easier and more difficult to detect. Although current image forgery detection and localization (IFDL) methods are generally effective, they tend to face two challenges: 1) a black-box nature with an unknown detection principle, and 2) limited generalization across diverse tampering methods (e.g., Photoshop, DeepFake, AIGC-Editing). To address these issues, we propose the explainable IFDL task and design FakeShield, a multi-modal framework capable of evaluating image authenticity, generating tampered region masks, and providing a judgment basis based on pixel-level and image-level tampering clues. Additionally, we leverage GPT-4o to enhance existing IFDL datasets, creating the Multi-Modal Tamper Description dataSet (MMTD-Set) for training FakeShield's tampering analysis capabilities. Meanwhile, we incorporate a Domain Tag-guided Explainable Forgery Detection Module (DTE-FDM) and a Multi-modal Forgery Localization Module (MFLM) to address various types of tamper detection interpretation and achieve forgery localization guided by detailed textual descriptions. Extensive experiments demonstrate that FakeShield effectively detects and localizes various tampering techniques, offering an explainable and superior solution compared to previous IFDL methods.
https://arxiv.org/abs/2410.02761
Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
https://arxiv.org/abs/2410.02707
We present a dataset of 19th-century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token-level (BERT) and character-level (CANINE) contextual language models. We find indications that the "dialect effect" produced by intentional orthographic variation employs multiple linguistic channels, and that these channels can be surfaced to varying degrees given particular language modelling assumptions. Specifically, we find evidence that the choice of tokenization scheme meaningfully impacts the type of orthographic information a model is able to surface.
https://arxiv.org/abs/2410.02674
Cellular automata have become a cornerstone for investigating emergence and self-organization across diverse scientific disciplines, spanning neuroscience, artificial life, and theoretical physics. However, the absence of a hardware-accelerated cellular automata library limits the exploration of new research directions, hinders collaboration, and impedes reproducibility. In this work, we introduce CAX (Cellular Automata Accelerated in JAX), a high-performance and flexible open-source library designed to accelerate cellular automata research. CAX offers cutting-edge performance and a modular design through a user-friendly interface, and can support both discrete and continuous cellular automata with any number of dimensions. We demonstrate CAX's performance and flexibility through a wide range of benchmarks and applications. From classic models like elementary cellular automata and Conway's Game of Life to advanced applications such as growing neural cellular automata and self-classifying MNIST digits, CAX speeds up simulations by up to 2,000 times. Furthermore, we demonstrate CAX's potential to accelerate research by presenting a collection of three novel cellular automata experiments, each implemented in just a few lines of code thanks to the library's modular architecture. Notably, we show that a simple one-dimensional cellular automaton can outperform GPT-4 on the 1D-ARC challenge.
https://arxiv.org/abs/2410.02651
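For readers unfamiliar with the "elementary cellular automata" named in the CAX abstract above, the following is a minimal pure-Python sketch of the classic update rule. It does not use CAX or JAX and makes no claim about the library's API; the function names `step` and `run` and the wrap-around boundary are assumptions made for illustration.

```python
def step(cells, rule):
    """Apply one update of an elementary cellular automaton (wrap-around edges)."""
    n = len(cells)
    nxt = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        idx = (left << 2) | (center << 1) | right  # neighbourhood as a 3-bit number
        nxt.append((rule >> idx) & 1)              # read that bit of the rule number
    return nxt


def run(cells, rule, steps):
    """Return the initial row followed by `steps` successive updates."""
    history = [list(cells)]
    for _ in range(steps):
        history.append(step(history[-1], rule))
    return history


if __name__ == "__main__":
    width = 31
    start = [0] * width
    start[width // 2] = 1  # a single live cell in the middle
    for row in run(start, rule=90, steps=15):
        print("".join("#" if c else "." for c in row))
```

Each cell looks at itself and its two neighbours, so a rule is fully described by 8 output bits, which is why the rule is passed as an integer between 0 and 255.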
Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately 47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available.
https://arxiv.org/abs/2410.02611
Writing compelling fiction is a multifaceted process combining elements such as crafting a plot, developing interesting characters, and using evocative language. While large language models (LLMs) show promise for story writing, they currently rely heavily on intricate prompting, which limits their use. We propose Agents' Room, a generation framework inspired by narrative theory, that decomposes narrative writing into subtasks tackled by specialized agents. To illustrate our method, we introduce Tell Me A Story, a high-quality dataset of complex writing prompts and human-written stories, and a novel evaluation framework designed specifically for assessing long narratives. We show that Agents' Room generates stories that are preferred by expert evaluators over those produced by baseline systems by leveraging collaboration and specialization to decompose the complex story writing task into tractable components. We provide extensive analysis with automated and human-based metrics of the generated output.
https://arxiv.org/abs/2410.02603
Recently, 3D Gaussian Splatting (3DGS) has excelled in novel view synthesis with its real-time rendering capabilities and superior quality. However, it faces challenges for high-resolution novel view synthesis (HRNVS) due to the coarse nature of primitives derived from low-resolution input views. To address this issue, we propose Super-Resolution 3DGS (SuperGS), an expansion of 3DGS designed with a two-stage coarse-to-fine training framework that uses a pretrained low-resolution scene representation as the initialization for super-resolution optimization. Moreover, we introduce Multi-resolution Feature Gaussian Splatting (MFGS) to incorporate a latent feature field for flexible feature sampling, and Gradient-guided Selective Splitting (GSS) for effective Gaussian upsampling. Integrating these strategies within the coarse-to-fine framework ensures both high fidelity and memory efficiency. Extensive experiments demonstrate that SuperGS surpasses state-of-the-art HRNVS methods on challenging real-world datasets using only low-resolution inputs.
https://arxiv.org/abs/2410.02571
SDO-FM is a foundation model using data from NASA's Solar Dynamics Observatory (SDO) spacecraft, integrating three separate instruments to encapsulate the Sun's complex physical interactions into a multi-modal embedding space. This model can be used to streamline scientific investigations involving SDO by making the enormous datasets more computationally accessible for heliophysics research, and it enables investigations that require instrument fusion. We discuss four key components: an ingestion pipeline to create machine-learning-ready datasets, the model architecture and training approach, the resultant embeddings and fine-tunable models, and finally downstream fine-tuned applications. A key component of this effort has been to include subject matter specialists at each stage of development, reviewing the scientific value and providing guidance for model architecture, dataset, and training paradigm decisions. This paper marks the release of our pretrained models and embedding datasets, available to the community on Hugging Face.
https://arxiv.org/abs/2410.02530
Medical image segmentation plays a crucial role in clinical diagnosis and treatment planning. Although models based on convolutional neural networks (CNNs) and Transformers have achieved remarkable success in medical image segmentation tasks, they still face challenges such as high computational complexity and the loss of local features when capturing long-range dependencies. To address these limitations, we propose Med-TTT, a visual backbone network integrated with Test-Time Training (TTT) layers, which incorporates dynamic adjustment capabilities. Med-TTT introduces the Vision-TTT layer, which enables effective modeling of long-range dependencies with linear computational complexity and adaptive parameter adjustment during inference. Furthermore, we design a multi-resolution fusion mechanism to combine image features at different scales, facilitating the identification of subtle lesion characteristics in complex backgrounds. At the same time, we adopt a frequency-domain feature enhancement strategy based on high-pass filtering, which can better capture texture and fine-grained details in images. Experimental results demonstrate that Med-TTT significantly outperforms existing methods on multiple medical image datasets, exhibiting strong segmentation capabilities, particularly in complex image backgrounds. The model achieves leading performance in terms of accuracy, sensitivity, and Dice coefficient, providing an efficient and robust solution for the field of medical image segmentation. The code is publicly available.
https://arxiv.org/abs/2410.02523
Universal dexterous grasping across diverse objects presents a fundamental yet formidable challenge in robot learning. Existing approaches using reinforcement learning (RL) to develop policies on extensive object datasets face critical limitations, including complex curriculum design for multi-task learning and limited generalization to unseen objects. To overcome these challenges, we introduce ResDex, a novel approach that integrates residual policy learning with a mixture-of-experts (MoE) framework. ResDex is distinguished by its use of geometry-unaware base policies that are efficiently acquired on individual objects and capable of generalizing across a wide range of unseen objects. Our MoE framework incorporates several base policies to facilitate diverse grasping styles suitable for various objects. By learning residual actions alongside weights that combine these base policies, ResDex enables efficient multi-task RL for universal dexterous grasping. ResDex achieves state-of-the-art performance on the DexGraspNet dataset comprising 3,200 objects with an 88.8% success rate. It exhibits no generalization gap with unseen objects and demonstrates superior training efficiency, mastering all tasks within only 12 hours on a single GPU.
https://arxiv.org/abs/2410.02475
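The ResDex abstract above describes "learning residual actions alongside weights that combine these base policies". One way to read that sentence is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the softmax normalisation of the weights, the function names, and the flat list representation of actions are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_actions(base_actions, weight_logits, residual):
    """Weighted mixture of base-policy actions plus a learned residual.

    base_actions  : one action vector per base policy
    weight_logits : one unnormalised weight per base policy (assumed softmaxed)
    residual      : correction vector with the same dimension as an action
    """
    weights = softmax(weight_logits)
    dim = len(residual)
    mixed = [sum(w * a[j] for w, a in zip(weights, base_actions)) for j in range(dim)]
    return [m + r for m, r in zip(mixed, residual)]
```

In a full system the weight logits and the residual would be outputs of a trained network conditioned on the observation; here they are passed in directly so the combination step can be inspected in isolation.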
In this paper, we introduce Plug-and-Play (PnP) Flow Matching, an algorithm for solving imaging inverse problems. PnP methods leverage the strength of pre-trained denoisers, often deep neural networks, by integrating them in optimization schemes. While they achieve state-of-the-art performance on various inverse problems in imaging, PnP approaches face inherent limitations on more generative tasks like inpainting. On the other hand, generative models such as Flow Matching have pushed the boundary in image sampling yet lack a clear method for efficient use in image restoration. We propose to combine the PnP framework with Flow Matching (FM) by defining a time-dependent denoiser using a pre-trained FM model. Our algorithm alternates between gradient descent steps on the data-fidelity term, reprojections onto the learned FM path, and denoising. Notably, our method is computationally efficient and memory-friendly, as it avoids backpropagation through ODEs and trace computations. We evaluate its performance on denoising, super-resolution, deblurring, and inpainting tasks, demonstrating superior results compared to existing PnP algorithms and Flow Matching based state-of-the-art methods.
https://arxiv.org/abs/2410.02423
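The three alternating steps named in the PnP Flow Matching abstract above (gradient step on the data-fidelity term, reprojection onto the FM path, denoising) can be sketched on a toy problem. Everything specific here is an assumption made for illustration: the straight-line path x_t = (1 - t) x0 + t x1 with x0 ~ N(0, 1), a Gaussian "prior" N(MU, SIGMA^2) standing in for a pre-trained FM model, the quadratic fidelity term, and the decaying step size. With a Gaussian target the conditional mean E[x1 | x_t] has a closed form, which plays the role of the time-dependent denoiser.

```python
import random

MU, SIGMA = 2.0, 0.5  # hypothetical per-coordinate Gaussian prior N(MU, SIGMA^2)


def denoiser(x, t):
    """Closed-form E[x1 | x_t = x] for x_t = (1 - t) * x0 + t * x1,
    with x0 ~ N(0, 1) and x1 ~ N(MU, SIGMA^2)."""
    var = (1.0 - t) ** 2 + (t * SIGMA) ** 2  # variance of x_t
    gain = t * SIGMA ** 2 / var              # Cov(x1, x_t) / Var(x_t)
    return [MU + gain * (xi - t * MU) for xi in x]


def pnp_flow(y, steps=100, step_size=1.0, seed=0):
    """Toy PnP-style loop for the denoising problem y = x + noise."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in y]  # start from latent noise
    for k in range(steps):
        t = k / steps
        gamma = step_size * (1.0 - t)  # assumed decaying step size
        # 1) gradient step on the fidelity term F(x) = 0.5 * ||x - y||^2
        z = [xi - gamma * (xi - yi) for xi, yi in zip(x, y)]
        # 2) reprojection onto the straight-line path at time t
        z = [(1.0 - t) * rng.gauss(0.0, 1.0) + t * zi for zi in z]
        # 3) time-dependent denoising
        x = denoiser(z, t)
    return x
```

No derivative of the denoiser is ever taken in this loop, which is consistent with the abstract's remark that the method avoids backpropagation through ODEs; a real application would replace `denoiser` with one built from a trained velocity network.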
The coastal underwater evidence search system with surface-underwater collaboration is designed to revolutionize the search for artificial objects in coastal underwater environments, overcoming limitations associated with traditional methods such as divers and tethered remotely operated vehicles. Our innovative multi-robot collaborative system consists of three parts: an autonomous surface vehicle serving as a mission control center, a towed underwater vehicle for wide-area search, and a biomimetic underwater robot inspired by marine organisms for detailed inspections of identified areas. We conduct extensive simulations and real-world experiments in pond environments and coastal fields to demonstrate the system's potential to surpass the limitations of conventional underwater search methods, offering a robust and efficient solution for law enforcement and recovery operations in marine settings.
https://arxiv.org/abs/2410.02345
Urban environments face significant challenges due to climate change, including extreme heat, drought, and water scarcity, which impact public health, community well-being, and local economies. Effective management of these issues is crucial, particularly in areas like Sydney Olympic Park, which relies on one of Australia's largest irrigation systems. The Smart Irrigation Management for Parks and Cool Towns (SIMPaCT) project, initiated in 2021, leverages advanced technologies and machine learning models to optimize irrigation and induce physical cooling. This paper introduces two novel methods to enhance the efficiency of the SIMPaCT system's extensive sensor network and applied machine learning models. The first method employs clustering of sensor time series data using K-shape and K-means algorithms to estimate readings from missing sensors, ensuring continuous and reliable data. This approach can detect anomalies, correct data sources, and identify and remove redundant sensors to reduce maintenance costs. The second method involves sequential data collection from different sensor locations using robotic systems, significantly reducing the need for high numbers of stationary sensors. Together, these methods aim to maintain accurate soil moisture predictions while optimizing sensor deployment and reducing maintenance costs, thereby enhancing the efficiency and effectiveness of the smart irrigation system. Our evaluations demonstrate significant improvements in the efficiency and cost-effectiveness of soil moisture monitoring networks. The cluster-based replacement of missing sensors provides up to 5.4% decrease in average error. The sequential sensor data collection as a robotic emulation shows 17.2% and 2.1% decrease in average error for circular and linear paths respectively.
https://arxiv.org/abs/2410.02335
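The first method in the SIMPaCT abstract above, estimating a missing sensor from the sensors it clusters with, can be sketched as follows. This is a simplified illustration, not the project's pipeline: it uses plain k-means with Euclidean distance only (K-shape is not implemented), initialises centroids from the first k series, and the names `estimate_missing`, `history`, and `current` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_series(group):
    """Point-wise mean of a list of equal-length series."""
    return [sum(vals) / len(vals) for vals in zip(*group)]


def kmeans(series, k, iters=20):
    """Plain k-means; centroids start at the first k series (deterministic)."""
    centroids = [list(s) for s in series[:k]]
    labels = [0] * len(series)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(s, centroids[c])) for s in series]
        for c in range(k):
            members = [s for s, lab in zip(series, labels) if lab == c]
            if members:
                centroids[c] = mean_series(members)
    return labels


def estimate_missing(history, current, missing, k=2):
    """Estimate the current reading of `missing` as the mean current reading
    of the sensors that share its cluster in the historical data."""
    names = sorted(history)
    labels = kmeans([history[n] for n in names], k)
    cluster = labels[names.index(missing)]
    peers = [n for n, lab in zip(names, labels)
             if lab == cluster and n != missing and n in current]
    if not peers:
        raise ValueError("no peer sensor with a current reading in this cluster")
    return sum(current[p] for p in peers) / len(peers)
```

The same cluster assignments could also flag redundant sensors (clusters with many near-identical members), which is the cost-saving use the abstract mentions.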
Phrases are fundamental linguistic units through which humans convey semantics. This study critically examines the capacity of API-based large language models (LLMs) to comprehend phrase semantics, utilizing three human-annotated datasets. We assess the performance of LLMs in executing phrase semantic reasoning tasks guided by natural language instructions and explore the impact of common prompting techniques, including few-shot demonstrations and Chain-of-Thought reasoning. Our findings reveal that LLMs greatly outperform traditional embedding methods across the datasets; however, they do not show a significant advantage over fine-tuned methods. The effectiveness of advanced prompting strategies varies. We conduct detailed error analyses to interpret the limitations faced by LLMs in comprehending phrase semantics. Code and data are publicly available.
https://arxiv.org/abs/2410.02308
Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing. Nonetheless, the potential of using visual prompts for emotion recognition in these models remains largely unexplored and untapped. Traditional methods in VLLMs struggle with spatial localization and often discard valuable global context. To address this problem, we propose a Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face count and emotion categorization while preserving the enriched image context. Through a battery of experimentation and analysis of recent commercial or open-source VLLMs, we evaluate the SoV model's ability to comprehend facial expressions in natural environments. Our findings demonstrate the effectiveness of integrating spatial visual prompts into VLLMs for improving emotion recognition performance.
https://arxiv.org/abs/2410.02244
Detecting 3D keypoints with semantic consistency is widely used in many scenarios such as pose estimation, shape registration, and robotics. Currently, most unsupervised 3D keypoint detection methods focus on rigid-body objects. However, when faced with deformable objects, the keypoints they identify do not preserve semantic consistency well. In this paper, we introduce Key-Grid, an innovative unsupervised keypoint detector for both rigid-body and deformable objects, built as an autoencoder framework. The encoder predicts keypoints and the decoder utilizes the generated keypoints to reconstruct the objects. Unlike previous work, we leverage the identified keypoint information to form a 3D grid feature heatmap called the grid heatmap, which is used in the decoder section. The grid heatmap is a novel concept that represents the latent variables for grid points sampled uniformly in the 3D cubic space, where these variables are the shortest distances between the grid points and the skeleton connected by keypoint pairs. Meanwhile, we incorporate the information from each layer of the encoder into the decoder section. We conduct an extensive evaluation of Key-Grid on a list of benchmark datasets. Key-Grid achieves state-of-the-art performance on the semantic consistency and position accuracy of keypoints. Moreover, we demonstrate the robustness of Key-Grid to noise and downsampling. In addition, we achieve SE(3) invariance of keypoints by generalizing Key-Grid to an SE(3)-invariant backbone.
https://arxiv.org/abs/2410.02237
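The Key-Grid abstract above defines the grid heatmap through the shortest distance between uniformly sampled grid points and the skeleton formed by keypoint pairs. The geometric core of that definition is a point-to-segment distance, sketched below. The cube extent of [-1, 1], the resolution, and the function names are assumptions; any further mapping from distances to heatmap values used in the paper is not reproduced here.

```python
import math


def point_segment_distance(p, a, b):
    """Shortest Euclidean distance from point p to the segment a-b in 3D."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    denom = sum(c * c for c in ab)
    # Projection parameter, clamped so the closest point stays on the segment.
    t = 0.0 if denom == 0 else max(0.0, min(1.0, sum(x * y for x, y in zip(ap, ab)) / denom))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.sqrt(sum((pi - ci) ** 2 for pi, ci in zip(p, closest)))


def grid_distances(keypoints, edges, resolution=4):
    """Map each point of a uniform grid in [-1, 1]^3 to its distance to the skeleton.

    keypoints : list of (x, y, z) tuples
    edges     : list of (i, j) index pairs defining skeleton segments
    """
    axis = [-1.0 + 2.0 * i / (resolution - 1) for i in range(resolution)]
    grid = {}
    for x in axis:
        for y in axis:
            for z in axis:
                p = (x, y, z)
                grid[p] = min(point_segment_distance(p, keypoints[i], keypoints[j])
                              for i, j in edges)
    return grid
```

Because the distances depend on the segments between keypoints rather than on the keypoints alone, the resulting field changes smoothly as a deformable object bends.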
With hundreds of thousands of language models available on Hugging Face today, efficiently evaluating and utilizing these models across various downstream tasks has become increasingly critical. Many existing methods repeatedly learn task-specific representations of Large Language Models (LLMs), which leads to inefficiencies in both time and computational resources. To address this, we propose EmbedLLM, a framework designed to learn compact vector representations of LLMs that facilitate downstream applications involving many models, such as model routing. We introduce an encoder-decoder approach for learning such embeddings, along with a systematic framework to evaluate their effectiveness. Empirical results show that EmbedLLM outperforms prior methods in model routing in both accuracy and latency. Additionally, we demonstrate that our method can forecast a model's performance on multiple benchmarks without incurring additional inference cost. Extensive probing experiments validate that the learned embeddings capture key model characteristics, e.g., whether the model is specialized for coding tasks, even without being explicitly trained on them. We open-source our dataset, code, and embedder to facilitate further research and application.
https://arxiv.org/abs/2410.02223
We introduce G2T-LLM, a novel approach for molecule generation that uses graph-to-tree text encoding to transform graph-based molecular structures into a hierarchical text format optimized for large language models (LLMs). This encoding converts complex molecular graphs into tree-structured formats, such as JSON and XML, which LLMs are particularly adept at processing due to their extensive pre-training on these types of data. By leveraging the flexibility of LLMs, our approach allows for intuitive interaction using natural language prompts, providing a more accessible interface for molecular design. Through supervised fine-tuning, G2T-LLM generates valid and coherent chemical structures, addressing common challenges like the invalid outputs seen in traditional graph-based methods. While LLMs are computationally intensive, they offer superior generalization and adaptability, enabling the generation of diverse molecular structures with minimal task-specific customization. The proposed approach achieves performance comparable to state-of-the-art methods on various benchmark molecular generation datasets, demonstrating its potential as a flexible and innovative tool for AI-driven molecular design.
https://arxiv.org/abs/2410.02198
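To make the graph-to-tree idea in the G2T-LLM abstract above concrete, here is one possible way to unfold a small molecular graph into a JSON-serialisable tree with a depth-first traversal. The schema (the keys `id`, `element`, `bonds`, `order`, `atom`, `ring_closure_to`) and the handling of rings by back-reference are assumptions for illustration and should not be read as the paper's actual encoding.

```python
import json


def graph_to_tree(atoms, bonds, root=0):
    """Unfold a molecular graph into a nested dict by depth-first traversal.

    atoms : list of element symbols, indexed by atom id
    bonds : list of (atom_a, atom_b, bond_order) tuples
    Bonds that would revisit an atom (rings) become back-references.
    """
    adjacency = {i: [] for i in range(len(atoms))}
    for a, b, order in bonds:
        adjacency[a].append((b, order))
        adjacency[b].append((a, order))
    visited = set()
    used = set()  # bonds already written into the tree

    def visit(i):
        visited.add(i)
        node = {"id": i, "element": atoms[i], "bonds": []}
        for j, order in adjacency[i]:
            key = frozenset((i, j))
            if key in used:
                continue
            used.add(key)
            if j in visited:
                node["bonds"].append({"order": order, "ring_closure_to": j})
            else:
                node["bonds"].append({"order": order, "atom": visit(j)})
        return node

    return visit(root)


if __name__ == "__main__":
    # Ethanol heavy atoms: C-C-O with single bonds.
    print(json.dumps(graph_to_tree(["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]), indent=2))
```

Every bond appears exactly once in the output, so the original graph can be rebuilt from the tree, which is the property a generation pipeline would need when decoding model output back into a molecule.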
Accurate interpretation of electrocardiogram (ECG) signals is pivotal for diagnosing cardiovascular diseases. Integrating ECG signals with their accompanying textual reports holds immense potential to enhance clinical diagnostics through the combination of physiological data and qualitative insights. However, this integration faces significant challenges due to inherent modality disparities and the scarcity of labeled data for robust cross-modal learning. To address these obstacles, we propose C-MELT, a novel framework that pre-trains on ECG and text data using a contrastive masked auto-encoder architecture. C-MELT uniquely combines generative strengths with enhanced discriminative capabilities to achieve robust cross-modal representations. This is accomplished through masked modality modeling, specialized loss functions, and an improved negative sampling strategy tailored for cross-modal alignment. Extensive experiments on five public datasets across diverse downstream tasks demonstrate that C-MELT significantly outperforms existing methods, achieving 15% and 2% increases in linear probing and zero-shot performance over state-of-the-art models, respectively. These results highlight the effectiveness of C-MELT, underscoring its potential to advance automated clinical diagnostics through multi-modal representations.
https://arxiv.org/abs/2410.02131
Recent years have seen many audio-domain text-to-music generation models that rely on large amounts of text-audio pairs for training. However, symbolic-domain controllable music generation has lagged behind partly due to the lack of a large-scale symbolic music dataset with extensive metadata and captions. In this work, we present MetaScore, a new dataset consisting of 963K musical scores paired with rich metadata, including free-form user-annotated tags, collected from an online music forum. To approach text-to-music generation, we leverage a pretrained large language model (LLM) to generate pseudo natural language captions from the metadata. With the LLM-enhanced MetaScore, we train a text-conditioned music generation model that learns to generate symbolic music from the pseudo captions, allowing control of instruments, genre, composer, complexity and other free-form music descriptors. In addition, we train a tag-conditioned system that supports a predefined set of tags available in MetaScore. Our experimental results show that both the proposed text-to-music and tags-to-music models outperform a baseline text-to-music model in a listening test, while the text-based system offers a more natural interface that allows free-form natural language prompts.
https://arxiv.org/abs/2410.02084