Creating specialized large language models requires vast amounts of clean, special-purpose data for training and fine-tuning. Because only a handful of large-scale, domain-specific datasets exist, most applications require creating new datasets, which in turn requires developing new application-specific filters for web-scale data. Filtering with a high-performance, general-purpose LLM such as GPT-4o can be highly effective, but it is extremely expensive at web scale. This paper proposes SIEVE, a lightweight alternative that matches GPT-4o accuracy at a fraction of the cost: SIEVE can perform up to 500 filtering operations for the cost of one GPT-4o filtering call. The key to SIEVE is a seamless integration of GPT-4o and lightweight T5 models, using active learning to fine-tune T5 in the background with a small number of calls to GPT-4o. Once trained, it performs as well as GPT-4o at a tiny fraction of the cost. We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks targeting high-quality and domain-specific content. Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training at a substantially lower cost (1%) than existing techniques. Further experiments show that SIEVE and GPT-4o achieve similar accuracy, with human evaluators preferring SIEVE's filtering results to those of GPT-4o.
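A minimal sketch of the kind of active-learning distillation loop the abstract describes — the expensive teacher (a GPT-4o-style filter) is queried only on documents where the lightweight student (a T5-style classifier) is uncertain. All names, interfaces, and thresholds here are illustrative assumptions, not the authors' code:

```python
from typing import Callable, List, Tuple

def active_distill(
    pool: List[str],
    teacher: Callable[[str], bool],   # expensive GPT-4o-style yes/no filter call
    student,                          # lightweight T5-style classifier: fit(), predict_proba()
    budget: int = 1000,               # total teacher calls allowed
    batch: int = 50,
) -> List[Tuple[str, bool]]:
    labeled: List[Tuple[str, bool]] = []
    while budget > 0 and pool:
        # Rank pool docs by the student's uncertainty (probability nearest 0.5).
        probs = [student.predict_proba(d) for d in pool]
        order = sorted(range(len(pool)), key=lambda i: abs(probs[i] - 0.5))
        picked = order[: min(batch, budget)]
        # Spend teacher budget only on the most uncertain documents.
        for i in picked:
            labeled.append((pool[i], teacher(pool[i])))
        budget -= len(picked)
        keep = set(range(len(pool))) - set(picked)
        pool = [pool[i] for i in sorted(keep)]
        student.fit(labeled)          # fine-tune the student in the background
    return labeled                    # the trained student then filters the rest cheaply
```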
https://arxiv.org/abs/2410.02755
Self-supervised stereo matching holds great promise for applications and research because it does not depend on expensive labeled data. However, direct self-supervised stereo matching paradigms based on photometric loss functions have consistently struggled with performance issues due to the occlusion challenge. The crux of the occlusion challenge lies in the fact that the positions of occluded pixels consistently align with the epipolar search direction defined by the input stereo images, leading to persistent information loss and erroneous feedback at fixed locations during self-supervised training. In this work, we propose a simple yet highly effective pseudo-stereo inputs strategy to address this core occlusion challenge. The strategy decouples the input and feedback images, compelling the network to probabilistically sample information from both sides of the occluding objects. As a result, the persistent lack of information in the aforementioned fixed occlusion areas is mitigated. Building upon this, we further address the feedback conflicts and overfitting issues arising from the strategy. By integrating these components, our method achieves stable and significant performance improvements over existing methods. Quantitative experiments are conducted to evaluate the performance, and qualitative experiments further demonstrate accurate disparity inference even in occluded regions. These results represent a significant advance over previous methods in direct self-supervised stereo matching based on photometric loss. Due to its simplicity and effectiveness, the proposed pseudo-stereo inputs strategy has the potential to serve as a new paradigm for direct self-supervised stereo matching. Code is available at this https URL.
https://arxiv.org/abs/2410.02534
In this paper, we introduce Plug-and-Play (PnP) Flow Matching, an algorithm for solving imaging inverse problems. PnP methods leverage the strength of pre-trained denoisers, often deep neural networks, by integrating them in optimization schemes. While they achieve state-of-the-art performance on various inverse problems in imaging, PnP approaches face inherent limitations on more generative tasks like inpainting. On the other hand, generative models such as Flow Matching pushed the boundary in image sampling yet lack a clear method for efficient use in image restoration. We propose to combine the PnP framework with Flow Matching (FM) by defining a time-dependent denoiser using a pre-trained FM model. Our algorithm alternates between gradient descent steps on the data-fidelity term, reprojections onto the learned FM path, and denoising. Notably, our method is computationally efficient and memory-friendly, as it avoids backpropagation through ODEs and trace computations. We evaluate its performance on denoising, super-resolution, deblurring, and inpainting tasks, demonstrating superior results compared to existing PnP algorithms and Flow Matching based state-of-the-art methods.
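A schematic of the alternating scheme as we read it from the abstract — a gradient step on the data-fidelity term, reprojection onto the learned FM path, then denoising with the time-dependent FM denoiser. The linear interpolation and time schedule are assumptions, not the released implementation:

```python
import torch

def pnp_flow_matching(y, A, At, fm_denoiser, n_steps=100, gamma=1.0):
    """y: observation; A / At: forward operator and its adjoint (callables);
    fm_denoiser(x_t, t): time-dependent denoiser built on a pre-trained FM model."""
    x = At(y)                                  # rough initial estimate
    for k in range(n_steps):
        t = (k + 1) / n_steps                  # FM time in (0, 1]
        # (1) gradient descent on the data-fidelity term ||A(x) - y||^2 / 2
        x = x - gamma * At(A(x) - y)
        # (2) reprojection onto the learned FM path (assumed linear interpolation
        #     between a fresh noise sample at t=0 and the current estimate at t=1)
        x_t = (1 - t) * torch.randn_like(x) + t * x
        # (3) denoising with the time-dependent FM denoiser
        x = fm_denoiser(x_t, t)
    return x
```

Note that, as the abstract emphasizes, nothing here backpropagates through an ODE solver or computes a trace — only forward denoiser evaluations are needed.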
https://arxiv.org/abs/2410.02423
This paper introduces a new hybrid descriptor for 3D point matching and point cloud registration, combining local geometrical properties with learning-based feature propagation to describe each point's neighborhood structure. The proposed architecture first extracts prior geometrical information by computing each point's planarity, anisotropy, and omnivariance using Principal Components Analysis (PCA). This prior information is complemented by a descriptor based on normal vectors estimated from a triangle-based neighborhood construction. The final geometrical descriptor is propagated between points using local graph convolutions and attention mechanisms. The new feature extractor is evaluated on ModelNet40, the Stanford Bunny dataset, KITTI, and MVP (Multi-View Partial)-RG for point cloud registration and shows promising results, particularly on noisy and low-overlap point clouds.
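The covariance-eigenvalue features named in the abstract have standard closed forms; a sketch, assuming the usual definitions from eigenvalues l1 ≥ l2 ≥ l3 of the neighborhood covariance (the paper's exact normalization may differ, and the neighbor search and triangle-based normal estimation are elided):

```python
import numpy as np

def pca_features(neighbors: np.ndarray) -> dict:
    """neighbors: (k, 3) coordinates of one point's k nearest neighbors."""
    cov = np.cov(neighbors.T)                      # 3x3 covariance of the neighborhood
    l1, l2, l3 = np.linalg.eigvalsh(cov)[::-1]     # eigenvalues sorted l1 >= l2 >= l3
    return {
        "planarity":    (l2 - l3) / l1,            # high on flat patches
        "anisotropy":   (l1 - l3) / l1,            # high on linear/directional structures
        "omnivariance": (l1 * l2 * l3) ** (1.0 / 3.0),  # volumetric spread
    }
```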
https://arxiv.org/abs/2410.02420
Accurate real-time tracking of dexterous hand movements and interactions has numerous applications in human-computer interaction, the metaverse, robotics, and tele-health. Capturing realistic hand movements is challenging because of the large number of articulations and degrees of freedom. Here, we report accurate and dynamic tracking of articulated hand and finger movements using stretchable, washable smart gloves with embedded helical sensor yarns and inertial measurement units. The sensor yarns have a high dynamic range, responding to strains as low as 0.005% and as high as 155%, and remain stable during extensive use and washing cycles. Using multi-stage machine learning, we report average joint-angle estimation root mean square errors of 1.21 and 1.45 degrees for intra- and inter-subject cross-validation, respectively, matching the accuracy of costly motion-capture cameras without their occlusion or field-of-view limitations. We report a data augmentation technique that enhances robustness to sensor noise and variation. We demonstrate accurate tracking of dexterous hand movements during object interactions, opening new avenues for applications including accurate typing on a mock paper keyboard, recognition of complex dynamic and static gestures adapted from American Sign Language, and object identification.
https://arxiv.org/abs/2410.02221
We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
https://arxiv.org/abs/2406.11262
We introduce Cadenza, a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas as well as for unconditional generation. To accomplish this we propose a novel MIDI encoding method, PerTok (Performance Tokenizer), that captures minute expressive details while reducing sequence length by up to 59% and vocabulary size by up to 95% for polyphonic, monophonic, and rhythmic tasks. The proposed framework comprises two sequential stages: 1) Composer and 2) Performer. The Composer model is a transformer-based Variational Autoencoder (VAE) with Rotary Positional Embeddings (RoPE) and an autoregressive decoder modified to more effectively integrate the latent codes of the input musical idea. The Performer model is a bidirectional transformer encoder trained separately to predict velocities and microtimings on MIDI sequences. Objective and human evaluations demonstrate Cadenza's versatile capability in 1) matching other unconditional state-of-the-art symbolic models in musical quality while sounding more expressive, and 2) composing new, expressive ideas that are stylistically related to the input while providing novel ideas to the user. Our framework is designed, researched, and implemented with the objective of ethically providing inspiration for musicians.
https://arxiv.org/abs/2410.02060
Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed "agglomerative models." We build upon this body of work by studying the effect of the teachers' activation statistics, particularly the impact of the loss function on the resulting student model quality. We explore a standard toolkit of statistical normalization techniques to better align the different distributions and assess their effects. Further, we examine the impact on downstream teacher-matching metrics, which motivates the use of Hadamard matrices. With these matrices, we demonstrate useful properties, showing how they can be used for isotropic standardization, where each dimension of a multivariate distribution is standardized using the same scale. We call this technique "PHI Standardization" (PHI-S) and empirically demonstrate that it produces the best student model across the suite of methods studied.
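A minimal sketch of isotropic standardization with a Hadamard rotation as described: rotate so that variance is spread evenly across dimensions, then apply one shared scale to every dimension. This is our illustrative reading, not the paper's implementation:

```python
import numpy as np
from scipy.linalg import hadamard

def phi_standardize(X: np.ndarray) -> np.ndarray:
    """X: (n, d) teacher activations, d a power of two (Hadamard requirement)."""
    d = X.shape[1]
    H = hadamard(d).astype(float) / np.sqrt(d)   # orthonormal Hadamard rotation
    Z = (X - X.mean(axis=0)) @ H.T               # center, then rotate
    return Z / Z.std()                           # one shared scale for all dimensions
```

The design point is that the rotation equalizes per-dimension variance, so a single scalar suffices to standardize every dimension — the "isotropic" property the abstract highlights.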
https://arxiv.org/abs/2410.01680
In Query-driven Travel Recommender Systems (RSs), it is crucial to understand the user intent behind challenging natural language (NL) destination queries such as the broadly worded "youth-friendly activities" or the indirect description "a high school graduation trip". Such queries are challenging due to the wide scope and subtlety of potential user intents that confound the ability of retrieval methods to infer relevant destinations from available textual descriptions such as WikiVoyage. While query reformulation (QR) has proven effective in enhancing retrieval by addressing user intent, existing QR methods tend to focus only on expanding the range of potentially matching query subtopics (breadth) or elaborating on the potential meaning of a query (depth), but not both. In this paper, we introduce Elaborative Subtopic Query Reformulation (EQR), a large language model-based QR method that combines both breadth and depth by generating potential query subtopics with information-rich elaborations. We also release TravelDest, a novel dataset for query-driven travel destination RSs. Experiments on TravelDest show that EQR achieves significant improvements in recall and precision over existing state-of-the-art QR methods.
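An illustrative EQR-style reformulation prompt combining breadth (enumerated subtopics) and depth (per-subtopic elaborations); the template wording and parameter names are assumptions, not the paper's exact prompt:

```python
# Hypothetical prompt template; `llm` is any text-completion callable.
EQR_PROMPT = """Query: "{query}"
List {n} distinct subtopics a traveler might mean by this query.
For each subtopic, write a 2-3 sentence elaboration describing the
destination attributes that would satisfy it."""

def reformulate(query: str, llm, n: int = 5) -> str:
    return llm(EQR_PROMPT.format(query=query, n=n))
```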
https://arxiv.org/abs/2410.01598
In-context learning (ICL) is an effective approach to help large language models (LLMs) adapt to various tasks by providing demonstrations of the target task. Considering the high cost of labeling demonstrations, many methods propose synthesizing demonstrations from scratch using LLMs. However, the quality of the demonstrations synthesized from scratch is limited by the capabilities and knowledge of LLMs. To address this, inspired by transfer learning, we propose In-Context Transfer Learning (ICTL), which synthesizes target task demonstrations by transferring labeled demonstrations from similar source tasks. ICTL consists of two steps: source sampling and target transfer. First, we define an optimization objective, which minimizes transfer error to sample source demonstrations similar to the target task. Then, we employ LLMs to transfer the sampled source demonstrations to the target task, matching the definition and format of the target task. Experiments on Super-NI show that ICTL outperforms synthesis from scratch by 2.0% on average, demonstrating the effectiveness of our method.
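A schematic of ICTL's two steps under our reading of the abstract — embedding similarity stands in for the paper's transfer-error minimization objective, and `embed`/`llm` are hypothetical callables:

```python
from typing import Callable, List
import numpy as np

def ictl(target_def: str, source_demos: List[str],
         embed: Callable[[str], np.ndarray], llm: Callable[[str], str],
         k: int = 8) -> List[str]:
    # Step 1 (source sampling): pick the source demos closest to the target
    # task definition in embedding space.
    t = embed(target_def)
    sims = [float(np.dot(embed(d), t)) for d in source_demos]
    picked = [source_demos[i] for i in np.argsort(sims)[-k:]]
    # Step 2 (target transfer): the LLM rewrites each sampled demo to match
    # the target task's definition and format.
    prompt = "Task definition:\n{}\n\nRewrite this demonstration for the task above:\n{}"
    return [llm(prompt.format(target_def, d)) for d in picked]
```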
https://arxiv.org/abs/2410.01548
Transporting between arbitrary distributions is a fundamental goal in generative modeling. Recently proposed diffusion bridge models provide a potential solution, but they rely on a joint distribution that is difficult to obtain in practice. Furthermore, formulations based on continuous domains limit their applicability to discrete domains such as graphs. To overcome these limitations, we propose Discrete Diffusion Schrödinger Bridge Matching (DDSBM), a novel framework that utilizes continuous-time Markov chains to solve the Schrödinger bridge (SB) problem in a high-dimensional discrete state space. Our approach extends Iterative Markovian Fitting to discrete domains, and we prove its convergence to the SB. Furthermore, we adapt our framework for graph transformation and show that our design choice of underlying dynamics characterized by independent modifications of nodes and edges can be interpreted as the entropy-regularized version of optimal transport with a cost function described by the graph edit distance. To demonstrate the effectiveness of our framework, we have applied DDSBM to molecular optimization in the field of chemistry. Experimental results demonstrate that DDSBM effectively optimizes molecules' property-of-interest with minimal graph transformation, successfully retaining other features.
https://arxiv.org/abs/2410.01500
Zero-shot voice conversion (VC) aims to transform the source speaker's timbre into an arbitrary unseen one without altering the original speech content. While recent advancements in zero-shot VC methods have shown remarkable progress, there still remains considerable potential for improvement in speaker similarity and speech naturalness. In this paper, we propose Takin-VC, a novel zero-shot VC framework based on jointly hybrid content and memory-augmented context-aware timbre modeling to tackle this challenge. Specifically, an effective hybrid content encoder, guided by neural codec training, that leverages quantized features from pre-trained WavLM and HybridFormer is first presented to extract the linguistic content of the source speech. Subsequently, we introduce an advanced cross-attention-based context-aware timbre modeling approach that learns the fine-grained, semantically associated target timbre features. To further enhance both speaker similarity and real-time performance, we utilize a conditional flow matching model to reconstruct the Mel-spectrogram of the source speech. Additionally, we advocate an efficient memory-augmented module designed to generate high-quality conditional target inputs for the flow matching process, thereby improving the overall performance of the proposed system. Experimental results demonstrate that the proposed Takin-VC method surpasses state-of-the-art zero-shot VC systems, delivering superior performance in terms of both speech naturalness and speaker similarity.
https://arxiv.org/abs/2410.01350
As speech becomes an increasingly common modality for interacting with large language models (LLMs), it is becoming desirable to develop systems where LLMs can take into account users' emotions or speaking styles when providing their responses. In this work, we study the potential of an LLM to understand these aspects of speech without fine-tuning its weights. To do this, we utilize an end-to-end system with a speech encoder; the encoder is trained to produce token embeddings such that the LLM's response to an expressive speech prompt is aligned with its response to a semantically matching text prompt where the speaker's emotion has also been specified. We find that this training framework allows the encoder to generate tokens that capture both semantic and paralinguistic information in speech and effectively convey it to the LLM, even when the LLM remains completely frozen. We also explore training on additional emotion and style-related response alignment tasks, finding that they further increase the amount of paralinguistic information explicitly captured in the speech tokens. Experiments demonstrate that our system is able to produce higher quality and more empathetic responses to expressive speech prompts compared to several baselines.
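A sketch of the response-alignment objective we infer from the abstract: the frozen LLM's output distribution on speech-encoder tokens is pulled toward its distribution on the semantically matching text prompt in which the emotion is spelled out. The Hugging Face-style `inputs_embeds` interface and matched response lengths are assumptions:

```python
import torch
import torch.nn.functional as F

def response_alignment_loss(llm, speech_embeds, text_embeds):
    """llm: frozen causal LM accepting inputs_embeds (HF-style, an assumption);
    speech_embeds: token embeddings from the trainable speech encoder;
    text_embeds: embeddings of the matching text prompt with the emotion stated."""
    with torch.no_grad():                         # the LLM and text branch stay frozen
        target = llm(inputs_embeds=text_embeds).logits
    pred = llm(inputs_embeds=speech_embeds).logits
    # Align the LLM's response distributions for the two prompts
    # (assumes matched response positions); gradients flow only to the encoder.
    return F.kl_div(pred.log_softmax(-1), target.softmax(-1), reduction="batchmean")
```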
https://arxiv.org/abs/2410.01162
Endoscopy is a crucial tool for diagnosing the gastrointestinal tract, but its effectiveness is often limited by a narrow field of view and the dynamic nature of the internal environment, especially in the esophagus, where complex and repetitive patterns make image stitching challenging. This paper introduces a novel automatic image unfolding and stitching framework tailored for esophageal videos captured during endoscopy. The method combines feature matching algorithms, including LoFTR, SIFT, and ORB, to create a feature filtering pool and employs a Density-Weighted Homography Optimization (DWHO) algorithm to enhance stitching accuracy. By merging consecutive frames, the framework generates a detailed panoramic view of the esophagus, enabling thorough and accurate visual analysis. Experimental results show the framework achieves low Root Mean Square Error (RMSE) and high Structural Similarity Index (SSIM) across extensive video sequences, demonstrating its potential for clinical use and improving the quality and continuity of endoscopic visual data.
https://arxiv.org/abs/2410.01148
Few-shot medical image segmentation (FSMIS) aims to learn from limited annotated data in the medical image analysis domain. Despite the progress achieved so far, current FSMIS models are all trained and deployed on the same data domain, which is inconsistent with the clinical reality that medical imaging data always spans different data domains (e.g., imaging modalities, institutions, and equipment sequences). How can FSMIS models be enhanced to generalize well across different specific medical imaging domains? In this paper, we focus on the matching mechanism of few-shot semantic segmentation models and introduce a domain-robust matching mechanism for the cross-domain scenario based on Earth Mover's Distance (EMD) calculation. Specifically, we formulate the EMD transportation process between foreground support and query features, and introduce a texture-structure-aware weight generation method, which performs Sobel-based image gradient calculation over the nodes, into the EMD matching flow to restrain domain-relevant nodes. In addition, a point-set-level distance measurement metric is introduced to calculate the cost of transportation from support set nodes to query set nodes. To evaluate the performance of our model, we conduct experiments on three scenarios (i.e., cross-modal, cross-sequence, and cross-institution) spanning eight medical datasets and three body regions, and the results demonstrate that our model achieves state-of-the-art (SoTA) performance against the compared models.
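A schematic of Sobel-weighted EMD matching consistent with the description above, using the POT library; exactly how the paper derives the per-node gradient weights may differ from this sketch:

```python
import numpy as np
import ot  # POT: Python Optimal Transport

def emd_matching_cost(support_feats, query_feats, support_grad, query_grad):
    """*_feats: (n, d) node features; *_grad: per-node Sobel gradient magnitudes
    (e.g. from scipy.ndimage.sobel on the image region behind each node)."""
    a = support_grad / support_grad.sum()     # texture-aware node weights
    b = query_grad / query_grad.sum()
    # Point-set cost: pairwise distance between support and query node features.
    M = np.linalg.norm(support_feats[:, None, :] - query_feats[None, :, :], axis=-1)
    plan = ot.emd(a, b, M)                    # optimal transport plan
    return float((plan * M).sum())            # EMD matching cost
```

Weighting nodes by gradient magnitude is one way to make texture structure, rather than domain-specific appearance, dominate the transport — the stated goal of restraining domain-relevant nodes.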
https://arxiv.org/abs/2410.01110
The reasoning steps generated by LLMs might be incomplete, as they mimic the logical leaps common in everyday communication found in their pre-training data: underlying rationales are frequently left implicit (unstated). To address this challenge, we introduce RATIONALYST, a model for process-supervision of reasoning based on pre-training on a vast collection of rationale annotations extracted from unlabeled data. We extract 79k rationales from a web-scale unlabelled dataset (the Pile) and a combination of reasoning datasets with minimal human intervention. This web-scale pre-training for reasoning allows RATIONALYST to consistently generalize across diverse reasoning tasks, including mathematical, commonsense, scientific, and logical reasoning. Fine-tuned from LLaMa-3-8B, RATIONALYST improves the accuracy of reasoning by an average of 3.9% on 7 representative reasoning benchmarks. It also demonstrates superior performance compared to significantly larger verifiers like GPT-4 and to similarly sized models fine-tuned on matching training sets.
https://arxiv.org/abs/2410.01044
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute for videos, given that videos often contain abundant detailed content. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims to unleash the long-description understanding capability of video CLIP models. Firstly, we establish an automatic data collection system and gather a large-scale VILD pre-training dataset of VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary Component Matching (TPCM) to better learn the distribution of the feature space while expanding long-description capability. We also introduce two new tasks, namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR), to further improve understanding. Finally, we construct a Long Video Description Ranking (LVDR) benchmark to evaluate long-description capability more comprehensively. Extensive experimental results on widely-used text-video retrieval benchmarks with both short and long descriptions, as well as on our LVDR benchmark, fully demonstrate the effectiveness of our method.
https://arxiv.org/abs/2410.00741
Soundscape appropriateness (SA) provides supplemental information on the degree of match between auditory information and the surrounding scene in soundscape perception. This indicator has been integrated into the standard ISO process for collecting soundscape data, forming a component of the sound quality assessment questionnaire. However, its role in soundscape quality assessment has not been fully understood. Herein, we present findings from soundscape data collected in Beiling Park in Shenyang, China. A method was developed that integrates mediation effect models with multiscale geographically weighted regression (MGWR) models to explore the mediating role of SA in the impact of sound source types on soundscape quality, as well as the spatial heterogeneity of this mediation effect. The results confirm that SA does mediate the influence of sound source types on acoustic comfort (AC). Specifically, natural sounds (indirect effect / total effect = 0.19 / 0.19), traffic sounds (indirect effect / total effect = -0.46 / -0.65), and commercial sounds (indirect effect / total effect = -0.25 / -0.12) affect the perception of AC by either enhancing or reducing SA. Moreover, the relationships among the variables in this model exhibit spatial heterogeneity, indicating that in urban open spaces with complex structures, local spatial models may be needed for soundscape assessment. The research reaffirms the significance of SA in urban open spaces. As a practical implication for urban and landscape planners, when sound sources cannot be controlled or altered, coordinating the sound with the surrounding environment through landscape optimisation can also improve soundscape quality by enhancing SA and help achieve the goal of creating healthy urban open spaces.
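For readers unfamiliar with the decomposition being reported, a toy mediation analysis on synthetic data shows how the indirect (source → SA → AC), direct, and total effects relate; all numbers and coefficients here are illustrative, not the paper's:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
source = rng.normal(size=n)                          # e.g. perceived traffic-sound level
sa = -0.6 * source + rng.normal(size=n)              # mediator: soundscape appropriateness
ac = 0.75 * sa - 0.2 * source + rng.normal(size=n)   # outcome: acoustic comfort

# Path a: source -> SA.
a = sm.OLS(sa, sm.add_constant(source)).fit().params[1]
# Paths b (SA -> AC) and c' (direct source -> AC), estimated jointly.
fit_b = sm.OLS(ac, sm.add_constant(np.column_stack([sa, source]))).fit()
b, direct = fit_b.params[1], fit_b.params[2]

indirect = a * b                                      # mediated effect through SA
print(f"indirect={indirect:.2f}, direct={direct:.2f}, total={indirect + direct:.2f}")
```

The MGWR step in the paper then lets these coefficients vary across space, which is what reveals the reported spatial heterogeneity of the mediation effect.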
https://arxiv.org/abs/2410.00667
Explainability in artificial intelligence is crucial for restoring trust, particularly in areas like face forgery detection, where viewers often struggle to distinguish between real and fabricated content. Vision and Large Language Models (VLLM) bridge computer vision and natural language, offering numerous applications driven by strong common-sense reasoning. Despite their success in various tasks, the potential of vision and language remains underexplored in face forgery detection, where they hold promise for enhancing explainability by leveraging the intrinsic reasoning capabilities of language to analyse fine-grained manipulation areas. As such, there is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task to systematically and fairly evaluate these capabilities. Previous efforts for unified benchmarks in deepfake detection have focused on the simpler binary task, overlooking evaluation protocols for fine-grained detection and text-generative models. We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap. In the first stage, we assess the models' performance on the binary task and their sensitivity to given instructions using several prompts. In the second stage, we delve deeper into fine-grained detection by identifying areas of manipulation in a multiple-choice VQA setting. In the third stage, we convert the fine-grained detection to an open-ended question and compare several matching strategies for the multi-label classification task. Finally, we qualitatively evaluate the fine-grained responses of the VLLMs included in the benchmark. We apply our benchmark to several popular models, providing a detailed comparison of binary, multiple-choice, and open-ended VQA evaluation across seven datasets. this https URL
https://arxiv.org/abs/2410.00485
Understanding local risks from extreme rainfall, such as flooding, requires both long records (to sample rare events) and high-resolution products (to assess localized hazards). Unfortunately, there is a dearth of long-record and high-resolution products that can be used to understand local risk and precipitation science. In this paper, we present a novel generative diffusion model that downscales (super-resolves) globally available Climate Prediction Center (CPC) gauge-based precipitation products and ERA5 reanalysis data to generate kilometer-scale precipitation estimates. Downscaling gauge-based precipitation from 55 km to 1 km while recovering extreme rainfall signals poses significant challenges. To enforce our model (named WassDiff) to produce well-calibrated precipitation intensity values, we introduce a Wasserstein Distance Regularization (WDR) term for the score-matching training objective in the diffusion denoising process. We show that WDR greatly enhances the model's ability to capture extreme values compared to diffusion without WDR. Extensive evaluation shows that WassDiff has better reconstruction accuracy and bias scores than conventional score-based diffusion models. Case studies of extreme weather phenomena, like tropical storms and cold fronts, demonstrate WassDiff's ability to produce appropriate spatial patterns while capturing extremes. Such downscaling capability enables the generation of extensive km-scale precipitation datasets from existing historical global gauge records and current gauge measurements in areas without high-resolution radar.
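A schematic of a WDR-augmented score-matching objective consistent with the description above; the 1D sorted-sample Wasserstein penalty and the Tweedie-style denoising step are our illustrative assumptions, not the paper's exact formulation:

```python
import torch

def wdr_loss(score_net, x0, sigma, lam=0.1):
    """score_net(x_t, sigma): score model; x0: batch of target km-scale precipitation."""
    noise = torch.randn_like(x0)
    xt = x0 + sigma * noise
    score = score_net(xt, sigma)
    # Denoising score matching: the score should approximate -noise / sigma.
    dsm = ((sigma * score + noise) ** 2).mean()
    # Wasserstein regularization: empirical 1D W1 between denoised and target
    # intensity distributions (sorted-sample form), which penalizes the model
    # for under-representing extreme rainfall values.
    x_hat = xt + sigma ** 2 * score              # Tweedie-style denoised estimate
    w1 = (torch.sort(x_hat.flatten()).values
          - torch.sort(x0.flatten()).values).abs().mean()
    return dsm + lam * w1
```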
https://arxiv.org/abs/2410.00381