This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions outperforming the baseline on the test set: 10 of them submitted a report describing their approach, highlighting the widespread use of foundation models such as Depth Anything at the core of their methods. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.
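For reference, the 3D F-score reported here is the harmonic mean of precision and recall between predicted and ground-truth point clouds at a fixed distance threshold. A minimal numpy sketch of that metric (the threshold value and nearest-neighbour search are illustrative assumptions, not the challenge's exact evaluation code):

```python
import numpy as np
from scipy.spatial import cKDTree

def f_score(pred_pts, gt_pts, threshold=0.1):
    """Point-cloud F-score: harmonic mean of precision and recall
    at a fixed distance threshold (illustrative value)."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # nearest GT point per prediction
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # nearest prediction per GT point
    precision = np.mean(d_pred_to_gt < threshold)
    recall = np.mean(d_gt_to_pred < threshold)
    return 2 * precision * recall / (precision + recall + 1e-8)
```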
https://arxiv.org/abs/2404.16831
Physically realistic materials are pivotal in augmenting the realism of 3D assets across various applications and lighting conditions. However, existing 3D assets and generative models often lack authentic material properties. Manual assignment of materials using graphic software is a tedious and time-consuming task. In this paper, we exploit advancements in Multimodal Large Language Models (MLLMs), particularly GPT-4V, to present a novel approach, Make-it-Real: 1) We demonstrate that GPT-4V can effectively recognize and describe materials, allowing the construction of a detailed material library. 2) Utilizing a combination of visual cues and hierarchical text prompts, GPT-4V precisely identifies and aligns materials with the corresponding components of 3D objects. 3) The correctly matched materials are then meticulously applied as references for generating new SVBRDF materials according to the original diffuse maps, significantly enhancing their visual authenticity. Make-it-Real offers a streamlined integration into the 3D content creation workflow, showcasing its utility as an essential tool for developers of 3D assets.
https://arxiv.org/abs/2404.16829
Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with `time' serving as a supervisory signal since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a flexible transformer-based model for general-purpose ordering of image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple video settings covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves the state of the art on standard benchmarks for ordering a set of images.
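The ordering proxy task can be sketched as follows: shuffle a sequence of frame features, have a transformer encoder predict each frame's original temporal position, and use only time order as supervision. The architecture sizes and loss below are illustrative assumptions, not the paper's exact model:

```python
import torch
import torch.nn as nn

class SequenceOrderer(nn.Module):
    """Predicts the original temporal position of each frame in a shuffled sequence."""
    def __init__(self, feat_dim=512, max_len=16, num_layers=4, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(feat_dim, max_len)        # logits over candidate positions

    def forward(self, feats):                           # feats: (B, T, feat_dim), shuffled order
        return self.head(self.encoder(feats))           # (B, T, max_len)

# 'Time' is the supervisory signal: the target for each shuffled slot is the
# frame's original position in the chronological sequence.
model = SequenceOrderer()
feats = torch.randn(2, 8, 512)
perm = torch.stack([torch.randperm(8) for _ in range(2)])
shuffled = torch.gather(feats, 1, perm[..., None].expand(-1, -1, 512))
loss = nn.functional.cross_entropy(model(shuffled).reshape(-1, 16), perm.reshape(-1))
```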
https://arxiv.org/abs/2404.16828
With the advent of virtual reality technology, omnidirectional image (ODI) rescaling techniques are increasingly embraced for reducing transmitted and stored file sizes while preserving high image quality. Despite this progress, current ODI rescaling methods predominantly focus on enhancing the quality of images in equirectangular projection (ERP) format, which overlooks the fact that the content viewed on head-mounted displays (HMDs) is actually a rendered viewport instead of an ERP image. In this work, we emphasize that focusing solely on ERP quality results in inferior viewport visual experiences for users. Thus, we propose ResVR, the first comprehensive framework for the joint Rescaling and Viewport Rendering of ODIs. ResVR allows obtaining low-resolution (LR) ERP images for transmission while rendering high-quality viewports for users to watch on HMDs. In our ResVR, a novel discrete pixel sampling strategy is developed to tackle the complex mapping between the viewport and the ERP, enabling end-to-end training of the ResVR pipeline. Furthermore, a spherical pixel shape representation technique is derived from spherical differentiation to significantly improve the visual quality of rendered viewports. Extensive experiments demonstrate that ResVR outperforms existing methods in viewport rendering tasks across different fields of view, resolutions, and view directions while keeping a low transmission overhead.
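The viewport-vs-ERP distinction at the heart of ResVR can be illustrated with a standard perspective rendering of a viewport from an equirectangular image; the field of view, nearest-neighbour lookup, and rotation convention below are illustrative assumptions, not ResVR's learned sampling strategy:

```python
import numpy as np

def render_viewport(erp, fov_deg=90.0, yaw=0.0, pitch=0.0, out_hw=(256, 256)):
    """Sample a perspective viewport from an equirectangular (ERP) image."""
    H, W, _ = erp.shape
    h, w = out_hw
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2)              # pinhole focal length
    xs, ys = np.meshgrid(np.arange(w) - w / 2, np.arange(h) - h / 2)
    dirs = np.stack([xs, ys, np.full_like(xs, f, dtype=float)], -1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)        # unit viewing rays

    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])       # pitch about x
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])       # yaw about y
    dirs = dirs @ (Ry @ Rx).T

    lon = np.arctan2(dirs[..., 0], dirs[..., 2])                # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1, 1))               # [-pi/2, pi/2]
    u = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)           # ERP column
    v = ((lat / (np.pi / 2) + 1) / 2 * (H - 1)).astype(int)     # ERP row
    return erp[v, u]
```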
https://arxiv.org/abs/2404.16825
AI-generated video has revolutionized short video production, filmmaking, and personalized media, making video local editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To address this urgent issue, we propose V2A-Mark, which tackles the limitations of current video tampering forensics, such as poor generalizability, singular function, and single-modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, crucial for the sustainable development of video editing in the AIGC video era.
https://arxiv.org/abs/2404.16824
Aiming to replicate human-like dexterity, perceptual experiences, and motion patterns, we explore learning from human demonstrations using a bimanual system with multifingered hands and visuotactile data. Two significant challenges exist: the lack of an affordable and accessible teleoperation system suitable for a dual-arm setup with multifingered hands, and the scarcity of multifingered hand hardware equipped with touch sensing. To tackle the first challenge, we develop HATO, a low-cost hands-arms teleoperation system that leverages off-the-shelf electronics, complemented with a software suite that enables efficient data collection; the comprehensive software suite also supports multimodal data processing, scalable policy learning, and smooth policy deployment. To tackle the latter challenge, we introduce a novel hardware adaptation by repurposing two prosthetic hands equipped with touch sensors for research. Using visuotactile data collected from our system, we learn skills to complete long-horizon, high-precision tasks which are difficult to achieve without multifingered dexterity and touch feedback. Furthermore, we empirically investigate the effects of dataset size, sensing modality, and visual input preprocessing on policy learning. Our results mark a promising step forward in bimanual multifingered manipulation from visuotactile data. Videos, code, and datasets can be found at this https URL .
https://arxiv.org/abs/2404.16823
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and enabling it to be transferred to and reused in different LLMs. (2) Dynamic High-Resolution: we divide images into tiles of 448×448 pixels, ranging in number from 1 to 40 according to the aspect ratio and resolution of the input image, which supports input up to 4K resolution. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, and annotated it with English and Chinese question-answer pairs, significantly enhancing performance in OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks. Code has been released at this https URL.
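A rough sketch of the dynamic high-resolution tiling in point (2): pick the tile grid whose aspect ratio best matches the input under a 40-tile budget, resize, and split. The grid search and PIL-based resizing are illustrative assumptions rather than InternVL 1.5's exact preprocessing:

```python
from PIL import Image

TILE, MAX_TILES = 448, 40

def dynamic_tiles(img: Image.Image):
    """Split an image into up to MAX_TILES tiles of TILE x TILE pixels, choosing the
    (cols, rows) grid whose aspect ratio best matches the input image."""
    w, h = img.size
    cols, rows = min(
        ((c, r) for c in range(1, MAX_TILES + 1) for r in range(1, MAX_TILES + 1)
         if c * r <= MAX_TILES),
        key=lambda cr: abs(cr[0] / cr[1] - w / h),
    )
    resized = img.resize((cols * TILE, rows * TILE))
    return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
            for r in range(rows) for c in range(cols)]
```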
https://arxiv.org/abs/2404.16821
While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. Previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, but the quality of these components is not systematically measured. Human-rated prompt sets are generally small, and the reliability of the ratings -- and thereby of the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics on our new dataset, across different human templates, and on TIFA160.
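A QA-based auto-eval metric of this kind can be sketched as: decompose the prompt into yes/no questions with an LLM, ask a VQA model each question about the generated image, and average the scores. The callable interfaces below are illustrative assumptions, not the paper's implementation:

```python
from typing import Callable, List

def qa_alignment_score(
    prompt: str,
    image: object,
    generate_questions: Callable[[str], List[str]],    # e.g. an LLM that decomposes the prompt
    answer_yes_prob: Callable[[object, str], float],   # e.g. a VQA model's P("yes" | image, question)
) -> float:
    """Average probability that the image satisfies each prompt-derived question."""
    questions = generate_questions(prompt)
    if not questions:
        return 0.0
    return sum(answer_yes_prob(image, q) for q in questions) / len(questions)
```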
https://arxiv.org/abs/2404.16820
Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.
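The prototype-fitting step can be pictured as a simple alternating procedure: pool a feature vector per principal mask proposal, assign each mask to its nearest prototype, and update prototypes as the mean of assigned features. The hard cosine assignment and loop below are an illustrative sketch, not the exact stochastic PriMaPs-EM algorithm:

```python
import torch
import torch.nn.functional as F

def fit_prototypes(mask_feats: torch.Tensor, num_classes: int, iters: int = 10):
    """mask_feats: (N, D) L2-normalised feature vectors, one per mask proposal."""
    protos = mask_feats[torch.randperm(len(mask_feats))[:num_classes]].clone()
    for _ in range(iters):
        sim = mask_feats @ protos.T                      # (N, K) cosine similarities
        assign = sim.argmax(dim=1)                       # E-step: assign masks to prototypes
        for k in range(num_classes):                     # M-step: recompute prototypes
            sel = mask_feats[assign == k]
            if len(sel) > 0:
                protos[k] = F.normalize(sel.mean(dim=0), dim=0)
    return protos, assign
```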
https://arxiv.org/abs/2404.16818
As large language models (LLMs) see increasing adoption across the globe, it is imperative for LLMs to be representative of the linguistic diversity of the world. India is a linguistically diverse country of 1.4 billion people. To facilitate research on multilingual LLM evaluation, we release IndicGenBench - the largest benchmark for evaluating LLMs on user-facing generation tasks across a diverse set of 29 Indic languages covering 13 scripts and 4 language families. IndicGenBench is composed of diverse generation tasks like cross-lingual summarization, machine translation, and cross-lingual question answering. IndicGenBench extends existing benchmarks to many Indic languages through human curation, providing multi-way parallel evaluation data for many under-represented Indic languages for the first time. We evaluate a wide range of proprietary and open-source LLMs, including GPT-3.5, GPT-4, PaLM-2, mT5, Gemma, BLOOM, and LLaMA, on IndicGenBench in a variety of settings. The largest PaLM-2 model performs best on most tasks; however, there is a significant performance gap in all languages compared to English, showing that further research is needed for the development of more inclusive multilingual language models. IndicGenBench is released at this http URL
https://arxiv.org/abs/2404.16816
Addressing the challenges of rare diseases is difficult, especially with the limited number of reference images and a small patient population. This is more evident in rare skin diseases, where we encounter long-tailed data distributions that make it difficult to develop unbiased and broadly effective models. The diverse ways in which image datasets are gathered and their distinct purposes also add to these challenges. Our study conducts a detailed examination of the benefits and drawbacks of episodic and conventional training methodologies, adopting a few-shot learning approach alongside transfer learning. We evaluated our models using the ISIC2018, Derm7pt, and SD-198 datasets. With minimal labeled examples, our models showed substantial information gains and better performance compared to previously trained models. Our research emphasizes the improved ability to represent features in DenseNet121 and MobileNetV2 models, achieved by using models pre-trained on ImageNet to increase similarities within classes. Moreover, our experiments, ranging from 2-way to 5-way classifications with up to 10 examples, showed a growing success rate for traditional transfer learning methods as the number of examples increased. The addition of data augmentation techniques significantly improved the performance of our transfer-learning-based model, leading to higher performance than existing methods, especially on the SD-198 and ISIC2018 datasets. All source code related to this work will be made publicly available soon at the provided URL.
https://arxiv.org/abs/2404.16814
While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during the long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. Through applying this information-intensive training on Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B for utilizing long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves the performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining a comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). Github Link: this https URL.
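The synthesized training signal can be illustrated by placing a short information-carrying segment at a random depth inside a long filler context and pairing it with a question answerable only from that segment; the token estimate, filler source, and prompt format below are illustrative assumptions:

```python
import random

def build_in2_example(segment: str, question: str, answer: str,
                      filler_segments: list, target_tokens: int = 8000):
    """Place a short (~128-token) information segment at a random depth in a long context."""
    parts, token_count = [], 0
    while token_count < target_tokens and filler_segments:
        part = random.choice(filler_segments)
        parts.append(part)
        token_count += len(part.split())                      # rough whitespace token estimate
    parts.insert(random.randint(0, len(parts)), segment)      # crucial info can land anywhere
    prompt = "\n".join(parts) + f"\n\nQuestion: {question}\nAnswer:"
    return {"prompt": prompt, "target": answer}
```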
https://arxiv.org/abs/2404.16811
Generative Commonsense Reasoning (GCR) requires a model to reason about a situation using commonsense knowledge, while generating coherent sentences. Although the quality of the generated sentences is crucial, the diversity of the generation is equally important because it reflects the model's ability to use a range of commonsense knowledge facts. Large Language Models (LLMs) have shown proficiency in enhancing the generation quality across various tasks through in-context learning (ICL) using given examples without the need for any fine-tuning. However, the diversity aspect in LLM outputs has not been systematically studied before. To address this, we propose a simple method that diversifies the LLM generations, while preserving their quality. Experimental results on three benchmark GCR datasets show that our method achieves an ideal balance between the quality and diversity. Moreover, the sentences generated by our proposed method can be used as training data to improve diversity in existing commonsense generators.
https://arxiv.org/abs/2404.16807
Recent advances in large pre-trained vision-language models have demonstrated remarkable performance on zero-shot downstream tasks. Building upon this, recent studies, such as CoOp and CoCoOp, have proposed the use of prompt learning, where context within a prompt is replaced with learnable vectors, leading to significant improvements over manually crafted prompts. However, the performance improvement for unseen classes is still marginal, and to tackle this problem, data augmentation has been frequently used in traditional zero-shot learning techniques. Through our experiments, we have identified important issues in CoOp and CoCoOp: the context learned through traditional image augmentation is biased toward seen classes, negatively impacting generalization to unseen classes. To address this problem, we propose adversarial token embedding to disentangle low-level visual augmentation features from high-level class information when inducing bias in learnable prompts. Through our novel mechanism called "Adding Attributes to Prompt Learning", AAPL, we guide the learnable context to effectively extract text features by focusing on high-level features for unseen classes. We have conducted experiments across 11 datasets, and overall, AAPL shows favorable performances compared to the existing methods in few-shot learning, zero-shot learning, cross-dataset, and domain generalization tasks.
https://arxiv.org/abs/2404.16804
Although the capabilities of large language models (LLMs) ideally scale up with increasing data and compute, they are inevitably constrained by limited resources in reality. Suppose we have a moderately trained LLM (e.g., trained to align with human preference) in hand, can we further exploit its potential and cheaply acquire a stronger model? In this paper, we propose a simple method called ExPO to boost LLMs' alignment with human preference. ExPO assumes that a medium-aligned model can be interpolated between a less-aligned (weaker) model, e.g., the initial SFT model, and a better-aligned (stronger) one, thereby directly obtaining this stronger model by extrapolating from the weights of the former two relatively weaker models. On the AlpacaEval 2.0 benchmark, we show that ExPO pushes models trained with less preference data (e.g., 10% or 20%) to reach and even surpass the fully-trained one, without any additional training. Furthermore, ExPO also significantly improves off-the-shelf DPO/RLHF models and exhibits decent scalability across model sizes from 7B to 70B. Our work demonstrates the efficacy of model extrapolation in exploiting LLMs' capabilities, suggesting a promising direction that deserves future exploration.
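The extrapolation itself reduces to a single weight-space update: starting from the weaker (e.g., SFT) weights and the medium-aligned weights, step further along the alignment direction. A minimal sketch, where the state-dict interface and the coefficient alpha are illustrative assumptions:

```python
def expo_extrapolate(weak_state: dict, aligned_state: dict, alpha: float = 0.5) -> dict:
    """theta_stronger = theta_aligned + alpha * (theta_aligned - theta_weak)."""
    return {
        name: aligned_state[name] + alpha * (aligned_state[name] - weak_state[name])
        for name in aligned_state
    }

# Usage sketch (hypothetical model objects):
# stronger = expo_extrapolate(sft_model.state_dict(), dpo_model.state_dict(), alpha=0.3)
# model.load_state_dict(stronger)
```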
https://arxiv.org/abs/2404.16792
Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world. These categories, due to their inherent complexity and diversity, effectively simulate real-world text-rich environments. We further conduct a thorough evaluation involving 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision and Claude-3-Opus) and emphasize the current limitations of MLLMs in text-rich visual comprehension. We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at this https URL.
https://arxiv.org/abs/2404.16790
The recent success of large language models (LLMs) trained on static, pre-collected, general datasets has sparked numerous research directions and applications. One such direction addresses the non-trivial challenge of integrating pre-trained LLMs into dynamic data distributions, task structures, and user preferences. Pre-trained LLMs, when tailored for specific needs, often experience significant performance degradation in previous knowledge domains -- a phenomenon known as "catastrophic forgetting". While extensively studied in the continual learning (CL) community, it presents new manifestations in the realm of LLMs. In this survey, we provide a comprehensive overview of the current research progress on LLMs within the context of CL. This survey is structured into four main sections: we first describe an overview of continually learning LLMs, consisting of two directions of continuity: vertical continuity (or vertical continual learning), i.e., continual adaptation from general to specific capabilities, and horizontal continuity (or horizontal continual learning), i.e., continual adaptation across time and domains (Section 3). We then summarize three stages of learning LLMs in the context of modern CL: Continual Pre-Training (CPT), Domain-Adaptive Pre-training (DAP), and Continual Fine-Tuning (CFT) (Section 4). Then we provide an overview of evaluation protocols for continual learning with LLMs, along with the current available data sources (Section 5). Finally, we discuss intriguing questions pertaining to continual learning for LLMs (Section 6). The full list of papers examined in this survey is available at this https URL.
https://arxiv.org/abs/2404.16789
In human neuroimaging studies, atlas registration enables mapping MRI scans to a common coordinate frame, which is necessary to aggregate data from multiple subjects. Machine learning registration methods have achieved excellent speed and accuracy but lack interpretability. More recently, keypoint-based methods have been proposed to tackle this issue, but their accuracy is still subpar, particularly when fitting nonlinear transforms. Here we propose Registration by Regression (RbR), a novel atlas registration framework that is highly robust and flexible, conceptually simple, and can be trained with cheaply obtained data. RbR predicts the (x,y,z) atlas coordinates for every voxel of the input scan (i.e., every voxel is a keypoint), and then uses closed-form expressions to quickly fit transforms using a wide array of possible deformation models, including affine and nonlinear (e.g., Bspline, Demons, invertible diffeomorphic models, etc.). Robustness is provided by the large number of voxels informing the registration and can be further increased by robust estimators like RANSAC. Experiments on independent public datasets show that RbR yields more accurate registration than competing keypoint approaches, while providing full control of the deformation model.
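With every voxel predicting its atlas coordinates, an affine transform can be fit in closed form by least squares over all voxel/coordinate pairs (robust variants such as RANSAC simply subsample these pairs). The sketch below assumes (N, 3) coordinate arrays and plain least squares:

```python
import numpy as np

def fit_affine(voxel_xyz: np.ndarray, pred_atlas_xyz: np.ndarray) -> np.ndarray:
    """Least-squares 3x4 affine A such that A @ [x, y, z, 1]^T approximates the atlas coords.

    voxel_xyz, pred_atlas_xyz: (N, 3) arrays of scan-space voxel coordinates and the
    network's predicted atlas coordinates for those voxels."""
    X = np.hstack([voxel_xyz, np.ones((len(voxel_xyz), 1))])   # (N, 4) homogeneous coords
    B, *_ = np.linalg.lstsq(X, pred_atlas_xyz, rcond=None)     # (4, 3), scan -> atlas
    return B.T                                                  # (3, 4) affine matrix
```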
https://arxiv.org/abs/2404.16781
The success of many RL techniques heavily relies on human-engineered dense rewards, which typically demand substantial domain expertise and extensive trial and error. In our work, we propose DrS (Dense reward learning from Stages), a novel approach for learning reusable dense rewards for multi-stage tasks in a data-driven manner. By leveraging the stage structures of the task, DrS learns a high-quality dense reward from sparse rewards and demonstrations if given. The learned rewards can be reused in unseen tasks, thus reducing the human effort for reward engineering. Extensive experiments on three physical robot manipulation task families with 1000+ task variants demonstrate that our learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms. The learned rewards even achieve comparable performance to human-engineered rewards on some tasks. See our project page (this https URL) for more details.
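One common way to realise a stage-wise dense reward is to train a per-stage success discriminator and combine the completed-stage index with the discriminator's confidence for the current stage; the sketch below is a hedged illustration of that idea under assumed interfaces, not DrS's exact formulation:

```python
import torch
import torch.nn as nn

class StageReward(nn.Module):
    """Dense reward: completed-stage index plus a learned progress score for the current stage."""
    def __init__(self, obs_dim: int, num_stages: int, hidden: int = 256):
        super().__init__()
        self.discriminators = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for _ in range(num_stages)
        )

    def forward(self, obs: torch.Tensor, stage_idx: int) -> torch.Tensor:
        progress = torch.sigmoid(self.discriminators[stage_idx](obs)).squeeze(-1)
        return stage_idx + progress        # monotone across stages, dense within a stage
```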
https://arxiv.org/abs/2404.16779
Representation-based Siamese networks have risen to popularity in lightweight text matching due to their low deployment and inference costs. While word-level attention mechanisms have been implemented within Siamese networks to improve performance, we propose Feature Attention (FA), a novel downstream block designed to enrich the modeling of dependencies among embedding features. Employing "squeeze-and-excitation" techniques, the FA block dynamically adjusts the emphasis on individual features, enabling the network to concentrate more on features that significantly contribute to the final classification. Building upon FA, we introduce a dynamic "selection" mechanism called Selective Feature Attention (SFA), which leverages a stacked BiGRU Inception structure. The SFA block facilitates multi-scale semantic extraction by traversing different stacked BiGRU layers, encouraging the network to selectively concentrate on semantic information and embedding features across varying levels of abstraction. Both the FA and SFA blocks offer a seamless integration capability with various Siamese networks, showcasing a plug-and-play characteristic. Experimental evaluations conducted across diverse text matching baselines and benchmarks underscore the indispensability of modeling feature attention and the superiority of the "selection" mechanism.
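The FA block as described maps onto a standard squeeze-and-excitation gate applied over the embedding (feature) dimension; the layer sizes and reduction ratio below are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    """Squeeze-and-excitation over embedding features: re-weights each feature channel."""
    def __init__(self, feat_dim: int, reduction: int = 4):
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // reduction), nn.ReLU(),
            nn.Linear(feat_dim // reduction, feat_dim), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, T, feat_dim)
        squeezed = x.mean(dim=1)                           # "squeeze": pool over tokens
        gate = self.excite(squeezed).unsqueeze(1)          # "excite": per-feature weights
        return x * gate                                    # emphasise informative features
```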
https://arxiv.org/abs/2404.16776