This paper discusses the results of the third edition of the Monocular Depth Estimation Challenge (MDEC). The challenge focuses on zero-shot generalization to the challenging SYNS-Patches dataset, featuring complex scenes in natural and indoor settings. As with the previous edition, methods can use any form of supervision, i.e. supervised or self-supervised. The challenge received a total of 19 submissions that outperformed the baseline on the test set; 10 of these teams submitted a report describing their approach, highlighting the widespread use of foundation models such as Depth Anything at the core of their methods. The challenge winners drastically improved 3D F-Score performance, from 17.51% to 23.72%.
https://arxiv.org/abs/2404.16831
Our objective is to discover and localize monotonic temporal changes in a sequence of images. To achieve this, we exploit a simple proxy task of ordering a shuffled image sequence, with 'time' serving as a supervisory signal since only changes that are monotonic with time can give rise to the correct ordering. We also introduce a flexible transformer-based model for general-purpose ordering of image sequences of arbitrary length with built-in attribution maps. After training, the model successfully discovers and localizes monotonic changes while ignoring cyclic and stochastic ones. We demonstrate applications of the model in multiple video settings covering different scene and object types, discovering both object-level and environmental changes in unseen sequences. We also demonstrate that the attention-based attribution maps function as effective prompts for segmenting the changing regions, and that the learned representations can be used for downstream applications. Finally, we show that the model achieves the state of the art on standard benchmarks for ordering a set of images.
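A minimal sketch of the proxy-task supervision described above — shuffle a clip and ask the model to recover each frame's temporal rank. The function names and the cross-entropy head are illustrative, not the paper's transformer architecture:

```python
import torch
import torch.nn.functional as F

def make_ordering_batch(frames: torch.Tensor):
    """Shuffle a (T, C, H, W) clip; 'time' itself provides the labels."""
    t = frames.shape[0]
    perm = torch.randperm(t)       # shuffled[i] = frames[perm[i]]
    return frames[perm], perm      # label at position i: the frame's true time index

def ordering_loss(logits: torch.Tensor, ranks: torch.Tensor) -> torch.Tensor:
    """logits: (T, T) scores, row i scoring the temporal rank of shuffled frame i."""
    return F.cross_entropy(logits, ranks)
```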
https://arxiv.org/abs/2404.16828
With the advent of virtual reality technology, omnidirectional image (ODI) rescaling techniques are increasingly embraced for reducing transmitted and stored file sizes while preserving high image quality. Despite this progress, current ODI rescaling methods predominantly focus on enhancing the quality of images in equirectangular projection (ERP) format, overlooking the fact that the content viewed on head-mounted displays (HMDs) is actually a rendered viewport rather than an ERP image. In this work, we emphasize that focusing solely on ERP quality results in inferior viewport visual experiences for users. Thus, we propose ResVR, the first comprehensive framework for the joint Rescaling and Viewport Rendering of ODIs. ResVR obtains low-resolution (LR) ERP images for transmission while rendering high-quality viewports for users to watch on HMDs. In ResVR, a novel discrete pixel sampling strategy is developed to tackle the complex mapping between the viewport and the ERP, enabling end-to-end training of the ResVR pipeline. Furthermore, a spherical pixel shape representation technique is derived from spherical differentiation to significantly improve the visual quality of rendered viewports. Extensive experiments demonstrate that ResVR outperforms existing methods in viewport rendering tasks across different fields of view, resolutions, and view directions while keeping a low transmission overhead.
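For context, the viewport-versus-ERP distinction comes down to a spherical camera mapping. The sketch below shows the conventional ERP-to-viewport pixel correspondence (bilinear sampling at the returned coordinates renders the viewport); it illustrates the mapping ResVR must handle, not the paper's discrete pixel sampling strategy itself:

```python
import numpy as np

def viewport_grid(fov_deg, yaw_deg, pitch_deg, out_h, out_w, erp_h, erp_w):
    """Per-pixel (row, col) ERP coordinates for a perspective viewport."""
    f = 0.5 * out_w / np.tan(0.5 * np.radians(fov_deg))    # focal length in pixels
    x = np.arange(out_w) - 0.5 * (out_w - 1)
    y = np.arange(out_h) - 0.5 * (out_h - 1)
    xv, yv = np.meshgrid(x, y)                             # (out_h, out_w) grids
    rays = np.stack([xv, yv, np.full_like(xv, f)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)   # unit viewing rays

    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])        # rotate about the vertical axis
    rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])     # tilt up/down
    rays = rays @ (ry @ rx).T

    lon = np.arctan2(rays[..., 0], rays[..., 2])           # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(rays[..., 1], -1.0, 1.0))      # latitude in [-pi/2, pi/2]
    col = (lon / (2 * np.pi) + 0.5) * (erp_w - 1)
    row = (lat / np.pi + 0.5) * (erp_h - 1)
    return row, col
```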
https://arxiv.org/abs/2404.16825
AI-generated video has revolutionized short video production, filmmaking, and personalized media, making local video editing an essential tool. However, this progress also blurs the line between reality and fiction, posing challenges in multimedia forensics. To address this urgent issue, we propose V2A-Mark, which tackles the limitations of current video tampering forensics, such as poor generalizability, singular function, and single-modality focus. Combining the fragility of video-into-video steganography with deep robust watermarking, our method can embed invisible visual-audio localization watermarks and copyright watermarks into the original video frames and audio, enabling precise manipulation localization and copyright protection. We also design a temporal alignment and fusion module and degradation prompt learning to enhance localization accuracy and decoding robustness. Meanwhile, we introduce a sample-level audio localization method and a cross-modal copyright extraction mechanism to couple the information of audio and video frames. The effectiveness of V2A-Mark has been verified on a visual-audio tampering dataset, emphasizing its superiority in localization precision and copyright accuracy, which is crucial for the sustainable development of video editing in the AIGC era.
https://arxiv.org/abs/2404.16824
In this report, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) that bridges the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements: (1) Strong Vision Encoder: we explored a continuous learning strategy for the large-scale vision foundation model InternViT-6B, boosting its visual understanding capabilities and allowing it to be transferred and reused across different LLMs. (2) Dynamic High-Resolution: we divide images into 1 to 40 tiles of 448×448 pixels according to the aspect ratio and resolution of the input images, supporting input up to 4K resolution. (3) High-Quality Bilingual Dataset: we carefully collected a high-quality bilingual dataset that covers common scenes and document images, annotated with English and Chinese question-answer pairs, significantly enhancing performance on OCR- and Chinese-related tasks. We evaluate InternVL 1.5 through a series of benchmarks and comparative studies. Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results on 8 of 18 benchmarks. Code has been released at this https URL.
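One plausible reading of the dynamic high-resolution step, as a hedged sketch — the grid-selection rule and helper names are assumptions, not InternVL 1.5's exact implementation:

```python
from itertools import product
from PIL import Image

def dynamic_tiles(img: Image.Image, tile: int = 448, max_tiles: int = 40):
    """Split an image into 1..max_tiles tiles of tile x tile pixels.

    Picks the (cols, rows) grid whose aspect ratio best matches the input,
    resizes the image to that grid, then crops it into tiles.
    """
    ar = img.width / img.height
    grids = [(c, r) for c, r in product(range(1, max_tiles + 1), repeat=2)
             if c * r <= max_tiles]                       # all grids within budget
    cols, rows = min(grids, key=lambda g: abs(g[0] / g[1] - ar))
    resized = img.resize((cols * tile, rows * tile))
    return [resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]
```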
https://arxiv.org/abs/2404.16821
While text-to-image (T2I) generative models have become ubiquitous, they do not necessarily generate images that align with a given prompt. While previous work has evaluated T2I alignment by proposing metrics, benchmarks, and templates for collecting human judgements, the quality of these components is not systematically measured. Human-rated prompt sets are generally small and the reliability of the ratings -- and thereby the prompt set used to compare models -- is not evaluated. We address this gap by performing an extensive study evaluating auto-eval metrics and human templates. We provide three main contributions: (1) We introduce a comprehensive skills-based benchmark that can discriminate models across different human templates. This skills-based benchmark categorises prompts into sub-skills, allowing a practitioner to pinpoint not only which skills are challenging, but at what level of complexity a skill becomes challenging. (2) We gather human ratings across four templates and four T2I models for a total of >100K annotations. This allows us to understand where differences arise due to inherent ambiguity in the prompt and where they arise due to differences in metric and model quality. (3) Finally, we introduce a new QA-based auto-eval metric that is better correlated with human ratings than existing metrics for our new dataset, across different human templates, and on TIFA160.
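For illustration, a QA-based T2I alignment metric generally scores an image by the fraction of prompt-derived questions a VQA model answers correctly. The sketch below shows that generic recipe with hypothetical callables, not the paper's exact metric:

```python
def qa_alignment_score(prompt, image, gen_questions, vqa_answer):
    """Generic QA-based T2I alignment score.

    gen_questions: prompt -> list of (question, expected_answer) pairs,
                   typically produced by an LLM.
    vqa_answer:    (image, question) -> answer string from a VQA model.
    Returns the fraction of questions answered as expected.
    """
    qa_pairs = gen_questions(prompt)
    correct = sum(vqa_answer(image, q).strip().lower() == a.strip().lower()
                  for q, a in qa_pairs)
    return correct / len(qa_pairs) if qa_pairs else 0.0
```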
https://arxiv.org/abs/2404.16820
Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global categories within an image corpus without any form of annotation. Building upon recent advances in self-supervised representation learning, we focus on how to leverage these large pre-trained models for the downstream task of unsupervised segmentation. We present PriMaPs - Principal Mask Proposals - decomposing images into semantically meaningful masks based on their feature representation. This allows us to realize unsupervised semantic segmentation by fitting class prototypes to PriMaPs with a stochastic expectation-maximization algorithm, PriMaPs-EM. Despite its conceptual simplicity, PriMaPs-EM leads to competitive results across various pre-trained backbone models, including DINO and DINOv2, and across datasets, such as Cityscapes, COCO-Stuff, and Potsdam-3. Importantly, PriMaPs-EM is able to boost results when applied orthogonally to current state-of-the-art unsupervised semantic segmentation pipelines.
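A hedged sketch of one stochastic EM step of the kind PriMaPs-EM performs — assign mask descriptors to their nearest class prototype, then nudge the prototypes toward their assigned features. The update rule and names here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def em_step(mask_feats: torch.Tensor, prototypes: torch.Tensor, lr: float = 0.05):
    """One stochastic EM step fitting K class prototypes to mask descriptors.

    mask_feats: (N, D) mean features of mask proposals in a batch.
    prototypes: (K, D) current class prototypes.
    """
    feats = F.normalize(mask_feats, dim=-1)
    protos = F.normalize(prototypes, dim=-1)
    assign = (feats @ protos.T).argmax(dim=-1)      # E-step: hard assignment
    new = prototypes.clone()
    for k in assign.unique():                       # M-step: move prototypes
        new[k] = (1 - lr) * prototypes[k] + lr * feats[assign == k].mean(dim=0)
    return F.normalize(new, dim=-1), assign
```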
https://arxiv.org/abs/2404.16818
Addressing the challenges of rare diseases is difficult, especially with the limited number of reference images and a small patient population. This is more evident in rare skin diseases, where we encounter long-tailed data distributions that make it difficult to develop unbiased and broadly effective models. The diverse ways in which image datasets are gathered, and their distinct purposes, also add to these challenges. Our study conducts a detailed examination of the benefits and drawbacks of episodic and conventional training methodologies, adopting a few-shot learning approach alongside transfer learning. We evaluated our models using the ISIC2018, Derm7pt, and SD-198 datasets. With minimal labeled examples, our models showed substantial information gains and better performance compared to previously trained models. Our research emphasizes the improved ability of DenseNet121 and MobileNetV2 models to represent features, achieved by using models pre-trained on ImageNet to increase within-class similarity. Moreover, our experiments, ranging from 2-way to 5-way classification with up to 10 examples, showed a growing success rate for traditional transfer learning methods as the number of examples increased. The addition of data augmentation techniques significantly improved the performance of our transfer-learning-based models, surpassing existing methods, especially on the SD-198 and ISIC2018 datasets. All source code related to this work will be made publicly available soon at the provided URL.
https://arxiv.org/abs/2404.16814
Comprehending text-rich visual content is paramount for the practical application of Multimodal Large Language Models (MLLMs), since text-rich scenarios are ubiquitous in the real world, which are characterized by the presence of extensive texts embedded within images. Recently, the advent of MLLMs with impressive versatility has raised the bar for what we can expect from MLLMs. However, their proficiency in text-rich scenarios has yet to be comprehensively and objectively assessed, since current MLLM benchmarks primarily focus on evaluating general visual comprehension. In this work, we introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating text-rich visual comprehension of MLLMs. Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs, each of which covers a wide spectrum of text-rich scenarios in the real world. These categories, due to their inherent complexity and diversity, effectively simulate real-world text-rich environments. We further conduct a thorough evaluation involving 34 prominent MLLMs (including GPT-4V, Gemini-Pro-Vision and Claude-3-Opus) and emphasize the current limitations of MLLMs in text-rich visual comprehension. We hope that our work can serve as a valuable addition to existing MLLM benchmarks, providing insightful observations and inspiring further research in the area of text-rich visual comprehension with MLLMs. The dataset and evaluation code can be accessed at this https URL.
https://arxiv.org/abs/2404.16790
In human neuroimaging studies, atlas registration enables mapping MRI scans to a common coordinate frame, which is necessary to aggregate data from multiple subjects. Machine learning registration methods have achieved excellent speed and accuracy but lack interpretability. More recently, keypoint-based methods have been proposed to tackle this issue, but their accuracy is still subpar, particularly when fitting nonlinear transforms. Here we propose Registration by Regression (RbR), a novel atlas registration framework that is highly robust and flexible, conceptually simple, and can be trained with cheaply obtained data. RbR predicts the (x,y,z) atlas coordinates for every voxel of the input scan (i.e., every voxel is a keypoint), and then uses closed-form expressions to quickly fit transforms using a wide array of possible deformation models, including affine and nonlinear (e.g., Bspline, Demons, invertible diffeomorphic models, etc.). Robustness is provided by the large number of voxels informing the registration and can be further increased by robust estimators like RANSAC. Experiments on independent public datasets show that RbR yields more accurate registration than competing keypoint approaches, while providing full control of the deformation model.
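The closed-form fitting step is ordinary least squares once every voxel carries a predicted atlas coordinate. A sketch for the affine case, with a simple RANSAC wrapper along the lines the paper mentions (sample sizes and tolerances are assumptions):

```python
import numpy as np

def fit_affine(voxel_xyz: np.ndarray, pred_atlas_xyz: np.ndarray) -> np.ndarray:
    """Closed-form least-squares affine fit from predicted atlas coordinates.

    voxel_xyz:      (N, 3) voxel coordinates in the input scan.
    pred_atlas_xyz: (N, 3) network-predicted atlas coordinates (every voxel a keypoint).
    Returns a 3x4 matrix A with pred ~= A @ [x, y, z, 1].
    """
    xh = np.hstack([voxel_xyz, np.ones((len(voxel_xyz), 1))])   # homogeneous coords
    a, *_ = np.linalg.lstsq(xh, pred_atlas_xyz, rcond=None)     # solve X A^T = Y
    return a.T

def fit_affine_ransac(voxel_xyz, pred_xyz, iters=100, tol=3.0, sample=50, seed=0):
    """Robustified fit: keep the model with the most residuals under `tol` voxels.

    Assumes N >= `sample`; hyperparameters are illustrative.
    """
    rng = np.random.default_rng(seed)
    xh = np.hstack([voxel_xyz, np.ones((len(voxel_xyz), 1))])
    best, best_inliers = None, -1
    for _ in range(iters):
        idx = rng.choice(len(voxel_xyz), size=sample, replace=False)
        a = fit_affine(voxel_xyz[idx], pred_xyz[idx])
        inliers = (np.linalg.norm(xh @ a.T - pred_xyz, axis=1) < tol).sum()
        if inliers > best_inliers:
            best, best_inliers = a, inliers
    return best
```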
https://arxiv.org/abs/2404.16781
Self-supervised contrastive learning has emerged as one of the most successful deep learning paradigms. In this regard, it has seen extensive use in image registration and, more recently, in the particular field of medical image registration. In this work, we test, extend, and improve a state-of-the-art framework for color fundus image registration, ConKeD. Using the ConKeD framework, we test multiple loss functions, adapting them to the framework and the application domain. Furthermore, we evaluate our models using the standardized benchmark dataset FIRE as well as several datasets that have never been used before for color fundus registration, for which we are releasing the pairing data as well as a standardized evaluation approach. Our work demonstrates state-of-the-art performance across all datasets and metrics, demonstrating several advantages over current SOTA color fundus registration methods.
https://arxiv.org/abs/2404.16773
While originally developed for continuous control problems, Proximal Policy Optimization (PPO) has emerged as the workhorse of a variety of reinforcement learning (RL) applications, including the fine-tuning of generative models. Unfortunately, PPO requires multiple heuristics to enable stable convergence (e.g., value networks, clipping) and is notorious for its sensitivity to the precise implementation of these components. In response, we take a step back and ask what a minimalist RL algorithm for the era of generative models would look like. We propose REBEL, an algorithm that cleanly reduces the problem of policy optimization to regressing the relative reward between two completions to a prompt via a direct policy parameterization, enabling a strikingly lightweight implementation. In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL, which allows us to match the strongest known theoretical guarantees in terms of convergence and sample complexity in the RL literature. REBEL can also cleanly incorporate offline data and handle the intransitive preferences we frequently see in practice. Empirically, we find that REBEL provides a unified approach to language modeling and image generation with stronger or similar performance as PPO and DPO, all while being simpler to implement and more computationally tractable than PPO.
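As we read the paper, REBEL's core update is a least-squares regression of the scaled log-probability-ratio difference between two completions onto their reward difference. A hedged sketch (treat the exact form as an assumption):

```python
import torch

def rebel_loss(logp_new_a: torch.Tensor, logp_new_b: torch.Tensor,
               logp_old_a: torch.Tensor, logp_old_b: torch.Tensor,
               reward_a: torch.Tensor, reward_b: torch.Tensor,
               eta: float = 1.0) -> torch.Tensor:
    """Squared-error objective on a pair of completions (a, b) to one prompt.

    logp_*: summed log-probabilities of each completion under the current
    (new) and previous-iterate (old) policies; reward_*: scalar rewards.
    """
    ratio_a = logp_new_a - logp_old_a      # log pi_theta(a|x) / pi_t(a|x)
    ratio_b = logp_new_b - logp_old_b
    pred = (ratio_a - ratio_b) / eta       # predicted relative reward
    target = reward_a - reward_b
    return ((pred - target) ** 2).mean()
```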
https://arxiv.org/abs/2404.16767
Developing generalist foundation models has recently attracted tremendous attention among researchers in the field of AI for Medicine (AI4Medicine). A pivotal insight in developing these models is their reliance on dataset scaling, which highlights the need for open-source medical image datasets that incorporate diverse supervision signals across various imaging modalities. In this paper, we introduce RadGenome-Chest CT, a comprehensive, large-scale, region-guided 3D chest CT interpretation dataset based on CT-RATE. Specifically, we leverage the latest powerful universal segmentation and large language models to extend the original dataset (over 25,692 non-contrast 3D chest CT volumes and reports from 20,000 patients) in the following aspects: (i) organ-level segmentation masks covering 197 categories, which provide intermediate reasoning visual clues for interpretation; (ii) 665K multi-granularity grounded reports, where each sentence of the report is linked to the corresponding anatomical region of the CT volume in the form of a segmentation mask; (iii) 1.3M grounded VQA pairs, where questions and answers are all linked with reference segmentation masks, enabling models to associate visual evidence with textual explanations. All grounded reports and VQA pairs in the validation set have gone through manual verification to ensure dataset quality. We believe that RadGenome-Chest CT can significantly advance the development of multimodal medical foundation models by training them to generate texts based on given segmentation regions, which is unattainable with previous relevant datasets. We will release all segmentation masks, grounded reports, and VQA pairs to facilitate further research and development in this field.
https://arxiv.org/abs/2404.16754
We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss, Threshold-Adaptive Loss Scaling (TALS), that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity, we need a prior over valid human poses, but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allow us to train on in-the-wild data while improving 3D accuracy over the state of the art. Our models and code are available for research at this https URL.
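The abstract leaves TALS's exact form unspecified. One plausible reading — residuals inside the "invalid distance" contribute no gradient, larger ones are penalized as usual — is sketched below as an assumption:

```python
import torch

def tals(loss: torch.Tensor, threshold: float) -> torch.Tensor:
    """Plausible reading of Threshold-Adaptive Loss Scaling (an assumption,
    not the paper's exact form): residuals under `threshold` contribute
    nothing, so only gross 2D / p-GT errors drive the gradient.
    """
    return torch.clamp(loss - threshold, min=0.0)
```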
https://arxiv.org/abs/2404.16752
This paper addresses the task of 3D clothed human generation from textual descriptions. Previous works usually encode the human body and clothes as a holistic model and generate the whole model in a single-stage optimization, which makes clothing editing difficult and forfeits fine-grained control over the whole generation process. To solve this, we propose a layer-wise clothed human representation combined with a progressive optimization strategy, which produces clothing-disentangled 3D human models while providing control over the generation process. The basic idea is to progressively generate a minimally clothed human body and layer-wise clothes. During clothing generation, a novel stratified compositional rendering method is proposed to fuse multi-layer human models, and a new loss function is utilized to help decouple the clothing model from the human body. The proposed method achieves high-quality disentanglement, thereby providing an effective way to generate 3D garments. Extensive experiments demonstrate that our approach achieves state-of-the-art 3D clothed human generation while also supporting cloth editing applications such as virtual try-on. Project page: this http URL
https://arxiv.org/abs/2404.16748
Cancelable biometrics is a challenging research field in which the security of an original biometric image is ensured by transforming it into another, irreversible domain. Several approaches have been suggested in the literature for generating cancelable biometric templates. In this paper, two novel and simple cancelable biometric template generation methods based on Random Walk (CBRW) are proposed. By employing a random walk and the other steps given in the two proposed algorithms, viz. CBRW-BitXOR and CBRW-BitCMP, the original biometric is transformed into a cancelable template. The performance of the proposed methods is compared with other state-of-the-art methods. Experiments have been performed on eight publicly available gray and color datasets, i.e., CP (ear) (gray and color), UTIRIS (iris) (gray and color), ORL (face) (gray), IIT Delhi (iris) (gray and color), and AR (face) (color). Performance of the generated templates is measured in terms of Correlation Coefficient (Cr), Root Mean Square Error (RMSE), Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), Mean Absolute Error (MAE), Number of Pixel Change Rate (NPCR), and Unified Average Changing Intensity (UACI). The experimental results show that the proposed methods are superior to other state-of-the-art methods in both qualitative and quantitative analysis. Furthermore, CBRW performs well on both gray and color images.
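A hedged sketch of the general recipe (the paper's exact CBRW-BitXOR procedure is not specified here): a seeded random walk stamps a noise image, which is XORed with the 8-bit biometric image; re-seeding cancels and re-issues the template:

```python
import numpy as np

def cbrw_bitxor(biometric: np.ndarray, seed: int, steps: int = 200_000) -> np.ndarray:
    """Illustrative random-walk XOR template for a grayscale uint8 image.

    A seeded random walk visits pixels and stamps random intensities into a
    noise image; XOR with the biometric yields the cancelable template.
    """
    rng = np.random.default_rng(seed)
    h, w = biometric.shape
    noise = np.zeros((h, w), dtype=np.uint8)
    r, c = h // 2, w // 2
    for _ in range(steps):                     # random walk over the image grid
        r = (r + rng.integers(-1, 2)) % h
        c = (c + rng.integers(-1, 2)) % w
        noise[r, c] = rng.integers(0, 256)     # stamp a random intensity
    return np.bitwise_xor(biometric, noise)
```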
https://arxiv.org/abs/2404.16739
We propose a novel multi-stage trans-dimensional architecture for multi-view cardiac image segmentation. Our method exploits the relationship between long-axis (2D) and short-axis (3D) magnetic resonance (MR) images to perform a sequential 3D-to-2D-to-3D segmentation of the long-axis and short-axis images. In the first stage, 3D segmentation is performed using the short-axis image, and the prediction is transformed to the long-axis view and used as a segmentation prior in the next stage. In the second stage, the heart region is localized and cropped around the segmentation prior using a Heart Localization and Cropping (HLC) module, focusing the subsequent model on the heart region of the image, where a 2D segmentation is performed. Similarly, we transform the long-axis prediction to the short-axis view, localize and crop the heart region, and again perform a 3D segmentation to refine the initial short-axis segmentation. We evaluate our proposed method on the Multi-Disease, Multi-View & Multi-Center Right Ventricular Segmentation in Cardiac MRI (M&Ms-2) dataset, where it outperforms state-of-the-art methods in segmenting cardiac regions of interest in both short-axis and long-axis images. The pre-trained models, source code, and implementation details will be publicly available.
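A minimal sketch of what an HLC-style step amounts to — crop around the bounding box of the prior mask plus a safety margin; the margin and fallback behavior are illustrative assumptions:

```python
import numpy as np

def crop_around_prior(image: np.ndarray, prior_mask: np.ndarray, margin: int = 16):
    """Crop a 2D image to the bounding box of a segmentation prior plus a margin.

    Returns the crop and its (row, col) offset so predictions can be mapped back.
    """
    rows, cols = np.nonzero(prior_mask)
    if rows.size == 0:                 # empty prior: fall back to the full image
        return image, (0, 0)
    r0 = max(rows.min() - margin, 0)
    r1 = min(rows.max() + margin + 1, image.shape[0])
    c0 = max(cols.min() - margin, 0)
    c1 = min(cols.max() + margin + 1, image.shape[1])
    return image[r0:r1, c0:c1], (r0, c0)
```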
https://arxiv.org/abs/2404.16708
This paper reports on the NTIRE 2024 Quality Assessment of AI-Generated Content Challenge, held in conjunction with the New Trends in Image Restoration and Enhancement Workshop (NTIRE) at CVPR 2024. The challenge addresses a major problem in the field of image and video processing, namely Image Quality Assessment (IQA) and Video Quality Assessment (VQA) for AI-Generated Content (AIGC). The challenge is divided into an image track and a video track. The image track uses AIGIQA-20K, which contains 20,000 AI-Generated Images (AIGIs) generated by 15 popular generative models. The image track had a total of 318 registered participants; 1,646 submissions were received in the development phase and 221 in the test phase. Finally, 16 participating teams submitted their models and fact sheets. The video track uses T2VQA-DB, which contains 10,000 AI-Generated Videos (AIGVs) generated by 9 popular Text-to-Video (T2V) models. A total of 196 participants registered in the video track; 991 submissions were received in the development phase and 185 in the test phase. Finally, 12 participating teams submitted their models and fact sheets. Some methods achieved better results than the baseline methods, and the winning methods in both tracks demonstrated superior prediction performance on AIGC.
https://arxiv.org/abs/2404.16687
Colorizing grayscale images offers an engaging visual experience. Existing automatic colorization methods often fail to generate satisfactory results due to incorrect semantic colors and unsaturated colors. In this work, we propose an automatic colorization pipeline to overcome these challenges. We leverage the extraordinary generative ability of the diffusion prior to synthesize color with plausible semantics. To overcome the artifacts introduced by the diffusion prior, we apply luminance conditional guidance. Moreover, we adopt multimodal high-level semantic priors to help the model understand the image content and deliver saturated colors. In addition, a luminance-aware decoder is designed to restore details and enhance overall visual quality. The proposed pipeline synthesizes saturated colors while maintaining plausible semantics. Experiments indicate that our proposed method balances diversity and fidelity, surpassing previous methods in perceptual realism and gaining the most human preference.
https://arxiv.org/abs/2404.16678
While neural implicit representations have gained popularity in multi-view 3D reconstruction, previous work struggles to yield physically plausible results, thereby limiting their applications in physics-demanding domains like embodied AI and robotics. The lack of plausibility originates from both the absence of physics modeling in the existing pipeline and their inability to recover intricate geometrical structures. In this paper, we introduce PhyRecon, which stands as the first approach to harness both differentiable rendering and differentiable physics simulation to learn implicit surface representations. Our framework proposes a novel differentiable particle-based physical simulator seamlessly integrated with the neural implicit representation. At its core is an efficient transformation between SDF-based implicit representation and explicit surface points by our proposed algorithm, Surface Points Marching Cubes (SP-MC), enabling differentiable learning with both rendering and physical losses. Moreover, we model both rendering and physical uncertainty to identify and compensate for the inconsistent and inaccurate monocular geometric priors. The physical uncertainty additionally enables a physics-guided pixel sampling to enhance the learning of slender structures. By amalgamating these techniques, our model facilitates efficient joint modeling of appearance, geometry, and physics. Extensive experiments demonstrate that PhyRecon significantly outperforms all state-of-the-art methods in terms of reconstruction quality. Our reconstruction results also yield superior physical stability, verified by Isaac Gym, with at least a 40% improvement across all datasets, opening broader avenues for future physics-based applications.
https://arxiv.org/abs/2404.16666