Generating high-fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation: they either require tedious case-by-case fine-tuning or tend to lose identity details during video generation. In this study, we present ID-Animator, a zero-shot human-video generation approach that performs personalized video generation given a single reference facial image, without further training. ID-Animator inherits existing diffusion-based video generation backbones and adds a face adapter that encodes ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information during video generation, we introduce an ID-oriented dataset construction pipeline, which incorporates a decoupled human attribute and action captioning technique applied to a constructed facial image pool. Based on this pipeline, a random face reference training method is further devised to precisely capture the ID-relevant embeddings from reference images, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator over previous models in generating personalized human videos. Moreover, our method is highly compatible with popular pre-trained T2V models such as AnimateDiff and various community backbone models, showing high extendability in real-world video generation applications where identity preservation is highly desired. Our codes and checkpoints will be released at this https URL.
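To make the adapter idea concrete, below is a minimal, hypothetical sketch of a face adapter in which learnable facial latent queries cross-attend to pre-extracted face embeddings and emit ID tokens for a frozen diffusion backbone; the dimensions, module layout, and assumed CLIP-style face features are illustrative assumptions, not the released ID-Animator code.

```python
# Hypothetical sketch of a face adapter: learnable facial latent queries
# cross-attend to face embeddings to produce ID tokens that a frozen video
# diffusion backbone can consume as extra cross-attention context.
import torch
import torch.nn as nn

class FaceAdapter(nn.Module):
    def __init__(self, face_dim=1024, ctx_dim=768, num_queries=16):
        super().__init__()
        # Learnable facial latent queries (one set, shared across identities).
        self.queries = nn.Parameter(torch.randn(num_queries, ctx_dim) * 0.02)
        self.proj = nn.Linear(face_dim, ctx_dim)   # map face features to context width
        self.attn = nn.MultiheadAttention(ctx_dim, num_heads=8, batch_first=True)
        self.out = nn.Linear(ctx_dim, ctx_dim)

    def forward(self, face_feats):                 # face_feats: (B, N, face_dim)
        kv = self.proj(face_feats)                 # (B, N, ctx_dim)
        q = self.queries.unsqueeze(0).expand(face_feats.size(0), -1, -1)
        id_tokens, _ = self.attn(q, kv, kv)        # queries attend to face features
        return self.out(id_tokens)                 # (B, num_queries, ctx_dim)

# Usage idea: concatenate ID tokens with the text-prompt embeddings before the
# backbone's cross-attention layers (the backbone itself stays frozen).
adapter = FaceAdapter()
id_tokens = adapter(torch.randn(2, 257, 1024))     # e.g. patch-level face features
print(id_tokens.shape)                             # torch.Size([2, 16, 768])
```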
https://arxiv.org/abs/2404.15275
Recent advancements in machine learning have led to novel imaging systems and algorithms that address ill-posed problems. Assessing their trustworthiness and understanding how to deploy them safely at test time remain important and open problems. We propose a method that leverages conformal prediction to retrieve upper/lower bounds and statistical inliers/outliers of reconstructions based on the prediction intervals of downstream metrics. We apply our method to sparse-view CT for downstream radiotherapy planning and show 1) that metric-guided bounds have valid coverage for downstream metrics while conventional pixel-wise bounds do not, and 2) that the upper/lower bounds produced by metric-guided and pixel-wise methods differ anatomically. Our work paves the way for more meaningful reconstruction bounds. Code available at this https URL
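As an illustration of the metric-guided idea, here is a minimal split-conformal sketch on a scalar downstream metric; the metric values, calibration split, and coverage level are synthetic placeholders, not the paper's CT or radiotherapy data.

```python
# A minimal sketch of split conformal prediction applied to a downstream
# metric (e.g. a dose statistic computed from each reconstruction), rather
# than to individual pixels.
import numpy as np

def conformal_interval(cal_pred, cal_true, test_pred, alpha=0.1):
    """Symmetric split-conformal interval for a scalar downstream metric."""
    scores = np.abs(cal_pred - cal_true)                 # nonconformity scores
    n = len(scores)
    # Finite-sample corrected quantile level for 1 - alpha coverage.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    qhat = np.quantile(scores, min(q_level, 1.0), method="higher")
    return test_pred - qhat, test_pred + qhat            # lower / upper bounds

rng = np.random.default_rng(0)
true_metric = rng.normal(60.0, 2.0, size=600)            # e.g. mean dose (made-up units)
pred_metric = true_metric + rng.normal(0.0, 1.0, size=600)
lo, hi = conformal_interval(pred_metric[:500], true_metric[:500],
                            pred_metric[500:], alpha=0.1)
coverage = np.mean((true_metric[500:] >= lo) & (true_metric[500:] <= hi))
print(f"empirical coverage ≈ {coverage:.2f}")             # ≈ 0.90 by construction
```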
https://arxiv.org/abs/2404.15274
Recent advancements in controllable human image generation have led to zero-shot generation using structural signals (e.g., pose, depth) or facial appearance. Yet, generating human images conditioned on multiple parts of human appearance remains challenging. Addressing this, we introduce Parts2Whole, a novel framework designed for generating customized portraits from multiple reference images, including pose images and various aspects of human appearance. To achieve this, we first develop a semantic-aware appearance encoder to retain details of different human parts; guided by its textual label, each image is processed into a series of multi-scale feature maps rather than a single image token, preserving the image dimension. Second, our framework supports multi-image conditioned generation through a shared self-attention mechanism that operates across reference and target features during the diffusion process. We enhance the vanilla attention mechanism by incorporating mask information from the reference human images, allowing for the precise selection of any part. Extensive experiments demonstrate the superiority of our approach over existing alternatives, offering advanced capabilities for multi-part controllable human image customization. See our project page at this https URL.
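A rough sketch of the mask-guided shared self-attention described above is given below; the head count, token shapes, and the choice to reuse keys as values are illustrative assumptions rather than the Parts2Whole implementation.

```python
# Target tokens attend over themselves plus reference tokens, while a binary
# mask over reference tokens restricts attention to the selected body parts.
import torch
import torch.nn.functional as F

def shared_masked_attention(q_tgt, kv_tgt, kv_ref, ref_mask, num_heads=8):
    """q_tgt, kv_tgt: (B, Nt, C); kv_ref: (B, Nr, C); ref_mask: (B, Nr), 1 = keep."""
    B, Nt, C = q_tgt.shape
    k = torch.cat([kv_tgt, kv_ref], dim=1)                       # (B, Nt+Nr, C)
    allow = torch.cat([torch.ones_like(kv_tgt[..., 0]), ref_mask], dim=1).bool()
    attn_mask = allow[:, None, None, :]                          # True = may be attended to
    h, d = num_heads, C // num_heads
    qh = q_tgt.view(B, Nt, h, d).transpose(1, 2)
    kh = k.view(B, -1, h, d).transpose(1, 2)
    out = F.scaled_dot_product_attention(qh, kh, kh, attn_mask=attn_mask)
    return out.transpose(1, 2).reshape(B, Nt, C)

tgt = torch.randn(2, 64, 256)                                    # target latent tokens
ref = torch.randn(2, 196, 256)                                   # reference image tokens
mask = (torch.rand(2, 196) > 0.5).float()                        # selected part regions
print(shared_masked_attention(tgt, tgt, ref, mask).shape)        # torch.Size([2, 64, 256])
```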
https://arxiv.org/abs/2404.15267
Radiance fields have demonstrated impressive performance in synthesizing lifelike 3D talking heads. However, because steep appearance changes are difficult to fit, the prevailing paradigm that presents facial motions by directly modifying point appearance may lead to distortions in dynamic regions. To tackle this challenge, we introduce TalkingGaussian, a deformation-based radiance-field framework for high-fidelity talking head synthesis. Leveraging point-based Gaussian Splatting, facial motions are represented in our method by applying smooth and continuous deformations to persistent Gaussian primitives, without requiring the difficult appearance changes to be learned as in previous methods. Thanks to this simplification, precise facial motions can be synthesized while keeping facial features highly intact. Under such a deformation paradigm, we further identify a face-mouth motion inconsistency that would affect the learning of detailed speaking motions. To address this conflict, we decompose the model into two branches, one for the face and one for the inside-mouth area, thereby simplifying the learning tasks and helping to reconstruct more accurate motion and structure of the mouth region. Extensive experiments demonstrate that our method renders high-quality lip-synchronized talking head videos, with better facial fidelity and higher efficiency than previous methods.
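The deformation paradigm can be illustrated with a toy sketch in which a small MLP, conditioned on a driving signal, predicts offsets that are added to persistent Gaussian centers while appearance is left untouched; the network, conditioning, and scales below are placeholders, not the TalkingGaussian architecture.

```python
# Toy sketch: facial motion as small, smooth offsets on persistent Gaussians.
import torch
import torch.nn as nn

class DeformationField(nn.Module):
    def __init__(self, cond_dim=32, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 + cond_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3 + 4 + 3))  # d_pos, d_rot (quat), d_scale

    def forward(self, xyz, cond):                    # xyz: (N, 3), cond: (cond_dim,)
        cond = cond.expand(xyz.size(0), -1)          # broadcast driving signal to all points
        return self.mlp(torch.cat([xyz, cond], dim=-1))

canonical_xyz = torch.randn(10000, 3)                # persistent Gaussian centers
field = DeformationField()
delta = field(canonical_xyz, torch.randn(32))        # per-point deformation parameters
deformed_xyz = canonical_xyz + 0.01 * delta[:, :3]   # geometry moves, appearance stays fixed
print(deformed_xyz.shape)
```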
https://arxiv.org/abs/2404.15264
We introduce a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and run global optimization. Compared to existing approaches, our design is accurate and robust to catastrophic failures. Code is available at this http URL
https://arxiv.org/abs/2404.15263
This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360-degree novel view synthesis (even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM).
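A stripped-down sketch of this kind of objective is shown below: depth (here the only free variable) is optimized by gradient descent so that the flow it induces, together with fixed intrinsics and pose, matches a target flow. The pinhole camera model, the omission of the paper's re-parameterizations and point tracks, and the synthetic target are simplifying assumptions.

```python
# Per-video least-squares flow-consistency optimization, heavily simplified.
import torch

def induced_flow(depth, K, T_ab, grid):
    """Flow from frame a to b induced by depth (H,W), intrinsics K (3,3),
    relative pose T_ab (4,4), and a pixel grid (H,W,2) of (x, y) coordinates."""
    H, W, _ = grid.shape
    ones = torch.ones(H, W, 1)
    pix = torch.cat([grid, ones], dim=-1)                       # homogeneous pixels
    rays = torch.einsum("ij,hwj->hwi", torch.linalg.inv(K), pix)
    pts = rays * depth.unsqueeze(-1)                            # back-project to 3D
    pts_h = torch.cat([pts, ones], dim=-1)
    pts_b = torch.einsum("ij,hwj->hwi", T_ab, pts_h)[..., :3]   # move to frame b
    proj = torch.einsum("ij,hwj->hwi", K, pts_b)
    uv_b = proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)
    return uv_b - grid                                          # induced 2D flow

H, W = 48, 64
grid = torch.stack(torch.meshgrid(torch.arange(W), torch.arange(H), indexing="xy"), -1).float()
depth = torch.full((H, W), 2.0, requires_grad=True)             # free variable
K = torch.tensor([[60.0, 0, W / 2], [0, 60.0, H / 2], [0, 0, 1.0]])
T = torch.eye(4); T[0, 3] = 0.1                                 # small sideways motion
target_flow = induced_flow(torch.full((H, W), 2.5), K, T, grid).detach()  # stand-in for estimated flow
opt = torch.optim.Adam([depth], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    loss = ((induced_flow(depth, K, T, grid) - target_flow) ** 2).mean()
    loss.backward(); opt.step()
print(float(loss), float(depth.mean()))                          # depth drifts toward ~2.5
```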
https://arxiv.org/abs/2404.15259
This paper presents the UniMER dataset to provide the first study on Mathematical Expression Recognition (MER) towards complex real-world scenarios. The UniMER dataset consists of a large-scale training set UniMER-1M offering an unprecedented scale and diversity with one million training instances and a meticulously designed test set UniMER-Test that reflects a diverse range of formula distributions prevalent in real-world scenarios. Therefore, the UniMER dataset enables the training of a robust and high-accuracy MER model and comprehensive evaluation of model performance. Moreover, we introduce the Universal Mathematical Expression Recognition Network (UniMERNet), an innovative framework designed to enhance MER in practical scenarios. UniMERNet incorporates a Length-Aware Module to process formulas of varied lengths efficiently, thereby enabling the model to handle complex mathematical expressions with greater accuracy. In addition, UniMERNet employs our UniMER-1M data and image augmentation techniques to improve the model's robustness under different noise conditions. Our extensive experiments demonstrate that UniMERNet outperforms existing MER models, setting a new benchmark in various scenarios and ensuring superior recognition quality in real-world applications. The dataset and model are available at this https URL.
https://arxiv.org/abs/2404.15254
When deploying pre-trained video object detectors in real-world scenarios, the domain gap between training and testing data caused by adverse image conditions often leads to performance degradation. Addressing this issue becomes particularly challenging when only the pre-trained model and degraded videos are available. Although various source-free domain adaptation (SFDA) methods have been proposed for single-frame object detectors, SFDA for video object detection (VOD) remains unexplored. Moreover, most unsupervised domain adaptation works for object detection rely on two-stage detectors, while SFDA for one-stage detectors, which are more vulnerable to fine-tuning, is not well addressed in the literature. In this paper, we propose Spatial-Temporal Alternate Refinement with Mean Teacher (STAR-MT), a simple yet effective SFDA method for VOD. Specifically, we aim to improve the performance of the one-stage VOD method, YOLOV, under adverse image conditions, including noise, air turbulence, and haze. Extensive experiments on the ImageNetVOD dataset and its degraded versions demonstrate that our method consistently improves video object detection performance in challenging imaging conditions, showcasing its potential for real-world applications.
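For illustration, a generic mean-teacher adaptation loop is sketched below; the spatial-temporal alternate refinement that gives STAR-MT its name is omitted, and the tiny network, perturbation, and consistency loss are placeholders rather than the paper's detector and losses.

```python
# Generic source-free mean-teacher loop: the teacher is an EMA copy of the
# student and supplies pseudo targets on unlabeled degraded frames.
import copy
import torch
import torch.nn as nn

def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.999):
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 5, 1))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.SGD(student.parameters(), lr=1e-3)
frames = torch.rand(2, 3, 32, 32)                        # unlabeled degraded frames
for _ in range(10):
    with torch.no_grad():
        pseudo = teacher(frames)                         # teacher predictions (pseudo targets)
    noisy = frames + 0.05 * torch.randn_like(frames)     # student sees a perturbed view
    loss = nn.functional.mse_loss(student(noisy), pseudo)
    opt.zero_grad(); loss.backward(); opt.step()
    ema_update(teacher, student)                         # teacher slowly follows the student
print(float(loss))
```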
https://arxiv.org/abs/2404.15252
Vision transformer based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of the specific segmentation task, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is to adapt the computation level to the specific needs of the input image rather than the current one-size-fits-all approach. To this end, we introduce ECO-M2F, or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incurs highly resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability while providing a balance between performance and computational efficiency, we present a three-step recipe. The first step is to train the parent architecture to enable early exiting from the encoder. The second step is to create a derived dataset of the ideal number of encoder layers required for each training example. The third step is to use the aforementioned derived dataset to train a gating network that predicts the number of encoder layers to be used, conditioned on the input image. Additionally, to change the computation-accuracy tradeoff, only steps two and three need to be repeated, which significantly reduces retraining time. Experiments on public datasets show that the proposed approach reduces the expected encoder computational cost while maintaining performance, adapts to various user compute resources, is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
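An illustrative sketch of the third step is given below: a small gating network maps pooled input features to an exit depth and the encoder stops after that many layers. In the paper the gate would be trained on the derived dataset from step two; the layer sizes and pooling here are assumptions.

```python
# Gated early-exit encoder: the gate predicts how many layers to run per image.
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    def __init__(self, dim=256, num_layers=6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(num_layers))
        # Gating head: pooled input features -> scores over possible exit depths.
        self.gate = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                  nn.Linear(128, num_layers))

    def forward(self, tokens):                 # tokens: (B, N, dim); B = 1 at inference
        depth = int(self.gate(tokens.mean(dim=1)).argmax(dim=-1).item()) + 1
        for layer in self.layers[:depth]:      # early exit after `depth` layers
            tokens = layer(tokens)
        return tokens, depth

enc = GatedEncoder()
feats, used = enc(torch.randn(1, 100, 256))
print(feats.shape, "layers used:", used)
```

In training, the gate's scores would be supervised with the ideal per-example layer count from the derived dataset (e.g. via cross-entropy), leaving the rest of the architecture untouched.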
https://arxiv.org/abs/2404.15244
Face recognition applications have grown in parallel with the size of datasets, the complexity of deep learning models, and computational power. However, while deep learning models evolve to become more capable and computational power keeps increasing, the available datasets are being retracted and removed from public access. Privacy and ethical concerns are relevant topics within these domains. Through generative artificial intelligence, researchers have put effort into the development of completely synthetic datasets that can be used to train face recognition systems. Nonetheless, recent advances have not been sufficient to achieve performance comparable to state-of-the-art models trained on real data. To study the drift between the performance of models trained on real and synthetic datasets, we leverage a massive attribute classifier (MAC) to create annotations for four datasets: two real and two synthetic. From these annotations, we conduct studies on the distribution of each attribute within all four datasets. Additionally, we further inspect the differences between real and synthetic datasets on the attribute set. When comparing them through the Kullback-Leibler divergence, we found differences between real and synthetic samples. Interestingly, we verified that while real samples suffice to explain the synthetic distribution, the reverse is far from true.
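For concreteness, the per-attribute comparison could look like the sketch below, which computes the (asymmetric) Kullback-Leibler divergence between hypothetical real and synthetic attribute frequencies; the attribute names and numbers are invented for demonstration, and the asymmetry of KL is exactly what allows the two directions to tell different stories.

```python
# Per-attribute KL divergence between real and synthetic attribute frequencies.
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Illustrative attribute frequencies (fraction of images showing the attribute).
real_attr      = {"eyeglasses": 0.12, "smiling": 0.48, "male": 0.52}
synthetic_attr = {"eyeglasses": 0.03, "smiling": 0.61, "male": 0.55}

for name in real_attr:
    p = [real_attr[name], 1 - real_attr[name]]             # Bernoulli distribution
    q = [synthetic_attr[name], 1 - synthetic_attr[name]]
    print(f"{name:11s}  KL(real||synth) = {kl_divergence(p, q):.4f}  "
          f"KL(synth||real) = {kl_divergence(q, p):.4f}")    # KL is asymmetric
```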
https://arxiv.org/abs/2404.15234
Inverse graphics -- the task of inverting an image into physical variables that, when rendered, enable reproduction of the observed scene -- is a fundamental challenge in computer vision and graphics. Disentangling an image into its constituent elements, such as the shape, color, and material properties of the objects of the 3D scene that produced it, requires a comprehensive understanding of the environment. This requirement limits the ability of existing carefully engineered approaches to generalize across domains. Inspired by the zero-shot ability of large language models (LLMs) to generalize to novel contexts, we investigate the possibility of leveraging the broad world knowledge encoded in such models in solving inverse-graphics problems. To this end, we propose the Inverse-Graphics Large Language Model (IG-LLM), an inverse-graphics framework centered around an LLM, that autoregressively decodes a visual embedding into a structured, compositional 3D-scene representation. We incorporate a frozen pre-trained visual encoder and a continuous numeric head to enable end-to-end training. Through our investigation, we demonstrate the potential of LLMs to facilitate inverse graphics through next-token prediction, without the use of image-space supervision. Our analysis opens up new possibilities for precise spatial reasoning about images that exploit the visual knowledge of LLMs. We will release our code and data to ensure the reproducibility of our investigation and to facilitate future research at this https URL
https://arxiv.org/abs/2404.15228
Embodied reasoning systems integrate robotic hardware and cognitive processes to perform complex tasks, typically in response to a natural language query about a specific physical environment. This usually involves changing the belief about the scene or physically interacting with and changing the scene (e.g. 'Sort the objects from lightest to heaviest'). In order to facilitate the development of such systems, we introduce a new simulation environment that makes use of the MuJoCo physics engine and the high-quality renderer Blender to provide realistic visual observations that are also accurate to the physical state of the scene. Together with the simulator, we propose a new benchmark composed of 10 classes of multi-step reasoning scenarios that require simultaneous visual and physical measurements. Finally, we develop a new modular Closed Loop Interactive Reasoning (CLIER) approach that takes into account the measurements of non-visual object properties, changes in the scene caused by external disturbances, as well as uncertain outcomes of robotic actions. We extensively evaluate our reasoning approach in simulation and in real-world manipulation tasks, with success rates above 76% and 64%, respectively.
https://arxiv.org/abs/2404.15194
Recently, implicit neural representations (INR) have made significant strides in various vision-related domains, providing a novel solution for Multispectral and Hyperspectral Image Fusion (MHIF) tasks. However, INR is prone to losing high-frequency information and lacks global perceptual capability. To address these issues, this paper introduces a Fourier-enhanced Implicit Neural Fusion Network (FeINFN) specifically designed for the MHIF task, motivated by the following phenomenon: the Fourier amplitudes of the HR-HSI latent code and the LR-HSI are remarkably similar, whereas their phases exhibit different patterns. In FeINFN, we innovatively propose a spatial and frequency implicit fusion function (Spa-Fre IFF), helping INR capture high-frequency information and expanding the receptive field. Besides, a new decoder employing a complex Gabor wavelet activation function, called the Spatial-Frequency Interactive Decoder (SFID), is introduced to enhance the interaction of INR features. In particular, we further prove theoretically that the Gabor wavelet activation possesses a time-frequency tightness property that favors learning the optimal bandwidths in the decoder. Experiments on two benchmark MHIF datasets verify the state-of-the-art (SOTA) performance of the proposed method, both visually and quantitatively. Ablation studies further demonstrate the mentioned contributions. The code will be available on Anonymous GitHub (https://anonymous.4open.science/r/FeINFN-15C9/) after possible acceptance.
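A minimal sketch of a complex Gabor wavelet activation of the kind referred to above is shown below (an oscillation under a Gaussian envelope, hence well localized in both time and frequency); FeINFN's exact formulation, parameters, and complex-valued layers may differ.

```python
# Complex Gabor wavelet activation for an implicit network layer.
import torch
import torch.nn as nn

class ComplexGaborActivation(nn.Module):
    """psi(x) = exp(i * omega0 * x - (scale0 * |x|)^2): a complex oscillation
    under a Gaussian envelope, localized in both time and frequency."""
    def __init__(self, omega0: float = 10.0, scale0: float = 10.0):
        super().__init__()
        self.omega0, self.scale0 = omega0, scale0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.to(torch.cfloat)
        return torch.exp(1j * self.omega0 * x - (self.scale0 * x.abs()) ** 2)

# Single illustrative layer; deeper networks would need complex-valued weights
# in the subsequent linear layers.
layer = nn.Sequential(nn.Linear(2, 64), ComplexGaborActivation())
out = layer(torch.rand(8, 2))
print(out.dtype, out.shape)                 # torch.complex64, torch.Size([8, 64])
```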
https://arxiv.org/abs/2404.15174
With the increasing maturity of text-to-image and image-to-image generative models, AI-generated images (AGIs) have shown great application potential in advertisement, entertainment, education, social media, etc. Although remarkable advancements have been achieved in generative models, very little effort has been devoted to designing relevant quality assessment models. In this paper, we propose a novel blind image quality assessment (IQA) network, named AMFF-Net, for AGIs. AMFF-Net evaluates AGI quality from three dimensions, i.e., "visual quality", "authenticity", and "consistency". Specifically, inspired by the characteristics of the human visual system and motivated by the observation that "visual quality" and "authenticity" are characterized by both local and global aspects, AMFF-Net scales the image up and down and takes the scaled images and the original-sized image as inputs to obtain multi-scale features. After that, an Adaptive Feature Fusion (AFF) block is used to adaptively fuse the multi-scale features with learnable weights. In addition, considering the correlation between the image and the prompt, AMFF-Net compares semantic features from the text encoder and the image encoder to evaluate text-to-image alignment. We carry out extensive experiments on three AGI quality assessment databases, and the experimental results show that AMFF-Net obtains better performance than nine state-of-the-art blind IQA methods. The results of ablation experiments further demonstrate the effectiveness of the proposed multi-scale input strategy and AFF block.
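The adaptive fusion step can be sketched as below: per-scale features are combined with learned, input-dependent softmax weights. The feature dimension, pooling, and scoring head are illustrative assumptions, not AMFF-Net's actual design.

```python
# Adaptive fusion of multi-scale features with learnable, input-dependent weights.
import torch
import torch.nn as nn

class AdaptiveFeatureFusion(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)     # one scalar score per scale

    def forward(self, feats):              # feats: list of (B, dim) per-scale features
        stacked = torch.stack(feats, dim=1)                  # (B, S, dim)
        weights = torch.softmax(self.score(stacked), dim=1)  # (B, S, 1), sums to 1 over scales
        return (weights * stacked).sum(dim=1)                # (B, dim) fused feature

aff = AdaptiveFeatureFusion()
f_down, f_orig, f_up = (torch.randn(4, 512) for _ in range(3))
fused = aff([f_down, f_orig, f_up])
print(fused.shape)                         # torch.Size([4, 512])
```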
https://arxiv.org/abs/2404.15163
Understanding videos that contain multiple modalities is crucial, especially in egocentric videos, where combining various sensory inputs significantly improves tasks like action recognition and moment localization. However, real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. Current methods, while effective, often necessitate retraining the model entirely to handle missing modalities, making them computationally intensive, particularly with large training datasets. In this study, we propose a novel approach to address this issue at test time without requiring retraining. We frame the problem as a test-time adaptation task, where the model adjusts to the available unlabeled data at test time. Our method, MiDl (Mutual information with self-Distillation), encourages the model to be insensitive to the specific modality source present during testing by minimizing the mutual information between the prediction and the available modality. Additionally, we incorporate self-distillation to maintain the model's original performance when both modalities are available. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time. Through experiments with various pretrained models and datasets, MiDl demonstrates substantial performance improvement without the need for retraining.
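The stated objective can be sketched as follows: the mutual information between the prediction and the modality source, I(Y; M) = H(E_m[p(y|m)]) - E_m[H(p(y|m))], is minimized together with a self-distillation term toward the frozen original model. How modality views are formed and weighted goes beyond what the abstract states and is assumed here.

```python
# Sketch of a MiDl-style test-time loss: modality-invariance via mutual
# information plus KL self-distillation toward the frozen model's prediction.
import torch
import torch.nn.functional as F

def entropy(p, eps=1e-8):
    return -(p * (p + eps).log()).sum(dim=-1)

def midl_style_loss(logits_per_modality, logits_frozen, lam=1.0):
    # logits_per_modality: list of (B, C) predictions, one per modality view.
    probs = [F.softmax(l, dim=-1) for l in logits_per_modality]
    marginal = torch.stack(probs).mean(dim=0)                   # E_m[p(y|m)]
    mi = entropy(marginal).mean() - torch.stack(
        [entropy(p).mean() for p in probs]).mean()              # I(Y; M) >= 0
    distill = F.kl_div(F.log_softmax(logits_per_modality[0], dim=-1),
                       F.softmax(logits_frozen, dim=-1), reduction="batchmean")
    return mi + lam * distill

logits_audio_only = torch.randn(8, 10, requires_grad=True)
logits_video_only = torch.randn(8, 10, requires_grad=True)
loss = midl_style_loss([logits_audio_only, logits_video_only], torch.randn(8, 10))
loss.backward()
print(float(loss))
```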
https://arxiv.org/abs/2404.15161
Medical image analysis is a significant application of artificial intelligence for disease diagnosis. A crucial step in this process is the identification of regions of interest within the images. This task can be automated using object detection algorithms. YOLO and Faster R-CNN are renowned examples of such algorithms, each with its own strengths and weaknesses. This study explores the advantages of both techniques to select more accurate bounding boxes for gallbladder detection from ultrasound images, thereby enhancing gallbladder cancer classification, and presents a fusion method that leverages the benefits of both. The proposed method demonstrated superior classification performance, with an accuracy of 92.62%, compared to the individual use of Faster R-CNN and YOLOv8, which yielded accuracies of 90.16% and 82.79%, respectively.
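The abstract does not specify the fusion rule, so the sketch below shows one plausible instantiation only: average the two detectors' boxes, weighted by confidence, when they agree (high IoU), and otherwise keep the more confident box.

```python
# One possible (hypothetical) rule for fusing a YOLO box with a Faster R-CNN box.
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    score: float

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda c: (c.x2 - c.x1) * (c.y2 - c.y1)
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse(yolo: Box, frcnn: Box, iou_thr: float = 0.5) -> Box:
    if iou(yolo, frcnn) >= iou_thr:
        # Detectors agree: confidence-weighted average of the coordinates.
        w1, w2 = yolo.score, frcnn.score
        coords = [(getattr(yolo, f) * w1 + getattr(frcnn, f) * w2) / (w1 + w2)
                  for f in ("x1", "y1", "x2", "y2")]
        return Box(*coords, score=max(w1, w2))
    # Otherwise keep the more confident detector's box.
    return yolo if yolo.score >= frcnn.score else frcnn

print(fuse(Box(10, 10, 90, 80, 0.82), Box(12, 14, 94, 84, 0.91)))
```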
https://arxiv.org/abs/2404.15129
The rapid advancement of large-scale vision-language models has showcased remarkable capabilities across various tasks. However, the lack of extensive and high-quality image-text data in medicine has greatly hindered the development of large-scale medical vision-language models. In this work, we present a diagnosis-guided bootstrapping strategy that exploits both image and label information to construct vision-language datasets. Based on the constructed dataset, we developed MedDr, a generalist foundation model for healthcare capable of handling diverse medical data modalities, including radiology, pathology, dermatology, retinography, and endoscopy. Moreover, during inference, we propose a simple but effective retrieval-augmented medical diagnosis strategy, which enhances the model's generalization ability. Extensive experiments on visual question answering, medical report generation, and medical image diagnosis demonstrate the superiority of our method.
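A minimal sketch of retrieval-augmented diagnosis at inference time might look as follows: the query image embedding retrieves the most similar labeled cases, whose diagnoses are prepended to the model's prompt. The embedding bank, similarity measure, and prompt format are assumptions for illustration only.

```python
# Retrieve nearest labeled cases by cosine similarity and build an augmented prompt.
import numpy as np

def retrieve(query_emb, bank_embs, bank_labels, k=3):
    sims = bank_embs @ query_emb / (
        np.linalg.norm(bank_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9)
    top = np.argsort(-sims)[:k]
    return [bank_labels[i] for i in top]

rng = np.random.default_rng(0)
bank_embs = rng.normal(size=(1000, 512))                       # placeholder image embeddings
bank_labels = [f"case_{i}: melanoma" if i % 7 == 0 else f"case_{i}: benign nevus"
               for i in range(1000)]                           # placeholder diagnoses
query_emb = rng.normal(size=512)
context = retrieve(query_emb, bank_embs, bank_labels)
prompt = ("Similar reference cases: " + "; ".join(context)
          + "\nDescribe the lesion and give a diagnosis.")
print(prompt)
```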
https://arxiv.org/abs/2404.15127
Recent studies have demonstrated the exceptional potential of leveraging human preference datasets to refine text-to-image generative models, enhancing the alignment between generated images and textual prompts. Despite these advances, current human preference datasets are either prohibitively expensive to construct or suffer from a lack of diversity in preference dimensions, resulting in limited applicability for instruction tuning of open-source text-to-image generative models and hindering further exploration. To address these challenges and promote the alignment of generative models through instruction tuning, we leverage multimodal large language models to create VisionPrefer, a high-quality and fine-grained preference dataset that captures multiple preference aspects. We aggregate feedback from AI annotators across four aspects: prompt-following, aesthetics, fidelity, and harmlessness. To validate the effectiveness of VisionPrefer, we train a reward model, VP-Score, over VisionPrefer to guide the training of text-to-image generative models; the preference prediction accuracy of VP-Score is comparable to that of human annotators. Furthermore, we use two reinforcement learning methods to fine-tune generative models and evaluate the performance of VisionPrefer, and extensive experimental results demonstrate that VisionPrefer significantly improves text-image alignment in compositional image generation across diverse aspects, e.g., aesthetics, and generalizes better than previous human-preference metrics across various image distributions. Moreover, VisionPrefer indicates that integrating AI-generated synthetic data as a supervisory signal is a promising avenue for achieving improved alignment with human preferences in vision generative models.
https://arxiv.org/abs/2404.15100
Recording and identifying faint objects through atmospheric scattering media with an optical system is fundamentally interesting and technologically important. In this work, we introduce a comprehensive model that incorporates contributions from target characteristics, atmospheric effects, the imaging system, digital processing, and visual perception to assess the ultimate perceptible limit of geometrical imaging, specifically the angular resolution at the boundary of visible distance. The model allows us to reevaluate the effectiveness of conventional imaging recording, processing, and perception, and to analyze the limiting factors that constrain image recognition capabilities in atmospheric media. The simulations were compared with experimental results measured in a fog chamber and in outdoor settings. The results reveal generally good agreement between analysis and experiment, pointing the way to harnessing the physical limit of optical imaging in scattering media. An immediate application of the study is the extension of the imaging range by a factor of 1.2 through noise reduction via multi-frame averaging, hence greatly enhancing the capability of optical imaging in the atmosphere.
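The noise-reduction mechanism behind the reported range extension can be illustrated numerically: averaging N independent frames shrinks the noise standard deviation by a factor of sqrt(N), improving the SNR of a faint target. The signal and noise levels below are synthetic placeholders, not measurements from the paper.

```python
# Multi-frame averaging: SNR grows roughly as sqrt(N) for independent noise.
import numpy as np

rng = np.random.default_rng(1)
signal = 5.0                                   # faint target contrast (arbitrary units)
sigma = 20.0                                   # per-frame noise standard deviation
for n_frames in (1, 4, 16, 64):
    frames = signal + rng.normal(0.0, sigma, size=(n_frames, 10000))
    averaged = frames.mean(axis=0)
    snr = signal / averaged.std()
    print(f"N = {n_frames:3d}  SNR ≈ {snr:5.2f}  "
          f"(theory: {signal / sigma * np.sqrt(n_frames):5.2f})")
```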
https://arxiv.org/abs/2404.15082
Diffusion models (DMs) usher in a new era of generative modeling and offer more opportunities for efficiently generating high-quality and realistic data samples. However, their widespread use has also brought forth new challenges in model security, which motivates the creation of more effective adversarial attackers on DMs to understand their vulnerability. We propose CAAT, a simple but generic and efficient approach that does not require costly training to effectively fool latent diffusion models (LDMs). The approach is based on the observation that cross-attention layers exhibit higher sensitivity to gradient change, allowing subtle perturbations on published images to significantly corrupt the generated images. We show that a subtle perturbation of an image can significantly impact the cross-attention layers, thus changing the mapping between text and image during the fine-tuning of customized diffusion models. Extensive experiments demonstrate that CAAT is compatible with diverse diffusion models and outperforms baseline attack methods in a more effective (more noise) and efficient (twice as fast as Anti-DreamBooth and Mist) manner.
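A schematic PGD-style sketch of the idea is given below: a bounded perturbation is crafted to push the features that feed the cross-attention layers away from those of the clean image. The placeholder feature extractor, step sizes, and loss stand in for the actual LDM components and the CAAT objective, which are only described at a high level in the abstract.

```python
# PGD-style perturbation that maximizes feature drift under an L_inf budget.
import torch

def craft_perturbation(image, feature_fn, eps=8 / 255, steps=40, alpha=1 / 255):
    clean_feat = feature_fn(image).detach()
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = (feature_fn(image + delta) - clean_feat).pow(2).mean()  # push features away
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()                # gradient *ascent* step
            delta.clamp_(-eps, eps)                           # L_inf budget
            delta.copy_((image + delta).clamp(0, 1) - image)  # keep a valid image
        delta.grad.zero_()
    return (image + delta).detach()

# Stand-in feature extractor; in practice this would expose the LDM features
# consumed by the cross-attention layers during customization/fine-tuning.
feature_fn = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3, padding=1), torch.nn.SiLU())
protected = craft_perturbation(torch.rand(1, 3, 64, 64), feature_fn)
print(protected.shape)
```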
https://arxiv.org/abs/2404.15081