Human matting is a foundational task in image and video processing, in which human foreground pixels are extracted from the input. Prior works either improve accuracy using additional guidance or improve the temporal consistency of a single instance across frames. We propose a new framework, MaGGIe (Masked Guided Gradual Human Instance Matting), which predicts alpha mattes progressively for each human instance while maintaining computational cost, precision, and consistency. Our method leverages modern architectures, including transformer attention and sparse convolution, to output all instance mattes simultaneously without exploding memory and latency. While keeping inference costs constant in the multiple-instance scenario, our framework achieves robust and versatile performance on our proposed synthesized benchmarks. In addition to higher-quality image and video matting benchmarks, we introduce a novel multi-instance synthesis approach built from publicly available sources to increase the generalization of models to real-world scenarios.
https://arxiv.org/abs/2404.16035
With the advent of large language models (LLMs) enhanced by the chain-of-thought (CoT) methodology, visual reasoning problems are usually decomposed into manageable sub-tasks and tackled sequentially with various external tools. However, such a paradigm faces the challenge of potential "determining hallucinations" in decision-making due to insufficient visual information, and the limitation of low-level perception tools that fail to provide the abstract summaries necessary for comprehensive reasoning. We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks. This paper delves into the realm of multimodal CoT to solve intricate visual reasoning tasks with multimodal large language models (MLLMs) and their cognitive capability. To this end, we propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture. Cantor first acts as a decision generator and integrates visual inputs to analyze the image and problem, ensuring a closer alignment with the actual context. Furthermore, Cantor leverages the advanced cognitive functions of MLLMs to perform as multifaceted experts for deriving higher-level information, enhancing the CoT generation process. Our extensive experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance across two complex visual reasoning datasets, without necessitating fine-tuning or ground-truth rationales. Project Page: this https URL.
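To make the perception-decision flow concrete, here is a minimal schematic sketch of how such a pipeline could be wired. The `mllm()` call is a stand-in for any multimodal LLM client, and the prompts and expert roles are illustrative assumptions, not Cantor's actual prompts.

```python
def mllm(prompt: str, image=None) -> str:
    """Placeholder: replace with a call to any multimodal LLM API."""
    return "[MLLM response to: " + prompt.splitlines()[0] + "]"

def cantor_style_answer(image, question):
    # 1) Decision generation: the MLLM sees the image and question together and plans sub-tasks.
    plan = mllm(f"Analyze the image and the question: {question}\n"
                f"Decompose the reasoning into sub-tasks and assign each to an expert "
                f"(e.g., object counter, text reader, spatial reasoner).", image)
    # 2) Execution: the same MLLM plays each expert role to extract higher-level information.
    findings = mllm(f"Act as the experts assigned in this plan and report their findings:\n{plan}", image)
    # 3) Synthesis: combine the findings into a final chain-of-thought answer.
    return mllm(f"Question: {question}\nExpert findings:\n{findings}\n"
                f"Reason step by step and give the final answer.", image)

print(cantor_style_answer(None, "How many people are wearing hats?"))
```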
https://arxiv.org/abs/2404.16033
The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, making it less sensitive to false-negative noise in other clusters. At inference time, we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should still be reasonable for training and inference. As such, we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less ($<$35\%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at this https URL.
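A minimal sketch of the inference-time ensembling the abstract describes, assuming each expert produces zero-shot logits and that an embedding of the task metadata (e.g., class names) is compared against the cluster centers. Function and variable names are illustrative, not the released MoDE code.

```python
import torch
import torch.nn.functional as F

def ensemble_logits(expert_logits, cluster_centers, task_embedding, temperature=0.1):
    """expert_logits: (E, N, C) zero-shot logits from E data experts for N images and C classes.
    cluster_centers: (E, D) center of the data cluster each expert was trained on.
    task_embedding: (D,) embedding summarizing the downstream task metadata."""
    sims = F.cosine_similarity(cluster_centers, task_embedding.unsqueeze(0), dim=-1)  # (E,)
    weights = F.softmax(sims / temperature, dim=0)                                    # (E,)
    return torch.einsum("e,enc->nc", weights, expert_logits)                          # (N, C)

# toy usage with random tensors
E, N, C, D = 4, 2, 10, 512
ensembled = ensemble_logits(torch.randn(E, N, C), torch.randn(E, D), torch.randn(D))
print(ensembled.shape)  # torch.Size([2, 10])
```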
https://arxiv.org/abs/2404.16030
Diffusion models have made significant advances in text-guided synthesis tasks. However, editing user-provided images remains challenging, as the high dimensional noise input space of diffusion models is not naturally suited for image inversion or spatial editing. In this work, we propose an image representation that promotes spatial editing of input images using a diffusion model. Concretely, we learn to encode an input into "image elements" that can faithfully reconstruct an input image. These elements can be intuitively edited by a user, and are decoded by a diffusion model into realistic images. We show the effectiveness of our representation on various image editing tasks, such as object resizing, rearrangement, dragging, de-occlusion, removal, variation, and image composition. Project page: this https URL
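As a hedged illustration of what an editable "image element" might look like, the sketch below assumes each element carries spatial parameters a user can manipulate directly plus an appearance embedding consumed by the diffusion decoder; the field names and edit operations are assumptions, not the paper's actual representation.

```python
from dataclasses import dataclass, replace
import torch

@dataclass
class ImageElement:
    center_xy: tuple          # normalized (x, y) position in the image
    size: float               # normalized element size
    appearance: torch.Tensor  # learned appearance embedding consumed by the diffusion decoder

def move_and_resize(elem: ImageElement, dx=0.0, dy=0.0, scale=1.0) -> ImageElement:
    """Spatial edits operate on element parameters; the diffusion decoder then renders the result."""
    x, y = elem.center_xy
    return replace(elem, center_xy=(x + dx, y + dy), size=elem.size * scale)

elem = ImageElement(center_xy=(0.4, 0.6), size=0.2, appearance=torch.randn(128))
edited = move_and_resize(elem, dx=0.1, scale=1.5)  # e.g., drag an object right and enlarge it
```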
https://arxiv.org/abs/2404.16029
Physics-based simulations have accelerated progress in robot learning for driving, manipulation, and locomotion. Yet, a fast, accurate, and robust surgical simulation environment remains a challenge. In this paper, we present ORBIT-Surgical, a physics-based surgical robot simulation framework with photorealistic rendering in NVIDIA Omniverse. We provide 14 benchmark surgical tasks for the da Vinci Research Kit (dVRK) and Smart Tissue Autonomous Robot (STAR) which represent common subtasks in surgical training. ORBIT-Surgical leverages GPU parallelization to train reinforcement learning and imitation learning algorithms to facilitate study of robot learning to augment human surgical skills. ORBIT-Surgical also facilitates realistic synthetic data generation for active perception tasks. We demonstrate ORBIT-Surgical sim-to-real transfer of learned policies onto a physical dVRK robot. Project website: this http URL
https://arxiv.org/abs/2404.16027
We propose Pure and Lightning ID customization (PuLID), a novel tuning-free ID customization method for text-to-image generation. By incorporating a Lightning T2I branch with a standard diffusion one, PuLID introduces both contrastive alignment loss and accurate ID loss, minimizing disruption to the original model and ensuring high ID fidelity. Experiments show that PuLID achieves superior performance in both ID fidelity and editability. Another attractive property of PuLID is that the image elements (e.g., background, lighting, composition, and style) before and after the ID insertion are kept as consistent as possible. Codes and models will be available at this https URL
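A rough, simplified sketch of how the two training signals could be combined. Here the contrastive alignment loss is reduced to a plain feature-alignment term and the ID loss to a cosine distance between face embeddings, so this illustrates the idea rather than PuLID's actual objective; all names are assumptions.

```python
import torch
import torch.nn.functional as F

def pulid_style_loss(feats_with_id, feats_without_id, face_emb_gen, face_emb_ref,
                     lambda_align=1.0, lambda_id=1.0):
    """feats_*: features of the Lightning-branch generation with / without the ID condition;
    face_emb_gen / face_emb_ref: face embeddings of the generated and reference images."""
    align = F.mse_loss(feats_with_id, feats_without_id)        # limit disruption to the base model
    id_term = 1.0 - F.cosine_similarity(face_emb_gen, face_emb_ref, dim=-1).mean()  # ID fidelity
    return lambda_align * align + lambda_id * id_term

loss = pulid_style_loss(torch.randn(2, 77, 768), torch.randn(2, 77, 768),
                        torch.randn(2, 512), torch.randn(2, 512))
```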
https://arxiv.org/abs/2404.16022
Recent work has developed optimization procedures to find token sequences, called adversarial triggers, which can elicit unsafe responses from aligned language models. These triggers are believed to be universally transferable, i.e., a trigger optimized on one model can jailbreak other models. In this paper, we concretely show that such adversarial triggers are not universal. We extensively investigate trigger transfer amongst 13 open models and observe inconsistent transfer. Our experiments further reveal a significant difference in robustness to adversarial triggers between models Aligned by Preference Optimization (APO) and models Aligned by Fine-Tuning (AFT). We find that APO models are extremely hard to jailbreak even when the trigger is optimized directly on the model. On the other hand, while AFT models may appear safe on the surface, exhibiting refusals to a range of unsafe instructions, we show that they are highly susceptible to adversarial triggers. Lastly, we observe that most triggers optimized on AFT models also generalize to new unsafe instructions from five diverse domains, further emphasizing their vulnerability. Overall, our work highlights the need for more comprehensive safety evaluations for aligned language models.
https://arxiv.org/abs/2404.16020
Human feedback plays a central role in the alignment of Large Language Models (LLMs). However, open questions remain about the methods (how), domains (where), people (who) and objectives (to what end) of human feedback collection. To navigate these questions, we introduce PRISM, a new dataset which maps the sociodemographics and stated preferences of 1,500 diverse participants from 75 countries, to their contextual preferences and fine-grained feedback in 8,011 live conversations with 21 LLMs. PRISM contributes (i) wide geographic and demographic participation in human feedback data; (ii) two census-representative samples for understanding collective welfare (UK and US); and (iii) individualised feedback where every rating is linked to a detailed participant profile, thus permitting exploration of personalisation and attribution of sample artefacts. We focus on collecting conversations that centre subjective and multicultural perspectives on value-laden and controversial topics, where we expect the most interpersonal and cross-cultural disagreement. We demonstrate the usefulness of PRISM via three case studies of dialogue diversity, preference diversity, and welfare outcomes, showing that it matters which humans set alignment norms. As well as offering a rich community resource, we advocate for broader participation in AI development and a more inclusive approach to technology design.
https://arxiv.org/abs/2404.16019
We introduce the RetinaRegNet model, which can achieve state-of-the-art performance across various retinal image registration tasks. RetinaRegNet does not require training on any retinal images. It begins by establishing point correspondences between two retinal images using image features derived from diffusion models. This process involves the selection of feature points from the moving image using the SIFT algorithm alongside random point sampling. For each selected feature point, a 2D correlation map is computed by assessing the similarity between the feature vector at that point and the feature vectors of all pixels in the fixed image. The pixel with the highest similarity score in the correlation map corresponds to the feature point in the moving image. To remove outliers in the estimated point correspondences, we first applied an inverse consistency constraint, followed by a transformation-based outlier detector. This method proved to outperform the widely used random sample consensus (RANSAC) outlier detector by a significant margin. To handle large deformations, we utilized a two-stage image registration framework. A homography transformation was used in the first stage and a more accurate third-order polynomial transformation was used in the second stage. The model's effectiveness was demonstrated across three retinal image datasets: color fundus images, fluorescein angiography images, and laser speckle flowgraphy images. RetinaRegNet outperformed current state-of-the-art methods in all three datasets. It was especially effective for registering image pairs with large displacement and scaling deformations. This innovation holds promise for various applications in retinal image analysis. Our code is publicly available at this https URL.
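A minimal sketch (not the released RetinaRegNet code) of the matching step described above: a 2D correlation map between one moving-image feature vector and all fixed-image pixel features, an argmax correspondence, and an inverse-consistency check to drop outliers. Cosine similarity and the tolerance value are assumptions.

```python
import torch
import torch.nn.functional as F

def best_match(feat_vec, target_feats):
    """feat_vec: (D,) feature at one point; target_feats: (D, H, W) dense features of the other image."""
    corr = F.cosine_similarity(target_feats, feat_vec[:, None, None], dim=0)  # (H, W) correlation map
    idx = torch.argmax(corr).item()
    return divmod(idx, corr.shape[1])  # (row, col) of the highest-similarity pixel

def inverse_consistent_match(pt, moving_feats, fixed_feats, tol=2):
    """Keep a correspondence only if matching back from the fixed image lands near the start point."""
    fwd = best_match(moving_feats[:, pt[0], pt[1]], fixed_feats)
    bwd = best_match(fixed_feats[:, fwd[0], fwd[1]], moving_feats)
    consistent = abs(bwd[0] - pt[0]) + abs(bwd[1] - pt[1]) <= tol
    return fwd, consistent

moving, fixed = torch.randn(64, 32, 32), torch.randn(64, 32, 32)  # toy diffusion-derived features
match, ok = inverse_consistent_match((10, 12), moving, fixed)
```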
https://arxiv.org/abs/2404.16017
We propose GaussianTalker, a novel framework for real-time generation of pose-controllable talking heads. It leverages the fast rendering capabilities of 3D Gaussian Splatting (3DGS) while addressing the challenges of directly controlling 3DGS with speech audio. GaussianTalker constructs a canonical 3DGS representation of the head and deforms it in sync with the audio. A key insight is to encode the 3D Gaussian attributes into a shared implicit feature representation, where it is merged with audio features to manipulate each Gaussian attribute. This design exploits the spatial-aware features and enforces interactions between neighboring points. The feature embeddings are then fed to a spatial-audio attention module, which predicts frame-wise offsets for the attributes of each Gaussian. It is more stable than previous concatenation or multiplication approaches for manipulating the numerous Gaussians and their intricate parameters. Experimental results showcase GaussianTalker's superiority in facial fidelity, lip synchronization accuracy, and rendering speed compared to previous methods. Specifically, GaussianTalker achieves a remarkable rendering speed of 120 FPS, surpassing previous benchmarks. Our code is made available at this https URL .
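A rough sketch of the spatial-audio attention idea: per-Gaussian implicit features attend to audio features, and an MLP head predicts frame-wise offsets for the Gaussian attributes (position, scale, rotation, opacity). Module names and dimensions are illustrative assumptions, not GaussianTalker's implementation.

```python
import torch
import torch.nn as nn

class SpatialAudioAttention(nn.Module):
    def __init__(self, feat_dim=64, audio_dim=64, out_dim=3 + 3 + 4 + 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, kdim=audio_dim,
                                          vdim=audio_dim, batch_first=True)
        self.head = nn.Linear(feat_dim, out_dim)  # offsets for xyz, scale, rotation, opacity

    def forward(self, gaussian_feats, audio_feats):
        """gaussian_feats: (B, N, feat_dim) shared implicit features of N Gaussians;
        audio_feats: (B, T, audio_dim) audio features for the current frame window."""
        fused, _ = self.attn(gaussian_feats, audio_feats, audio_feats)
        return self.head(fused)                   # (B, N, out_dim) per-Gaussian frame-wise offsets

offsets = SpatialAudioAttention()(torch.randn(1, 1000, 64), torch.randn(1, 8, 64))
```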
https://arxiv.org/abs/2404.16012
The livestock industry faces several challenges, including labor-intensive management, the threat of predators and environmental sustainability concerns. Therefore, this paper explores the integration of quadruped robots in extensive livestock farming as a novel application of field robotics. The SELF-AIR project, an acronym for Supporting Extensive Livestock Farming with the use of Autonomous Intelligent Robots, exemplifies this innovative approach. Through advanced sensors, artificial intelligence, and autonomous navigation systems, these robots exhibit remarkable capabilities in navigating diverse terrains, monitoring large herds, and aiding in various farming tasks. This work provides insight into the SELF-AIR project, presenting the lessons learned.
https://arxiv.org/abs/2404.16008
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises $31,325$ meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering $32$ core meta-tasks and $162$ subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving $30$ LVLMs such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
https://arxiv.org/abs/2404.16006
While the field of medical image analysis has undergone a transformative shift with the integration of machine learning techniques, the main challenge of these techniques is often the scarcity of large, diverse, and well-annotated datasets. Medical images vary in format, size, and other parameters and therefore require extensive preprocessing and standardization for use in machine learning. Addressing these challenges, we introduce the Medical Imaging Meta-Dataset (MedIMeta), a novel multi-domain, multi-task meta-dataset. MedIMeta contains 19 medical imaging datasets spanning 10 different domains and encompassing 54 distinct medical tasks, all of which are standardized to the same format and readily usable in PyTorch or other ML frameworks. We perform a technical validation of MedIMeta, demonstrating its utility through fully supervised and cross-domain few-shot learning baselines.
https://arxiv.org/abs/2404.16000
Large language models (LLMs) are highly capable across many tasks, but they can sometimes generate unreliable or inaccurate outputs. To tackle this issue, this paper studies the problem of uncertainty estimation and calibration for LLMs. We begin by formulating the uncertainty estimation problem for LLMs and then propose a supervised approach that takes advantage of labeled datasets to estimate the uncertainty of the LLMs' responses. Based on the formulation, we illustrate the difference between uncertainty estimation for LLMs and that for standard ML models and explain why the hidden activations of the LLMs contain uncertainty information. Our designed approach demonstrates the benefits of utilizing hidden activations for enhanced uncertainty estimation across various tasks and shows robust transferability in out-of-distribution settings. Moreover, we distinguish the uncertainty estimation task from the uncertainty calibration task and show that a better uncertainty estimation model leads to better calibration performance. In practice, our method is easy to implement and is adaptable to different levels of model transparency including black box, grey box, and white box, each demonstrating strong performance based on the accessibility of the LLM's internal mechanisms.
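A minimal sketch of the supervised white/grey-box setting described above: a small probe is trained on hidden activations of the LLM, with labels indicating whether the model's answer was correct, and its output is read as an uncertainty score. The probe architecture and all names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class UncertaintyProbe(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, activations):  # activations: (N, hidden_dim) from a chosen LLM layer
        return torch.sigmoid(self.net(activations)).squeeze(-1)  # P(correct); 1 - p is the uncertainty

probe = UncertaintyProbe(hidden_dim=4096)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
acts = torch.randn(64, 4096)                 # placeholder hidden activations from the LLM
labels = torch.randint(0, 2, (64,)).float()  # placeholder correctness labels for each response
for _ in range(10):
    opt.zero_grad()
    loss = loss_fn(probe(acts), labels)
    loss.backward()
    opt.step()
```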
https://arxiv.org/abs/2404.15993
Infrared and visible image fusion (IVIF) aims to preserve thermal radiation information from infrared images while integrating texture details from visible images, enabling the capture of important features and hidden details of subjects in complex scenes and disturbed environments. Consequently, IVIF offers distinct advantages in practical applications such as video surveillance, night navigation, and target recognition. However, prevailing methods often face challenges in simultaneously capturing thermal region features and detailed information due to the disparate characteristics of infrared and visible images. Consequently, fusion outcomes frequently entail a compromise between thermal target area information and texture details. In this study, we introduce a novel heterogeneous dual-discriminator generative adversarial network (HDDGAN) to address this issue. Specifically, the generator is structured as a multi-scale skip-connected structure, facilitating the extraction of essential features from different source images. To enhance the information representation ability of the fusion result, an attention mechanism is employed to construct the information fusion layer within the generator, leveraging the disparities between the source images. Moreover, recognizing the distinct learning requirements of information in infrared and visible images, we design two discriminators with differing structures. This approach aims to guide the model to learn salient information from infrared images while simultaneously capturing detailed information from visible images. Extensive experiments conducted on various public datasets demonstrate the superiority of our proposed HDDGAN over other state-of-the-art (SOTA) algorithms, highlighting its enhanced potential for practical applications.
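A simplified sketch of the heterogeneous dual-discriminator objective: the generator must fool one discriminator that judges infrared-style salient content and another that judges visible-style detail. The toy discriminator architecture and loss form here are assumptions, not the paper's implementation (the paper's two discriminators differ in structure).

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def make_discriminator():
    # toy patch discriminator for illustration only
    return nn.Sequential(nn.Conv2d(1, 16, 3, 2, 1), nn.LeakyReLU(0.2),
                         nn.Conv2d(16, 1, 3, 2, 1))

def generator_adv_loss(fused, d_infrared, d_visible):
    """The generator tries to fool both discriminators at once."""
    return (bce(d_infrared(fused), torch.ones_like(d_infrared(fused))) +
            bce(d_visible(fused), torch.ones_like(d_visible(fused))))

def discriminator_loss(d, real_images, fused):
    fake_logits, real_logits = d(fused.detach()), d(real_images)
    return (bce(fake_logits, torch.zeros_like(fake_logits)) +
            bce(real_logits, torch.ones_like(real_logits)))

d_ir, d_vis = make_discriminator(), make_discriminator()
fused, ir, vis = torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64)
g_loss = generator_adv_loss(fused, d_ir, d_vis)
d_loss = discriminator_loss(d_ir, ir, fused) + discriminator_loss(d_vis, vis, fused)
```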
https://arxiv.org/abs/2404.15992
Analyzing volumetric data with rotational invariance or equivariance is an active topic in current research. Existing deep-learning approaches utilize either group convolutional networks limited to discrete rotations or steerable convolutional networks with constrained filter structures. This work proposes a novel equivariant neural network architecture that achieves analytical Equivariance to Local Pattern Orientation on the continuous SO(3) group while allowing unconstrained trainable filters: the EquiLoPO Network. Our key innovations are a group convolutional operation leveraging irreducible representations as the Fourier basis and a local activation function in the SO(3) space that provides a well-defined mapping from input to output functions, preserving equivariance. By integrating these operations into a ResNet-style architecture, we propose a model that overcomes the limitations of prior methods. A comprehensive evaluation on diverse 3D medical imaging datasets from MedMNIST3D demonstrates the effectiveness of our approach, which consistently outperforms the state of the art. This work suggests the benefits of true rotational equivariance on SO(3) and of the flexible unconstrained filters enabled by the local activation function, providing a flexible framework for equivariant deep learning on volumetric data with potential applications across domains. Our code is publicly available at \url{this https URL}.
https://arxiv.org/abs/2404.15979
State space models (SSMs) with selection mechanisms and hardware-aware architectures, namely Mamba, have recently demonstrated significant promise in long-sequence modeling. Since the self-attention mechanism in transformers has quadratic complexity in image size and increasingly heavy computational demands, researchers are now exploring how to adapt Mamba for computer vision tasks. This paper is the first comprehensive survey aiming to provide an in-depth analysis of Mamba models in the field of computer vision. It begins by exploring the foundational concepts contributing to Mamba's success, including the state space model framework, selection mechanisms, and hardware-aware design. Next, we review these vision Mamba models by categorizing them into foundational ones and those enhanced with techniques such as convolution, recurrence, and attention to improve their sophistication. We further delve into the widespread applications of Mamba in vision tasks, which include their use as a backbone at various levels of vision processing. This encompasses general visual tasks, medical visual tasks (e.g., 2D/3D segmentation, classification, and image registration), and remote sensing visual tasks. In particular, we introduce general visual tasks at two levels: high/mid-level vision (e.g., object detection, segmentation, video classification) and low-level vision (e.g., image super-resolution, image restoration, visual generation). We hope this endeavor will spark additional interest within the community to address current challenges and further apply Mamba models in computer vision.
https://arxiv.org/abs/2404.15956
Recent advances in generative AI have led to the development of techniques to generate visually realistic synthetic video. While a number of techniques have been developed to detect AI-generated synthetic images, in this paper we show that synthetic image detectors are unable to detect synthetic videos. We demonstrate that this is because synthetic video generators introduce substantially different traces than those left by image generators. Despite this, we show that synthetic video traces can be learned, and used to perform reliable synthetic video detection or generator source attribution even after H.264 re-compression. Furthermore, we demonstrate that while detecting videos from new generators through zero-shot transferability is challenging, accurate detection of videos from a new generator can be achieved through few-shot learning.
https://arxiv.org/abs/2404.15955
Large Language Models (LLMs), despite their impressive performance on a wide range of tasks, require significant GPU memory and consume substantial computational resources. In addition to model weights, the memory occupied by KV cache increases linearly with sequence length, becoming a main bottleneck for inference. In this paper, we introduce a novel approach for optimizing the KV cache which significantly reduces its memory footprint. Through a comprehensive investigation, we find that on LLaMA2 series models, (i) the similarity between adjacent tokens' query vectors is remarkably high, and (ii) current query's attention calculation can rely solely on the attention information of a small portion of the preceding queries. Based on these observations, we propose CORM, a KV cache eviction policy that dynamically retains important key-value pairs for inference without finetuning the model. We validate that CORM reduces the inference memory usage of KV cache by up to 70% without noticeable performance degradation across six tasks in LongBench.
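A simplified sketch of the kind of KV-cache eviction policy described above: importance of each cached position is scored from the attention a small window of recent queries paid to it, and unimportant key-value pairs are dropped without finetuning the model. The threshold, window size, and scoring rule are assumptions rather than CORM's exact policy.

```python
import torch

def evict_kv(keys, values, recent_attn, keep_recent=8, threshold=0.02):
    """keys, values: (T, d) cached tensors; recent_attn: (W, T) attention weights that the
    last W queries assigned to the T cached positions. Returns the compressed cache."""
    important = (recent_attn > threshold).any(dim=0)  # (T,) positions attended to recently
    important[-keep_recent:] = True                   # always keep the newest tokens
    idx = torch.nonzero(important, as_tuple=False).squeeze(-1)
    return keys[idx], values[idx]

# toy usage
T, W, d = 32, 4, 16
k, v = torch.randn(T, d), torch.randn(T, d)
attn = torch.softmax(torch.randn(W, T), dim=-1)
k2, v2 = evict_kv(k, v, attn)
print(k2.shape[0], "of", T, "cache entries kept")
```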
https://arxiv.org/abs/2404.15949
Although fusion of information from multiple views of mammograms plays an important role in increasing the accuracy of breast cancer detection, developing multi-view mammogram-based computer-aided diagnosis (CAD) schemes still faces challenges, and no such CAD schemes have been used in clinical practice. To overcome the challenges, we investigate a new approach based on Contrastive Language-Image Pre-training (CLIP), which has sparked interest across various medical imaging tasks. By solving the challenges of (1) effectively adapting the single-view CLIP for multi-view feature fusion and (2) efficiently fine-tuning this parameter-dense model with limited samples and computational resources, we introduce Mammo-CLIP, the first multi-modal framework to process multi-view mammograms and corresponding simple texts. Mammo-CLIP uses an early feature fusion strategy to learn multi-view relationships in four mammograms acquired from the CC and MLO views of the left and right breasts. To enhance learning efficiency, plug-and-play adapters are added into the CLIP image and text encoders for fine-tuning, limiting updates to about 1% of the parameters. For framework evaluation, we retrospectively assembled two datasets. The first dataset, comprising 470 malignant and 479 benign cases, was used for few-shot fine-tuning and internal evaluation of the proposed Mammo-CLIP via 5-fold cross-validation. The second dataset, including 60 malignant and 294 benign cases, was used to test the generalizability of Mammo-CLIP. Study results show that Mammo-CLIP outperforms the state-of-the-art cross-view transformer in AUC (0.841 vs. 0.817, 0.837 vs. 0.807) on both datasets. It also surpasses two previous CLIP-based methods by 20.3% and 14.3%. This study highlights the potential of applying finetuned vision-language models for developing next-generation, image-text-based CAD schemes for breast cancer.
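An illustrative sketch of a plug-and-play bottleneck adapter of the kind the abstract describes inserting into frozen CLIP encoders so that only about 1% of parameters are updated. The dimensions, placement, and early-fusion token layout are assumptions, not the released Mammo-CLIP code.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual bottleneck update

# freeze a (stand-in) backbone block and train only the adapter
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = Adapter(768)
x = torch.randn(2, 4 * 196, 768)  # early fusion: tokens from the four CC/MLO views concatenated
out = adapter(backbone(x))
trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in backbone.parameters())
print(f"trainable share: {trainable / total:.1%}")
```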
https://arxiv.org/abs/2404.15946