Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage large language models (LLMs) as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in long-tail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.
https://arxiv.org/abs/2501.09757
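As a rough illustration of the distillation idea in the DiMA abstract above, the sketch below combines a trajectory-regression loss with a feature-alignment term that pulls the shared scene-encoder tokens toward (frozen) LLM hidden states. The shapes, module names, and the simple cosine alignment are assumptions for the example; DiMA's actual surrogate tasks and joint-training setup are more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical shapes: B scenes, N scene tokens of dim D from a shared encoder,
# T future waypoints (x, y) to predict. The real DiMA surrogate tasks differ.
B, N, D, T = 4, 32, 256, 6

scene_encoder = nn.Sequential(nn.Linear(512, D), nn.ReLU(), nn.Linear(D, D))
planner_head = nn.Linear(N * D, T * 2)   # vision-based planning head
align_proj = nn.Linear(D, 1024)          # project scene tokens to the LLM hidden size

def dima_style_loss(raw_feats, gt_traj, llm_hidden, lam=0.5):
    """Joint objective: L2 trajectory regression + distillation alignment."""
    z = scene_encoder(raw_feats)                         # (B, N, D) shared representation
    traj = planner_head(z.flatten(1)).view(B, T, 2)      # predicted waypoints
    plan_loss = F.mse_loss(traj, gt_traj)                # planning objective
    # Distill: align scene tokens with (frozen) multimodal-LLM hidden states.
    distill_loss = 1 - F.cosine_similarity(align_proj(z), llm_hidden, dim=-1).mean()
    return plan_loss + lam * distill_loss

raw_feats = torch.randn(B, N, 512)
gt_traj = torch.randn(B, T, 2)
llm_hidden = torch.randn(B, N, 1024)   # stand-in for LLM features, frozen in practice
loss = dima_style_loss(raw_feats, gt_traj, llm_hidden)
loss.backward()
```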
We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, where pixels are transformed in response to changes in environmental lighting conditions. Using a physically-based rendering engine, we synthesize a dataset to simulate this lighting-conditioned transformation with 3D head assets under varying lighting. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs and produces realistic illumination effects, including specular highlights and cast shadows, while preserving the subject's identity. Our quantitative experiments on Light Stage data demonstrate results comparable to state-of-the-art relighting methods. Our qualitative results on in-the-wild images showcase rich and unprecedented illumination effects. Project Page: \url{this https URL}
https://arxiv.org/abs/2501.09756
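A minimal sketch of a classifier-free-guidance step that leverages the input portrait alongside the lighting condition, in the spirit of the inference-time procedure described above. The `eps_model` signature, the two-term guidance split, and the weights are placeholders, not SynthLight's actual sampler.

```python
import torch

def cfg_relight_step(eps_model, x_t, t, portrait, env_light, w_light=3.0, w_img=1.5):
    """One guided noise prediction in the spirit of classifier-free guidance.
    eps_model(x_t, t, portrait=None, light=None) -> predicted noise (placeholder signature)."""
    eps_uncond = eps_model(x_t, t, portrait=None, light=None)           # drop all conditions
    eps_img    = eps_model(x_t, t, portrait=portrait, light=None)       # portrait only
    eps_full   = eps_model(x_t, t, portrait=portrait, light=env_light)  # portrait + lighting
    # w_img pulls toward the identity/details of the input portrait,
    # w_light pushes toward the target illumination.
    return (eps_uncond
            + w_img * (eps_img - eps_uncond)
            + w_light * (eps_full - eps_img))

# Toy usage with a dummy model so the snippet runs end to end.
dummy = lambda x, t, portrait=None, light=None: torch.zeros_like(x)
x_t = torch.randn(1, 3, 64, 64)
eps = cfg_relight_step(dummy, x_t, t=torch.tensor([500]),
                       portrait=x_t, env_light=torch.randn(1, 16))
```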
Our objective is to translate continuous sign language into spoken language text. Inspired by the way human interpreters rely on context for accurate translation, we incorporate additional contextual cues together with the signing video, into a new translation framework. Specifically, besides visual sign recognition features that encode the input video, we integrate complementary textual information from (i) captions describing the background show, (ii) translation of previous sentences, as well as (iii) pseudo-glosses transcribing the signing. These are automatically extracted and inputted along with the visual features to a pre-trained large language model (LLM), which we fine-tune to generate spoken language translations in text form. Through extensive ablation studies, we show the positive contribution of each input cue to the translation performance. We train and evaluate our approach on BOBSL -- the largest British Sign Language dataset currently available. We show that our contextual approach significantly enhances the quality of the translations compared to previously reported results on BOBSL, and also to state-of-the-art methods that we implement as baselines. Furthermore, we demonstrate the generality of our approach by applying it also to How2Sign, an American Sign Language dataset, and achieve competitive results.
https://arxiv.org/abs/2501.09754
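A toy sketch of how the three textual cues could be packed into a single prompt that accompanies the visual sign features fed to the fine-tuned LLM. The template and field names are invented for illustration; the paper's actual input format is not reproduced here.

```python
def build_context_prompt(background_caption, previous_translation, pseudo_glosses):
    """Assemble the textual side of the input; the visual sign-recognition features are
    supplied separately as embeddings to the (fine-tuned) LLM."""
    return (
        "Background: " + background_caption + "\n"
        "Previous sentence: " + previous_translation + "\n"
        "Pseudo-glosses: " + " ".join(pseudo_glosses) + "\n"
        "Translate the signing in the video into English:"
    )

prompt = build_context_prompt(
    background_caption="A cooking show filmed in a studio kitchen.",
    previous_translation="First, we chop the onions.",
    pseudo_glosses=["NOW", "ADD", "BUTTER", "PAN"],
)
print(prompt)
```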
Convolutional neural networks (CNNs) are essential tools for computer vision tasks, but they lack traditionally desired properties of extracted features that could further improve model performance, e.g., rotational equivariance. Such properties are ubiquitous in biomedical images, which often lack explicit orientation. While current work largely relies on data augmentation or explicit modules to capture orientation information, this comes at the expense of increased training costs or ineffective approximations of the desired equivariance. To overcome these challenges, we propose a novel and efficient implementation of the Symmetric Rotation-Equivariant (SRE) Convolution (SRE-Conv) kernel, designed to learn rotation-invariant features while simultaneously compressing the model size. The SRE-Conv kernel can easily be incorporated into any CNN backbone. We validate the ability of a deep SRE-CNN to capture equivariance to rotation using the public MedMNISTv2 dataset (16 total tasks). SRE-Conv-CNN improves classification accuracy on rotated images across all 16 test datasets, covering both 2D and 3D images, while also improving efficiency with fewer parameters and a reduced memory footprint. The code is available at this https URL.
https://arxiv.org/abs/2501.09753
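One simple way to obtain a rotation-symmetric convolution, sketched below under the assumption that weights are shared across all kernel positions in the same radius bin: each filter then depends only on distance from the kernel center, which also compresses the parameter count from k*k to the number of radius bins. This illustrates the general idea rather than the paper's SRE-Conv implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RadialSymmetricConv2d(nn.Module):
    """Convolution whose kernel weights depend only on distance to the kernel center."""
    def __init__(self, in_ch, out_ch, k=5):
        super().__init__()
        yy, xx = torch.meshgrid(torch.arange(k), torch.arange(k), indexing="ij")
        r = ((yy - k // 2) ** 2 + (xx - k // 2) ** 2).float().sqrt().round().long()
        self.register_buffer("rings", r)                 # (k, k) radius-bin index map
        n_bins = int(r.max().item()) + 1
        # One learnable weight per (out_ch, in_ch, radius bin): far fewer than k*k.
        self.ring_weight = nn.Parameter(torch.randn(out_ch, in_ch, n_bins) * 0.1)
        self.k = k

    def forward(self, x):
        # Expand ring weights onto the full k x k grid via the radius index map.
        weight = self.ring_weight[:, :, self.rings]      # (out_ch, in_ch, k, k)
        return F.conv2d(x, weight, padding=self.k // 2)

conv = RadialSymmetricConv2d(3, 8, k=5)
y = conv(torch.randn(1, 3, 32, 32))   # kernel is rotation-symmetric, so features vary little under rotation
```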
Existing video anomaly detection datasets are inadequate for representing complex anomalies that occur due to the interactions between objects. The absence of complex anomalies in previous video anomaly detection datasets affects research by shifting the focus onto simple anomalies. To address this problem, we introduce a new large-scale dataset: ComplexVAD. In addition, we propose a novel method to detect complex anomalies via modeling the interactions between objects using a scene graph with spatio-temporal attributes. With our proposed method and two other state-of-the-art video anomaly detection methods, we obtain baseline scores on ComplexVAD and demonstrate that our new method outperforms existing works.
https://arxiv.org/abs/2501.09733
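A toy sketch of a spatio-temporal scene graph as a plain data structure: object tracks are nodes carrying per-frame attributes, and an edge records the frames in which two objects come close enough to interact. The proximity rule and attributes here are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class TrackNode:
    object_class: str
    # frame index -> (x, y) centroid; velocity etc. could be added as extra attributes
    positions: dict = field(default_factory=dict)

def build_st_scene_graph(tracks, radius=50.0):
    """tracks: dict track_id -> TrackNode. Returns edges keyed by track pair,
    valued by the frames in which the two objects come within `radius` pixels."""
    edges = defaultdict(list)
    ids = list(tracks)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            shared = tracks[a].positions.keys() & tracks[b].positions.keys()
            for f in sorted(shared):
                ax, ay = tracks[a].positions[f]
                bx, by = tracks[b].positions[f]
                if (ax - bx) ** 2 + (ay - by) ** 2 <= radius ** 2:
                    edges[(a, b)].append(f)
    return edges

tracks = {
    0: TrackNode("person", {0: (10, 10), 1: (12, 11)}),
    1: TrackNode("car",    {0: (400, 300), 1: (14, 12)}),
}
print(build_st_scene_graph(tracks))   # {(0, 1): [1]} -> person and car interact at frame 1
```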
Generative models have made significant impacts across various domains, largely due to their ability to scale during training by increasing data, computational resources, and model size, a phenomenon characterized by the scaling laws. Recent research has begun to explore inference-time scaling behavior in Large Language Models (LLMs), revealing how performance can further improve with additional computation during inference. Unlike LLMs, diffusion models inherently possess the flexibility to adjust inference-time computation via the number of denoising steps, although the performance gains typically flatten after a few dozen. In this work, we explore the inference-time scaling behavior of diffusion models beyond increasing denoising steps and investigate how the generation performance can further improve with increased computation. Specifically, we consider a search problem aimed at identifying better noises for the diffusion sampling process. We structure the design space along two axes: the verifiers used to provide feedback, and the algorithms used to find better noise candidates. Through extensive experiments on class-conditioned and text-conditioned image generation benchmarks, our findings reveal that increasing inference-time compute leads to substantial improvements in the quality of samples generated by diffusion models and that, given the complex nature of images, the components of the framework can be combined in ways tailored to different application scenarios.
https://arxiv.org/abs/2501.09732
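The simplest instance of the search problem described above is a best-of-N search over starting noises, sketched below with a generic `sample(noise)` sampler and a `verifier(image)` scorer as placeholder callables; the paper studies richer verifiers and search algorithms along these two axes.

```python
import torch

def best_of_n_noise_search(sample, verifier, n_candidates=8, shape=(1, 3, 64, 64)):
    """Spend extra inference compute by searching over initial noises."""
    best_img, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape)      # candidate starting noise
        img = sample(noise)             # full diffusion sampling from this noise
        score = verifier(img)           # feedback signal (e.g., a reward / quality model)
        if score > best_score:
            best_img, best_score = img, score
    return best_img, best_score

# Toy stand-ins so the sketch runs: an "identity" sampler and a brightness verifier.
img, score = best_of_n_noise_search(sample=lambda z: z.clamp(-1, 1),
                                    verifier=lambda x: x.mean().item())
```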
Low-Light Image Enhancement (LLIE) is a key task in computational photography and imaging. The problem of enhancing images captured during night or in dark environments has been well-studied in the image signal processing literature. However, current deep learning-based solutions struggle with efficiency and robustness in real-world scenarios (e.g. scenes with noise, saturated pixels, bad illumination). We propose a lightweight neural network that combines image processing in the frequency and spatial domains. Our method, FLOL+, is one of the fastest models for this task, achieving state-of-the-art results on popular real scenes datasets such as LOL and LSRW. Moreover, we are able to process 1080p images under 12ms. Code and models at this https URL
https://arxiv.org/abs/2501.09718
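A rough sketch of a block that processes an image in both the frequency and spatial domains, assuming a simple amplitude-only Fourier branch plus a small spatial residual branch; it is meant to illustrate the combination, not to reproduce the FLOL+ architecture.

```python
import torch
import torch.nn as nn

class FreqSpatialBlock(nn.Module):
    """Toy block: adjust the Fourier amplitude globally, then refine locally in the spatial domain."""
    def __init__(self, ch=3):
        super().__init__()
        self.amp_mlp = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(), nn.Conv2d(ch, ch, 1))
        self.spatial = nn.Sequential(nn.Conv2d(ch, 16, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(16, ch, 3, padding=1))

    def forward(self, x):
        # Frequency branch: modify the amplitude, keep the phase
        # (global illumination lives mostly in the amplitude spectrum).
        freq = torch.fft.rfft2(x, norm="ortho")
        amp, phase = freq.abs(), freq.angle()
        amp = self.amp_mlp(amp)
        freq = torch.polar(amp, phase)
        x_freq = torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")
        # Spatial branch: local residual refinement.
        return x_freq + self.spatial(x)

y = FreqSpatialBlock()(torch.rand(1, 3, 128, 128))
```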
Hallucination remains a major challenge for Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) has gained increasing attention as a simple solution to hallucination issues. It directly learns from constructed preference pairs that reflect the severity of hallucinations in responses to the same prompt and image. Nonetheless, different data construction methods in existing works bring notable performance variations. We identify a crucial factor here: outcomes are largely contingent on whether the constructed data aligns on-policy w.r.t the initial (reference) policy of DPO. Theoretical analysis suggests that learning from off-policy data is impeded by the presence of KL-divergence between the updated policy and the reference policy. From the perspective of dataset distribution, we systematically summarize the inherent flaws in existing algorithms that employ DPO to address hallucination issues. To alleviate the problems, we propose On-Policy Alignment (OPA)-DPO framework, which uniquely leverages expert feedback to correct hallucinated responses and aligns both the original and expert-revised responses in an on-policy manner. Notably, with only 4.8k data, OPA-DPO achieves an additional reduction in the hallucination rate of LLaVA-1.5-7B: 13.26% on the AMBER benchmark and 5.39% on the Object-Hal benchmark, compared to the previous SOTA algorithm trained with 16k samples.
https://arxiv.org/abs/2501.09695
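For reference, a compact sketch of the standard DPO objective that OPA-DPO builds on, written over precomputed sequence log-probabilities; the on-policy data construction and expert-revision steps that constitute the paper's contribution are not shown.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO: prefer the less-hallucinated response relative to a frozen reference policy.
    All inputs are summed token log-probs of full responses, shape (batch,)."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)        # implicit reward of preferred answer
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)  # implicit reward of dispreferred answer
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with random log-probabilities.
b = 4
loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
```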
Open-Vocabulary Part Segmentation (OVPS) is an emerging field for recognizing fine-grained parts in unseen categories. We identify two primary challenges in OVPS: (1) the difficulty in aligning part-level image-text correspondence, and (2) the lack of structural understanding in segmenting object parts. To address these issues, we propose PartCATSeg, a novel framework that integrates object-aware part-level cost aggregation, compositional loss, and structural guidance from DINO. Our approach employs a disentangled cost aggregation strategy that handles object and part-level costs separately, enhancing the precision of part-level segmentation. We also introduce a compositional loss to better capture part-object relationships, compensating for the limited part annotations. Additionally, structural guidance from DINO features improves boundary delineation and inter-part understanding. Extensive experiments on Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets demonstrate that our method significantly outperforms state-of-the-art approaches, setting a new baseline for robust generalization to unseen part categories.
https://arxiv.org/abs/2501.09688
Face recognition technology has dramatically transformed the landscape of security, surveillance, and authentication systems, offering a user-friendly and non-invasive biometric solution. However, despite its significant advantages, face recognition systems face increasing threats from physical and digital spoofing attacks. Current research typically treats face recognition and attack detection as distinct classification challenges. This approach necessitates the implementation of separate models for each task, leading to considerable computational complexity, particularly on devices with limited resources. Such inefficiencies can stifle scalability and hinder performance. In response to these challenges, this paper introduces an innovative unified model designed for face recognition and detection of physical and digital attacks. By leveraging the advanced Swin Transformer backbone and incorporating HiLo attention in a convolutional neural network framework, we address unified face recognition and spoof attack detection more effectively. Moreover, we introduce augmentation techniques that replicate the traits of physical and digital spoofing cues, significantly enhancing our model robustness. Through comprehensive experimental evaluation across various datasets, we showcase the effectiveness of our model in unified face recognition and spoof detection. Additionally, we confirm its resilience against unseen physical and digital spoofing attacks, underscoring its potential for real-world applications.
https://arxiv.org/abs/2501.09635
With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.
https://arxiv.org/abs/2501.09617
SLAM (simultaneous localization and mapping) is a foundational technique with broad applications in robotics and AR/VR. SLAM simulations evaluate new concepts, but testing on resource-constrained devices, such as VR HMDs, faces challenges: high computational cost and restricted sensor data access. This work proposes a sparse framework using mesh geometry projections as features, which improves efficiency and circumvents direct sensor data access, advancing SLAM research as we demonstrate in VR and through numerical evaluation.
https://arxiv.org/abs/2501.09600
The appearance of surface impurities (e.g., water stains, fingerprints, stickers) is an often-mentioned issue that causes degradation of automated visual inspection systems. At the same time, synthetic data generation techniques for visual surface inspection have focused primarily on generating perfect examples and defects, disregarding impurities. This study highlights the importance of considering impurities when generating synthetic data. We introduce a procedural method to include photorealistic water stains in synthetic data. The synthetic datasets are generated to correspond to real datasets and are further used to train an anomaly detection model and investigate the influence of water stains. The high-resolution images used for surface inspection lead to memory bottlenecks during anomaly detection training. To address this, we introduce Sequential PatchCore - a method to build coresets sequentially and make training on large images using consumer-grade hardware tractable. This allows us to perform transfer learning using coresets pre-trained on different dataset versions. Our results show the benefits of using synthetic data for pre-training an explicit coreset anomaly model and the extended performance benefits of finetuning the coreset using real data. We observed how the impurities and labelling ambiguity lower the model performance and have additionally reported the defect-wise recall to provide an industrially relevant perspective on model performance.
https://arxiv.org/abs/2501.09579
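A hedged sketch of building a coreset sequentially with greedy k-center (farthest-point) selection, consuming patch features chunk by chunk so the full high-resolution feature bank never has to reside in memory at once. The chunking scheme and per-chunk budget are assumptions for illustration rather than the exact Sequential PatchCore procedure.

```python
import torch

def sequential_coreset(feature_chunks, budget_per_chunk=32):
    """Greedy k-center selection applied chunk by chunk (memory-bounded)."""
    coreset = None
    for chunk in feature_chunks:                   # chunk: (n_i, d) patch features
        if coreset is None:
            coreset = chunk[:1]                    # seed with the first patch feature
        # Distance of every candidate in this chunk to the coreset built so far.
        d = torch.cdist(chunk, coreset).min(dim=1).values
        for _ in range(budget_per_chunk):
            idx = int(d.argmax())                  # farthest-point (k-center) criterion
            picked = chunk[idx : idx + 1]
            coreset = torch.cat([coreset, picked], dim=0)
            # Update distances: a point is now covered if it is close to the new center.
            d = torch.minimum(d, torch.cdist(chunk, picked).squeeze(1))
    return coreset

chunks = (torch.randn(1000, 128) for _ in range(5))   # e.g., features streamed per image
memory_bank = sequential_coreset(chunks)
print(memory_bank.shape)                              # roughly (1 + 5 * 32, 128)
```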
Conventional 2D human pose estimation methods typically require extensive labeled annotations, which are both labor-intensive and expensive. In contrast, semi-supervised 2D human pose estimation can alleviate the above problems by leveraging a large amount of unlabeled data along with a small portion of labeled data. Existing semi-supervised 2D human pose estimation methods update the network through backpropagation, ignoring crucial historical information from the previous training process. Therefore, we propose a novel semi-supervised 2D human pose estimation method by utilizing a newly designed Teacher-Reviewer-Student framework. Specifically, we first mimic the phenomenon that human beings constantly review previous knowledge for consolidation to design our framework, in which the teacher predicts results to guide the student's learning and the reviewer stores important historical parameters to provide additional supervision signals. Secondly, we introduce a Multi-level Feature Learning strategy, which utilizes the outputs from different stages of the backbone to estimate the heatmap to guide network training, enriching the supervisory information while effectively capturing keypoint relationships. Finally, we design a data augmentation strategy, i.e., Keypoint-Mix, to perturb pose information by mixing different keypoints, thus enhancing the network's ability to discern keypoints. Extensive experiments on publicly available datasets, demonstrate our method achieves significant improvements compared to the existing methods.
https://arxiv.org/abs/2501.09565
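A small sketch of a keypoint-mixing augmentation in the spirit of Keypoint-Mix, assuming two annotations that share the same keypoint layout and a uniform per-joint swap probability; the exact mixing rule in the paper may differ.

```python
import numpy as np

def keypoint_mix(kpts_a, kpts_b, mix_prob=0.5, rng=None):
    """Perturb a pose by swapping a random subset of keypoints between two samples.
    kpts_*: (K, 3) arrays of (x, y, visibility) sharing the same keypoint definition."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(len(kpts_a)) < mix_prob    # which joints are taken from sample b
    mixed = kpts_a.copy()
    mixed[mask] = kpts_b[mask]
    return mixed, mask

a = np.random.rand(17, 3)    # e.g., a COCO-style 17-keypoint pose
b = np.random.rand(17, 3)
mixed, taken_from_b = keypoint_mix(a, b)
```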
De-identification of medical images is a critical step to ensure privacy during data sharing in research and clinical settings. The initial step in this process involves detecting Protected Health Information (PHI), which can be found in image metadata or imprinted within image pixels. Despite the importance of such systems, there has been limited evaluation of existing AI-based solutions, creating barriers to the development of reliable and robust tools. In this study, we present an AI-based pipeline for PHI detection, comprising three key components: text detection, text extraction, and analysis of PHI content in medical images. By experimenting with exchanging roles of vision and language models within the pipeline, we evaluate the performance and recommend the best setup for the PHI detection task.
https://arxiv.org/abs/2501.09552
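The three-stage pipeline can be pictured as plain Python with the text detector, text extractor, and PHI analyzer passed in as interchangeable callables, which is also how vision and language models could swap roles across stages. The interfaces and the toy stand-ins below are assumptions, not the study's actual components.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]    # (x, y, w, h) region with burned-in text

@dataclass
class PHIFinding:
    box: Box
    text: str
    is_phi: bool

def detect_phi(image,
               text_detector: Callable[[object], List[Box]],
               text_extractor: Callable[[object, Box], str],
               phi_analyzer: Callable[[str], bool]) -> List[PHIFinding]:
    """Stage 1: find text regions. Stage 2: read them. Stage 3: decide whether they contain PHI."""
    findings = []
    for box in text_detector(image):
        text = text_extractor(image, box)
        findings.append(PHIFinding(box, text, phi_analyzer(text)))
    return findings

# Toy stand-ins so the sketch runs; real components would be vision / language models.
results = detect_phi(
    image=None,
    text_detector=lambda img: [(10, 10, 80, 20)],
    text_extractor=lambda img, box: "DOE, JANE 1970-01-01",
    phi_analyzer=lambda t: any(c.isdigit() for c in t),   # crude placeholder rule
)
```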
The success of VLMs often relies on dynamic high-resolution schemes that adaptively split the input image into multiple crops so that image details are retained. However, such approaches produce a large number of redundant visual tokens, significantly reducing the efficiency of the VLMs. To improve VLM efficiency without introducing extra training costs, many works propose to reduce the visual tokens by filtering out uninformative ones or aggregating their information. Some approaches reduce the visual tokens according to the self-attention of the VLM, which is biased and leads to inaccurate responses. Token-reduction approaches that rely solely on visual cues are text-agnostic and fail to focus on the areas most relevant to the question, especially when the queried objects are non-salient in the image. In this work, we first conduct experiments showing that the original text embeddings are aligned with the visual tokens, without bias toward the tailed visual tokens. We then propose a self-adaptive cross-modality attention mixture mechanism that dynamically leverages visual saliency and text-to-image similarity in the pre-LLM layers to select informative visual tokens. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art training-free VLM acceleration, especially when the reduction rate is sufficiently large.
https://arxiv.org/abs/2501.09532
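A minimal sketch of selecting visual tokens by blending a visual-saliency score with text-to-image similarity computed before the LLM layers. The fixed mixing weight stands in for the paper's self-adaptive mechanism, and the tensor shapes and CLS-based saliency are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def select_visual_tokens(vis_tokens, cls_token, text_emb, keep_ratio=0.25, alpha=0.5):
    """vis_tokens: (N, D) image patch tokens; cls_token: (D,); text_emb: (T, D) prompt tokens."""
    v = F.normalize(vis_tokens, dim=-1)
    # Visual saliency: similarity of each patch to the global [CLS] representation.
    saliency = v @ F.normalize(cls_token, dim=-1)                         # (N,)
    # Text relevance: best similarity of each patch to any text token.
    relevance = (v @ F.normalize(text_emb, dim=-1).T).max(dim=-1).values  # (N,)
    score = alpha * saliency + (1 - alpha) * relevance
    k = max(1, int(keep_ratio * len(vis_tokens)))
    keep = score.topk(k).indices.sort().values     # keep original token order
    return vis_tokens[keep], keep

tokens, idx = select_visual_tokens(torch.randn(576, 768), torch.randn(768), torch.randn(12, 768))
```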
Training deep neural networks requires datasets with a large number of annotated examples. The collection and annotation of these datasets is not only extremely expensive but also faces legal and privacy problems. These factors are a significant limitation for many real-world applications. To address this, we introduce HydraMix, a novel architecture that generates new image compositions by mixing multiple different images from the same class. HydraMix learns the fusion of the content of various images guided by a segmentation-based mixing mask in feature space and is optimized via a combination of unsupervised and adversarial training. Our data augmentation scheme allows the creation of models trained from scratch on very small datasets. We conduct extensive experiments on ciFAIR-10, STL-10, and ciFAIR-100. Additionally, we introduce a novel text-image metric to assess the generality of the augmented datasets. Our results show that HydraMix outperforms existing state-of-the-art methods for image classification on small datasets.
https://arxiv.org/abs/2501.09504
Recently, large-scale generative models have demonstrated outstanding text-to-image generation capabilities. However, generating high-fidelity personalized images with specific subjects still presents challenges, especially in cases involving multiple subjects. In this paper, we propose AnyStory, a unified approach for personalized subject generation. AnyStory not only achieves high-fidelity personalization for single subjects, but also for multiple subjects, without sacrificing subject fidelity. Specifically, AnyStory models the subject personalization problem in an "encode-then-route" manner. In the encoding step, AnyStory utilizes a universal and powerful image encoder, i.e., ReferenceNet, in conjunction with CLIP vision encoder to achieve high-fidelity encoding of subject features. In the routing step, AnyStory utilizes a decoupled instance-aware subject router to accurately perceive and predict the potential location of the corresponding subject in the latent space, and guide the injection of subject conditions. Detailed experimental results demonstrate the excellent performance of our method in retaining subject details, aligning text descriptions, and personalizing for multiple subjects. The project page is at this https URL .
https://arxiv.org/abs/2501.09503
Understanding emotions accurately is essential for fields like human-computer interaction. Due to the complexity of emotions and their multi-modal nature (e.g., emotions are influenced by facial expressions and audio), researchers have turned to using multi-modal models to understand human emotions rather than single-modality. However, current video multi-modal large language models (MLLMs) encounter difficulties in effectively integrating audio and identifying subtle facial micro-expressions. Furthermore, the lack of detailed emotion analysis datasets also limits the development of multimodal emotion analysis. To address these issues, we introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively. These datasets allow models to learn from diverse scenarios and better generalize to real-world applications. Moreover, in addition to the audio modeling, we propose to explicitly integrate facial encoding models into the existing advanced Video MLLM, enabling the MLLM to effectively unify audio and the subtle facial cues for emotion understanding. By aligning these features within a unified space and employing instruction tuning in our proposed datasets, our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
https://arxiv.org/abs/2501.09502
Video colorization aims to transform grayscale videos into vivid color representations while maintaining temporal consistency and structural integrity. Existing video colorization methods often suffer from color bleeding and lack comprehensive control, particularly under complex motion or diverse semantic cues. To this end, we introduce VanGogh, a unified multimodal diffusion-based framework for video colorization. VanGogh tackles these challenges using a Dual Qformer to align and fuse features from multiple modalities, complemented by a depth-guided generation process and an optical flow loss, which help reduce color overflow. Additionally, a color injection strategy and luma channel replacement are implemented to improve generalization and mitigate flickering artifacts. Thanks to this design, users can exercise both global and local control over the generation process, resulting in higher-quality colorized videos. Extensive qualitative and quantitative evaluations, and user studies, demonstrate that VanGogh achieves superior temporal consistency and color quality. Project page: this https URL.
https://arxiv.org/abs/2501.09499
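A small sketch of the luma-channel-replacement idea: convert a generated color frame to YCbCr, swap in the original grayscale frame as the luma channel, and convert back, preserving source structure and suppressing luminance flicker. The BT.601 transform used here is only a plausible default; the exact transform VanGogh uses is not specified in the abstract.

```python
import numpy as np

# BT.601 RGB <-> YCbCr (inputs in [0, 1]); chroma is centered at 0 in this convention.
RGB2YCC = np.array([[0.299, 0.587, 0.114],
                    [-0.168736, -0.331264, 0.5],
                    [0.5, -0.418688, -0.081312]])
YCC2RGB = np.linalg.inv(RGB2YCC)

def replace_luma(colorized_rgb, gray):
    """colorized_rgb: (H, W, 3) in [0, 1]; gray: (H, W) original grayscale frame in [0, 1]."""
    ycc = colorized_rgb @ RGB2YCC.T    # to YCbCr
    ycc[..., 0] = gray                 # keep generated chroma, restore the source luma
    return np.clip(ycc @ YCC2RGB.T, 0.0, 1.0)

frame = replace_luma(np.random.rand(64, 64, 3), np.random.rand(64, 64))
```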