We introduce SynthLight, a diffusion model for portrait relighting. Our approach frames image relighting as a re-rendering problem, where pixels are transformed in response to changes in environmental lighting conditions. Using a physically-based rendering engine, we synthesize a dataset to simulate this lighting-conditioned transformation with 3D head assets under varying lighting. We propose two training and inference strategies to bridge the gap between the synthetic and real image domains: (1) multi-task training that takes advantage of real human portraits without lighting labels; (2) an inference-time diffusion sampling procedure based on classifier-free guidance that leverages the input portrait to better preserve details. Our method generalizes to diverse real photographs and produces realistic illumination effects, including specular highlights and cast shadows, while preserving the subject's identity. Our quantitative experiments on Light Stage data demonstrate results comparable to state-of-the-art relighting methods. Our qualitative results on in-the-wild images showcase rich and unprecedented illumination effects. Project Page: \url{this https URL}
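The inference-time guidance in (2) suggests a dual classifier-free guidance combination. The sketch below is one plausible form, assuming a hypothetical model(x_t, t, portrait, lighting) noise predictor and illustrative guidance weights; the paper's actual interface may differ:

def guided_noise_prediction(model, x_t, t, portrait, lighting,
                            w_img=1.5, w_light=2.0):
    """Hypothetical dual classifier-free guidance: the portrait branch pulls
    the sample toward the input identity and details, the lighting branch
    toward the target illumination. Weights are illustrative."""
    e_uncond = model(x_t, t, portrait=None, lighting=None)        # unconditional
    e_img = model(x_t, t, portrait=portrait, lighting=None)       # portrait only
    e_full = model(x_t, t, portrait=portrait, lighting=lighting)  # both conditions
    return (e_uncond
            + w_img * (e_img - e_uncond)      # strengthen detail preservation
            + w_light * (e_full - e_img))     # strengthen the relighting signal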
https://arxiv.org/abs/2501.09756
The objective of BioCreative8 Track 3 is to extract phenotypic key medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms. However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms. To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step. Our pipeline resulted in an exact extraction and normalization F1 score 2.6\% higher than the mean score of all submissions received in response to the challenge. Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9\%. These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain.
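One plausible reading of synonym marginalization at normalization time is to score each HPO term by averaging a mention's similarity over all of the term's synonyms, rather than matching against a single canonical name. A minimal sketch under that assumption, with all embeddings taken as precomputed:

import numpy as np

def normalize_mention(mention_vec, hpo_synonyms):
    """hpo_synonyms: dict mapping an HPO id to an (n_syn, d) array of
    synonym embeddings; mention_vec: (d,) embedding of the extracted
    finding. Returns the HPO id with the best synonym-averaged score."""
    def mean_cos(m, syns):
        sims = (syns @ m) / (np.linalg.norm(syns, axis=1)
                             * np.linalg.norm(m) + 1e-9)
        return sims.mean()  # marginalize the score over synonyms
    scores = {hpo_id: mean_cos(mention_vec, syns)
              for hpo_id, syns in hpo_synonyms.items()}
    return max(scores, key=scores.get)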
https://arxiv.org/abs/2501.09744
This study conducts a systematic assessment of the capabilities of 12 machine learning models and model variations in detecting economic ideology. As an evaluation benchmark, I use manifesto data spanning six elections in the United Kingdom and pre-annotated by expert and crowd coders. The analysis assesses the performance of several generative, fine-tuned, and zero-shot models at the granular and aggregate levels. The results show that generative models such as GPT-4o and Gemini 1.5 Flash consistently outperform other models against all benchmarks. However, they pose issues of accessibility and resource availability. Fine-tuning yielded competitive performance and offers a reliable alternative through domain-specific optimization, but its dependency on training data severely limits scalability. Zero-shot models consistently face difficulties with identifying signals of economic ideology, often resulting in negative associations with human coding. Using general knowledge for the domain-specific task of ideology scaling proved to be unreliable. Other key findings include considerable within-party variation, fine-tuning benefiting from larger training data, and zero-shot's sensitivity to prompt content. The assessments include the strengths and limitations of each model and derive best practices for automated analyses of political content.
https://arxiv.org/abs/2501.09719
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deletion is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Code has been released at this https URL.
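The group sparse regularization admits a compact group-lasso form: summing unsquared L2 norms over LoRA parameter groups drives entire groups to exactly zero, so only the groups relevant to the current forgetting task stay active. A minimal sketch, where the grouping and weight are assumptions:

import torch

def group_sparse_penalty(lora_groups, alpha=1e-3):
    """lora_groups: an iterable of parameter groups, e.g. each (A, B) pair
    of one LoRA module. The unsquared group L2 norm (group lasso) zeroes
    out whole groups rather than individual weights."""
    penalty = sum(torch.sqrt(sum(p.pow(2).sum() for p in group) + 1e-12)
                  for group in lora_groups)
    return alpha * penalty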
https://arxiv.org/abs/2501.09705
Electroencephalogram (EEG) signals have emerged as a promising modality for biometric identification. While previous studies have explored the use of imagined speech with semantically meaningful words for subject identification, most have relied on additional visual or auditory cues. In this study, we introduce a cueless EEG-based imagined speech paradigm, where subjects imagine the pronunciation of semantically meaningful words without any external cues. This innovative approach addresses the limitations of prior methods by requiring subjects to naturally select and imagine words from a predefined list. The dataset comprises over 4,350 trials from 11 subjects across five sessions. We assess a variety of classification methods, including traditional machine learning techniques such as Support Vector Machines (SVM) and XGBoost, as well as time-series foundation models and deep learning architectures specifically designed for EEG classification, such as EEG Conformer and Shallow ConvNet. A session-based hold-out validation strategy was employed to ensure reliable evaluation and prevent data leakage. Our results demonstrate outstanding classification accuracy, reaching 97.93%. These findings highlight the potential of cueless EEG paradigms for secure and reliable subject identification in real-world applications, such as brain-computer interfaces (BCIs).
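Session-based hold-out validation here simply means that no recording session contributes to both splits, which is what prevents session-level leakage. A minimal sketch, assuming trials are records with a 'session' field:

def session_holdout_split(trials, test_session):
    """Hold out every trial from one session for testing so that
    session-specific artifacts cannot leak into training."""
    train = [t for t in trials if t["session"] != test_session]
    test = [t for t in trials if t["session"] == test_session]
    return train, test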
https://arxiv.org/abs/2501.09700
Autonomous docking remains one of the most challenging maneuvers in marine robotics, requiring precise control and robust perception in confined spaces. This paper presents a novel approach integrating Model Predictive Path Integral (MPPI) control with real-time LiDAR-based dock detection for autonomous surface vessel docking. Our framework uniquely combines probabilistic trajectory optimization with a multi-objective cost function that simultaneously considers docking precision, safety constraints, and motion efficiency. The MPPI controller generates optimal trajectories by intelligently sampling control sequences and evaluating their costs based on dynamic clearance requirements, orientation alignment, and target position objectives. We introduce an adaptive dock detection pipeline that processes LiDAR point clouds to extract critical geometric features, enabling real-time updates of docking parameters. The proposed method is extensively validated in a physics-based simulation environment that incorporates realistic sensor noise, vessel dynamics, and environmental constraints. Results demonstrate successful docking from various initial positions while maintaining safe clearances and smooth motion characteristics.
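The MPPI loop described above has a standard form: sample perturbed control sequences, roll them out, and re-weight them by a softmax over trajectory costs. A minimal sketch, with dynamics and cost as user-supplied callables (the docking-specific clearance, alignment, and target terms would live inside cost):

import numpy as np

def mppi_step(dynamics, cost, u_nominal, state,
              n_samples=512, sigma=0.3, lam=1.0):
    """One MPPI update: returns the nominal control sequence (a
    (horizon, u_dim) array) shifted by a cost-weighted average of
    sampled perturbations."""
    horizon, u_dim = u_nominal.shape
    noise = np.random.randn(n_samples, horizon, u_dim) * sigma
    costs = np.empty(n_samples)
    for k in range(n_samples):
        x, c = state, 0.0
        for t in range(horizon):
            x = dynamics(x, u_nominal[t] + noise[k, t])
            c += cost(x)
        costs[k] = c
    w = np.exp(-(costs - costs.min()) / lam)  # softmax weights over rollouts
    w /= w.sum()
    return u_nominal + np.einsum("k,kij->ij", w, noise)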
https://arxiv.org/abs/2501.09668
In many real-world applications, agents must make sequential decisions in environments where conditions are subject to change due to various exogenous factors. These non-stationary environments pose significant challenges to traditional decision-making models, which typically assume stationary dynamics. Non-stationary Markov decision processes (NS-MDPs) offer a framework to model and solve decision problems under such changing conditions. However, the lack of standardized benchmarks and simulation tools has hindered systematic evaluation and progress in this field. We present NS-Gym, the first simulation toolkit designed explicitly for NS-MDPs, integrated within the popular Gymnasium framework. In NS-Gym, we segregate the evolution of the environmental parameters that characterize non-stationarity from the agent's decision-making module, allowing for modular and flexible adaptations to dynamic environments. We review prior work in this domain and present a toolkit encapsulating key problem characteristics and types in NS-MDPs. This toolkit is the first effort to develop a set of standardized interfaces and benchmark problems to enable consistent and reproducible evaluation of algorithms under non-stationary conditions. We also benchmark six algorithmic approaches from prior work on NS-MDPs using NS-Gym. Our vision is that NS-Gym will enable researchers to assess the adaptability and robustness of their decision-making algorithms to non-stationary conditions.
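The separation of parameter evolution from the decision-making module can be pictured as a wrapper whose schedule mutates environment parameters outside the agent's loop. The sketch below is purely illustrative and does not reflect the actual NS-Gym API:

import gymnasium as gym

class NonStationaryWrapper(gym.Wrapper):
    """Evolve non-stationarity parameters on a schedule that the agent
    never sees, keeping the decision-making module unchanged."""
    def __init__(self, env, param_schedule):
        super().__init__(env)
        self.param_schedule = param_schedule  # step index -> {name: value}
        self.t = 0

    def step(self, action):
        for name, value in self.param_schedule(self.t).items():
            setattr(self.env.unwrapped, name, value)  # e.g. gravity, masspole
        self.t += 1
        return self.env.step(action)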
https://arxiv.org/abs/2501.09646
Face recognition technology has dramatically transformed the landscape of security, surveillance, and authentication systems, offering a user-friendly and non-invasive biometric solution. However, despite its significant advantages, face recognition systems face increasing threats from physical and digital spoofing attacks. Current research typically treats face recognition and attack detection as distinct classification challenges. This approach necessitates the implementation of separate models for each task, leading to considerable computational complexity, particularly on devices with limited resources. Such inefficiencies can stifle scalability and hinder performance. In response to these challenges, this paper introduces an innovative unified model designed for face recognition and detection of physical and digital attacks. By leveraging the advanced Swin Transformer backbone and incorporating HiLo attention in a convolutional neural network framework, we address unified face recognition and spoof attack detection more effectively. Moreover, we introduce augmentation techniques that replicate the traits of physical and digital spoofing cues, significantly enhancing our model's robustness. Through comprehensive experimental evaluation across various datasets, we showcase the effectiveness of our model in unified face recognition and spoof detection. Additionally, we confirm its resilience against unseen physical and digital spoofing attacks, underscoring its potential for real-world applications.
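Architecturally, the unified model amounts to one shared backbone feeding two heads, so a single forward pass serves both recognition and attack detection. A schematic sketch, where the dimensions and head designs are assumptions:

import torch.nn as nn

class UnifiedFaceModel(nn.Module):
    """Shared feature extractor (a Swin-style backbone in the paper; any
    module mapping images to feature vectors here) with an identity
    embedding head and a live/physical-attack/digital-attack head."""
    def __init__(self, backbone, feat_dim=768, embed_dim=512):
        super().__init__()
        self.backbone = backbone
        self.id_head = nn.Linear(feat_dim, embed_dim)  # recognition embedding
        self.spoof_head = nn.Linear(feat_dim, 3)       # live / physical / digital

    def forward(self, x):
        feats = self.backbone(x)
        return self.id_head(feats), self.spoof_head(feats)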
https://arxiv.org/abs/2501.09635
As artificial intelligence (AI) becomes increasingly embedded in healthcare delivery, this chapter explores the critical aspects of developing reliable and ethical Clinical Decision Support Systems (CDSS). Beginning with the fundamental transition from traditional statistical models to sophisticated machine learning approaches, this work examines rigorous validation strategies and performance assessment methods, including the crucial role of model calibration and decision curve analysis. The chapter emphasizes that creating trustworthy AI systems in healthcare requires more than just technical accuracy; it demands careful consideration of fairness, explainability, and privacy. The challenge of ensuring equitable healthcare delivery through AI is stressed, discussing methods to identify and mitigate bias in clinical predictive models. The chapter then delves into explainability as a cornerstone of human-centered CDSS. This focus reflects the understanding that healthcare professionals must not only trust AI recommendations but also comprehend their underlying reasoning. The discussion then advances to an analysis of privacy vulnerabilities in medical AI systems, from data leakage in deep learning models to sophisticated attacks against model explanations. The text explores privacy-preservation strategies such as differential privacy and federated learning, while acknowledging the inherent trade-offs between privacy protection and model performance. This progression, from technical validation to ethical considerations, reflects the multifaceted challenges of developing AI systems that can be seamlessly and reliably integrated into daily clinical practice while maintaining the highest standards of patient care and data protection.
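Of the validation tools mentioned, decision curve analysis has a closed form worth making explicit: the net benefit of treating at threshold probability p_t is TP/n - (FP/n) * p_t/(1 - p_t). A minimal sketch for binary outcomes:

import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted probability
    exceeds `threshold`, as used in decision curve analysis."""
    n = len(y_true)
    treat = y_prob >= threshold
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)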
https://arxiv.org/abs/2501.09628
With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.
https://arxiv.org/abs/2501.09617
SLAM (simultaneous localization and mapping) is a foundational technique with broad applications in robotics and AR/VR. SLAM simulations are used to evaluate new concepts, but testing on resource-constrained devices, such as VR HMDs, faces two challenges: high computational cost and restricted sensor data access. This work proposes a sparse framework that uses mesh geometry projections as features, which improves efficiency and circumvents direct sensor data access, advancing SLAM research, as we demonstrate in VR and through numerical evaluation.
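The core trick, mesh geometry projections as features, can be pictured as projecting known mesh vertices through the camera model instead of reading raw sensor frames. The pinhole-projection sketch below is an illustration of that idea, not the paper's implementation:

import numpy as np

def project_mesh_vertices(vertices, K, R, t, width, height):
    """Project world-space mesh vertices (N, 3) through a pinhole camera
    (intrinsics K, world-to-camera rotation R and translation t) and keep
    the in-image points as sparse 2D features."""
    cam = R @ vertices.T + t[:, None]       # (3, N) camera-frame coordinates
    in_front = cam[2] > 1e-6                # drop points behind the camera
    pix = K @ cam[:, in_front]
    pix = (pix[:2] / pix[2]).T              # (M, 2) pixel coordinates
    keep = ((pix[:, 0] >= 0) & (pix[:, 0] < width) &
            (pix[:, 1] >= 0) & (pix[:, 1] < height))
    return pix[keep]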
https://arxiv.org/abs/2501.09600
The appearance of surface impurities (e.g., water stains, fingerprints, stickers) is an often-mentioned issue that causes degradation of automated visual inspection systems. At the same time, synthetic data generation techniques for visual surface inspection have focused primarily on generating perfect examples and defects, disregarding impurities. This study highlights the importance of considering impurities when generating synthetic data. We introduce a procedural method to include photorealistic water stains in synthetic data. The synthetic datasets are generated to correspond to real datasets and are further used to train an anomaly detection model and investigate the influence of water stains. The high-resolution images used for surface inspection lead to memory bottlenecks during anomaly detection training. To address this, we introduce Sequential PatchCore - a method to build coresets sequentially and make training on large images using consumer-grade hardware tractable. This allows us to perform transfer learning using coresets pre-trained on different dataset versions. Our results show the benefits of using synthetic data for pre-training an explicit coreset anomaly model and the extended performance benefits of finetuning the coreset using real data. We observed how the impurities and labelling ambiguity lower the model performance and have additionally reported the defect-wise recall to provide an industrially relevant perspective on model performance.
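Building a coreset sequentially can be sketched as greedy farthest-point (k-center) selection applied batch by batch, so the full patch-feature bank for a high-resolution image never has to reside in memory at once. The actual Sequential PatchCore procedure may differ in detail:

import numpy as np

def sequential_coreset(patch_batches, budget_per_batch):
    """patch_batches: an iterable of (n, d) feature arrays, one batch at a
    time; keeps a greedy k-center subset of each batch and concatenates
    the results into the final coreset."""
    kept = []
    for feats in patch_batches:
        chosen = [0]                                       # seed point
        d = np.linalg.norm(feats - feats[0], axis=1)
        for _ in range(min(budget_per_batch, len(feats)) - 1):
            nxt = int(d.argmax())                          # farthest point
            chosen.append(nxt)
            d = np.minimum(d, np.linalg.norm(feats - feats[nxt], axis=1))
        kept.append(feats[chosen])
    return np.concatenate(kept, axis=0)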
https://arxiv.org/abs/2501.09579
Training deep neural networks requires datasets with a large number of annotated examples. The collection and annotation of these datasets is not only extremely expensive but also faces legal and privacy problems. These factors are a significant limitation for many real-world applications. To address this, we introduce HydraMix, a novel architecture that generates new image compositions by mixing multiple different images from the same class. HydraMix learns the fusion of the content of various images guided by a segmentation-based mixing mask in feature space and is optimized via a combination of unsupervised and adversarial training. Our data augmentation scheme allows the creation of models trained from scratch on very small datasets. We conduct extensive experiments on ciFAIR-10, STL-10, and ciFAIR-100. Additionally, we introduce a novel text-image metric to assess the generality of the augmented datasets. Our results show that HydraMix outperforms existing state-of-the-art methods for image classification on small datasets.
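The mask-guided fusion can be pictured as a per-pixel convex combination of several same-class feature maps, with segmentation-derived masks normalized to sum to one. A deliberately simplified sketch of that mixing step (HydraMix additionally learns the fusion with unsupervised and adversarial objectives, which is omitted here):

import torch

def mask_guided_mix(features, masks):
    """features: (k, c, h, w) feature maps from k images of one class;
    masks: (k, 1, h, w) segmentation-derived mixing logits. Returns a
    single (c, h, w) composition."""
    weights = torch.softmax(masks, dim=0)   # per-pixel convex weights
    return (weights * features).sum(dim=0)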
https://arxiv.org/abs/2501.09504
Understanding emotions accurately is essential for fields like human-computer interaction. Due to the complexity of emotions and their multi-modal nature (e.g., emotions are influenced by facial expressions and audio), researchers have turned to using multi-modal models to understand human emotions rather than single-modality. However, current video multi-modal large language models (MLLMs) encounter difficulties in effectively integrating audio and identifying subtle facial micro-expressions. Furthermore, the lack of detailed emotion analysis datasets also limits the development of multimodal emotion analysis. To address these issues, we introduce a self-reviewed dataset and a human-reviewed dataset, comprising 24,137 coarse-grained samples and 3,500 manually annotated samples with detailed emotion annotations, respectively. These datasets allow models to learn from diverse scenarios and better generalize to real-world applications. Moreover, beyond audio modeling, we propose to explicitly integrate facial encoding models into the existing advanced Video MLLM, enabling the MLLM to effectively unify audio and the subtle facial cues for emotion understanding. By aligning these features within a unified space and employing instruction tuning in our proposed datasets, our Omni-Emotion achieves state-of-the-art performance in both emotion recognition and reasoning tasks.
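Aligning the modalities "within a unified space" typically means projecting each encoder's output into the language model's embedding space. A schematic sketch with assumed dimensions, not the actual Omni-Emotion configuration:

import torch.nn as nn

class ModalityAdapters(nn.Module):
    """Linear adapters mapping audio and facial-encoder features into the
    LLM token-embedding space; all dimensions are illustrative."""
    def __init__(self, audio_dim=1024, face_dim=512, llm_dim=4096):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, llm_dim)
        self.face_proj = nn.Linear(face_dim, llm_dim)

    def forward(self, audio_feats, face_feats):
        # both outputs are sequences of token-like embeddings for the LLM
        return self.audio_proj(audio_feats), self.face_proj(face_feats)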
https://arxiv.org/abs/2501.09502
LiDAR is a crucial sensor in autonomous driving, commonly used alongside cameras. By exploiting this camera-LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image-to-LiDAR distillation. These prior works focus on the designs of their own losses to effectively distill the pre-trained 2D image representations into a 3D model. However, the other parts of the design have been surprisingly unexplored. We find that fundamental design elements, e.g., the LiDAR coordinate system, quantization according to the existing input interface, and data utilization, are more critical than developing loss functions, which have been overlooked in prior works. In this work, we show that simple fixes to these designs notably outperform existing methods by 16% in 3D semantic segmentation on the nuScenes dataset and 13% in 3D object detection on the KITTI dataset in downstream task performance. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinates and voxel sizes without considering the side effects they yield with a commonly deployed sparse convolution layer input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting the use to only the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.
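The coordinate/quantization point is easy to see in a voxelizer: in cylindrical coordinates the azimuthal bin width grows with range, so a fixed angular voxel size produces larger quantization error far from the sensor, whereas Cartesian bins keep the error uniform. An illustrative sketch:

import numpy as np

def voxelize(points, voxel_size, cylindrical=False):
    """points: (N, 3+) LiDAR points; voxel_size: per-axis bin sizes
    ((dx, dy, dz) or (drho, dphi, dz)). Returns integer voxel indices as
    consumed by a sparse convolution input interface."""
    if cylindrical:
        rho = np.linalg.norm(points[:, :2], axis=1)
        phi = np.arctan2(points[:, 1], points[:, 0])
        coords = np.stack([rho, phi, points[:, 2]], axis=1)
    else:
        coords = points[:, :3]
    # quantization error per point is bounded by voxel_size/2 in Cartesian
    # bins, but scales with rho along the phi axis of cylindrical bins
    return np.floor(coords / np.asarray(voxel_size)).astype(np.int64)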
https://arxiv.org/abs/2501.09485
Neural Radiance Fields (NeRF) often struggle with reconstructing and rendering highly reflective scenes. Recent advancements have developed various reflection-aware appearance models to enhance NeRF's capability to render specular reflections. However, the robust reconstruction of highly reflective scenes is still hindered by the inherent shape ambiguity on specular surfaces. Existing methods typically rely on additional geometry priors to regularize the shape prediction, but this can lead to oversmoothed geometry in complex scenes. Observing the critical role of surface normals in parameterizing reflections, we introduce a transmittance-gradient-based normal estimation technique that remains robust even under ambiguous shape conditions. Furthermore, we propose a dual activated densities module that effectively bridges the gap between smooth surface normals and sharp object boundaries. Combined with a reflection-aware appearance model, our proposed method achieves robust reconstruction and high-fidelity rendering of scenes featuring both highly specular reflections and intricate geometric structures. Extensive experiments demonstrate that our method outperforms existing state-of-the-art methods on various datasets.
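The transmittance-gradient normal estimation can be sketched with autograd: treat transmittance as a scalar field of position and take its normalized spatial gradient as the normal direction. The transmittance_fn interface and the sign convention below are assumptions, not the paper's implementation:

import torch
import torch.nn.functional as F

def transmittance_gradient_normals(transmittance_fn, xyz):
    """transmittance_fn: maps (N, 3) points to (N,) transmittance values.
    The gradient points toward increasing transmittance (empty space),
    giving a normal estimate that stays defined even when the density
    geometry is ambiguous."""
    xyz = xyz.clone().requires_grad_(True)
    T = transmittance_fn(xyz)
    grad, = torch.autograd.grad(T.sum(), xyz, create_graph=True)
    return F.normalize(grad, dim=-1)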
https://arxiv.org/abs/2501.09460
The synthesis of high-quality 3D assets from textual or visual inputs has become a central objective in modern generative modeling. Despite the proliferation of 3D generation algorithms, they frequently grapple with challenges such as multi-view inconsistency, slow generation times, low fidelity, and surface reconstruction problems. While some studies have addressed some of these issues, a comprehensive solution remains elusive. In this paper, we introduce \textbf{CaPa}, a carve-and-paint framework that generates high-fidelity 3D assets efficiently. CaPa employs a two-stage process, decoupling geometry generation from texture synthesis. Initially, a 3D latent diffusion model generates geometry guided by multi-view inputs, ensuring structural consistency across perspectives. Subsequently, leveraging a novel, model-agnostic Spatially Decoupled Attention, the framework synthesizes high-resolution textures (up to 4K) for a given geometry. Furthermore, we propose a 3D-aware occlusion inpainting algorithm that fills untextured regions, resulting in cohesive results across the entire model. This pipeline generates high-quality 3D assets in less than 30 seconds, providing ready-to-use outputs for commercial applications. Experimental results demonstrate that CaPa excels in both texture fidelity and geometric stability, establishing a new standard for practical, scalable 3D asset generation.
https://arxiv.org/abs/2501.09433
While large language models (LLMs) present significant potential for supporting numerous real-world applications and delivering positive social impacts, they still face significant challenges in terms of the inherent risk of privacy leakage, hallucinated outputs, and value misalignment, and can be maliciously used for generating toxic content and unethical purposes after being jailbroken. Therefore, in this survey, we present a comprehensive review of recent advancements aimed at mitigating these issues, organized across the four phases of LLM development and usage: data collecting and pre-training, fine-tuning and alignment, prompting and reasoning, and post-processing and auditing. We elaborate on the recent advances for enhancing the performance of LLMs in terms of privacy protection, hallucination reduction, value alignment, toxicity elimination, and jailbreak defenses. In contrast to previous surveys that focus on a single dimension of responsible LLMs, this survey presents a unified framework that encompasses these diverse dimensions, providing a comprehensive view of enhancing LLMs to better serve real-world applications.
https://arxiv.org/abs/2501.09431
In recent years, there has been an increasing interest in image anonymization, particularly focusing on the de-identification of faces and individuals. However, for self-driving applications, merely de-identifying faces and individuals might not provide sufficient privacy protection since street views like vehicles and buildings can still disclose locations, trajectories, and other sensitive information. Therefore, it remains crucial to extend anonymization techniques to street view images to fully preserve the privacy of users, pedestrians, and vehicles. In this paper, we propose a Street View Image Anonymization (SVIA) framework for self-driving applications. The SVIA framework consists of three integral components: a semantic segmenter to segment an input image into functional regions, an inpainter to generate alternatives to privacy-sensitive regions, and a harmonizer to seamlessly stitch modified regions to guarantee visual coherence. Compared to existing methods, SVIA achieves a much better trade-off between image generation quality and privacy protection, as evidenced by experimental results for five common metrics on two widely used public datasets.
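The three components compose into a straightforward pipeline; the sketch below only fixes that composition, with all three interfaces assumed rather than taken from the paper:

def anonymize_street_view(image, segmenter, inpainter, harmonizer,
                          sensitive=("vehicle", "pedestrian", "building")):
    """Segment the image into functional regions, replace each
    privacy-sensitive region with generated content, then harmonize the
    stitched result for visual coherence."""
    regions = segmenter(image)            # assumed: {label: binary mask}
    for label, mask in regions.items():
        if label in sensitive:
            image = inpainter(image, mask)
    return harmonizer(image)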
https://arxiv.org/abs/2501.09393
As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike plain 2D images that are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs, therefore, requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI super-resolution needs to take into account geometric distortion resulting from ERP. However, without considering such geometric distortion of ERP images, previous deep-learning-based methods only utilize a limited range of pixels and may easily miss self-similar textures for reconstruction. In this paper, we introduce a novel Geometric Distortion Guided Transformer for Omnidirectional image Super-Resolution (GDGT-OSR). Specifically, a distortion modulated rectangle-window self-attention mechanism, integrated with deformable self-attention, is proposed to better perceive the distortion and thus involve more self-similar textures. Distortion modulation is achieved through a newly devised distortion guidance generator that produces guidance by exploiting the variability of distortion across latitudes. Furthermore, we propose a dynamic feature aggregation scheme to adaptively fuse the features from different self-attention modules. We present extensive experimental results on public datasets and show that the new GDGT-OSR outperforms existing methods in the literature.
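The latitude-dependent distortion the guidance generator exploits has a simple closed form: under ERP, a row at latitude theta is stretched horizontally by roughly 1/cos(theta), growing toward the poles. A sketch of such a distortion map, one plausible form of guidance input:

import numpy as np

def erp_distortion_map(height, width):
    """Per-pixel horizontal stretch factor of an equirectangular image;
    rows near the poles are stretched most, matching the variability of
    distortion across latitudes mentioned above."""
    lat = (np.arange(height) + 0.5) / height * np.pi - np.pi / 2
    stretch = 1.0 / np.maximum(np.cos(lat), 1e-3)      # clamp near the poles
    return np.repeat(stretch[:, None], width, axis=1)  # (height, width) map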
https://arxiv.org/abs/2406.10869