Paper Reading AI Learner

# The Latest Papers about AI

• ## Analyzing and Overcoming Local Optima in Complex Multi-Objective Optimization by Decomposition-Based Evolutionary Algorithms

2024-04-12 14:29:45
##### Abstract

When addressing the challenge of complex multi-objective optimization problems, particularly those with non-convex and non-uniform Pareto fronts, Decomposition-based Multi-Objective Evolutionary Algorithms (MOEADs) often converge to local optima, thereby limiting solution diversity. Despite its significance, this issue has received limited theoretical exploration. Through a comprehensive geometric analysis, we identify that the traditional method of Reference Point (RP) selection fundamentally contributes to this challenge. In response, we introduce an innovative RP selection strategy, the Weight Vector-Guided and Gaussian-Hybrid method, designed to overcome the local optima issue. This approach employs a novel RP type that aligns with weight vector directions and integrates a Gaussian distribution to combine three distinct RP categories. Our research comprises two main experimental components: an ablation study involving 14 algorithms within the MOEADs framework, spanning from 2014 to 2022, to validate our theoretical framework, and a series of empirical tests to evaluate the effectiveness of our proposed method against both traditional and cutting-edge alternatives. Results demonstrate that our method achieves remarkable improvements in both population diversity and convergence.
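
The abstract describes mixing three reference-point (RP) categories, with one type aligned to weight-vector directions and a Gaussian distribution blending them. A rough, hypothetical sketch of such a hybrid RP sampler for a 2-objective problem follows; the mixture probabilities, perturbation scale, and the concrete RP types are assumptions for illustration, not the paper's actual scheme:

```python
import numpy as np

# Hypothetical sketch of a weight-vector-guided, Gaussian-hybrid reference
# point (RP) selection for a 2-objective problem. The mixing weights and
# RP types are illustrative assumptions, not the paper's exact method.
rng = np.random.default_rng(0)

def weight_vectors(n):
    """Uniform weight vectors on the 2-objective simplex."""
    w1 = np.linspace(0.0, 1.0, n)
    return np.stack([w1, 1.0 - w1], axis=1)

def hybrid_reference_points(ideal, nadir, n, sigma=0.05):
    """Mix three RP types: ideal-based, nadir-based, and
    weight-vector-aligned points with Gaussian perturbation."""
    W = weight_vectors(n)
    aligned = ideal + W * (nadir - ideal)              # RPs along weight directions
    aligned += rng.normal(0.0, sigma, aligned.shape)   # Gaussian hybridization
    choice = rng.choice(3, size=n, p=[0.2, 0.2, 0.6])  # assumed mixture
    return np.where(choice[:, None] == 0, ideal,
           np.where(choice[:, None] == 1, nadir, aligned))

rps = hybrid_reference_points(np.zeros(2), np.ones(2), n=10)
print(rps.shape)  # (10, 2)
```

Such a sampler would replace a fixed ideal- or nadir-point RP choice inside the MOEA/D decomposition loop.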

##### Abstract (translated)

When addressing complex multi-objective optimization problems, particularly those with non-convex and non-uniform Pareto fronts, decomposition-based multi-objective evolutionary algorithms (MOEADs) often converge to local optima, which limits solution diversity. Despite its significance, this issue has received limited theoretical exploration. Through a comprehensive geometric analysis, we find that the traditional method of Reference Point (RP) selection fundamentally contributes to this challenge. In response, we introduce an innovative RP selection strategy, the Weight Vector-Guided and Gaussian-Hybrid method, designed to overcome the local optima issue. This approach employs a novel RP type aligned with weight vector directions and integrates a Gaussian distribution to combine three distinct RP categories. Our research comprises two main experimental components: an ablation study involving 14 algorithms within the MOEADs framework, spanning 2014 to 2022, to validate our theoretical framework, and a series of empirical tests evaluating the effectiveness of the proposed method against both traditional and state-of-the-art alternatives. The results show that our method achieves remarkable improvements in both population diversity and convergence.

##### URL

https://arxiv.org/abs/2404.08501

##### PDF

https://arxiv.org/pdf/2404.08501.pdf

• ## Dataset Reset Policy Optimization for RLHF

2024-04-12 14:25:49
##### Abstract

Reinforcement Learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models, which has produced impressive models such as GPT-4 and Claude3 Opus. This framework often consists of two steps: learning a reward model from an offline preference dataset, followed by running online RL to optimize the learned reward model. In this work, leveraging the idea of reset, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data that is preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset reset: it directly resets the policy optimizer to the states in the offline dataset, instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy that is covered by the offline dataset under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful Harmful (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the metric of GPT-4 win-rate. Code for this work can be found at this https URL.
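
The dataset-reset idea, rolling out from informative offline states rather than always from the initial state distribution, can be sketched on a toy chain MDP. The environment, reset probability, and policy below are illustrative assumptions, not DR-PO's actual training setup:

```python
import random

# Minimal sketch of the dataset-reset idea behind DR-PO: instead of always
# rolling out from the initial state distribution, reset the policy
# optimizer to states drawn from the offline preference dataset. The toy
# chain MDP and the reset probability are illustrative assumptions.
random.seed(0)

OFFLINE_STATES = [3, 4, 5]   # informative states preferred by labelers
INITIAL_STATE = 0
GOAL = 6

def rollout_start(reset_prob=0.5):
    """Choose a rollout start state: dataset reset vs. initial distribution."""
    if random.random() < reset_prob:
        return random.choice(OFFLINE_STATES)   # dataset reset
    return INITIAL_STATE

def rollout(policy_step, max_steps=10, reset_prob=0.5):
    """Roll out a policy from the chosen start state; return visited states."""
    s = rollout_start(reset_prob)
    traj = [s]
    for _ in range(max_steps):
        if s == GOAL:
            break
        s = policy_step(s)
        traj.append(s)
    return traj

traj = rollout(lambda s: s + 1)   # toy policy that walks toward the goal
print(traj[0] in OFFLINE_STATES or traj[0] == INITIAL_STATE)  # True
```

Starting some rollouts from offline states lets the optimizer practice from positions the labelers already judged valuable, which is the intuition behind the coverage guarantee in the abstract.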

##### Abstract (translated)

Reinforcement learning (RL) from human preference-based feedback is a popular paradigm for fine-tuning generative models and has produced impressive models such as GPT-4 and Claude3 Opus. This framework typically consists of two steps: learning a reward model from an offline preference dataset, and then running online RL to optimize the learned reward model. In this work, leveraging the idea of resets, we propose a new RLHF algorithm with provable guarantees. Motivated by the fact that the offline preference dataset provides informative states (i.e., data preferred by the labelers), our new algorithm, Dataset Reset Policy Optimization (DR-PO), integrates the existing offline preference dataset into the online policy training procedure via dataset resets: it directly resets the policy optimizer to the states in the offline dataset instead of always starting from the initial state distribution. In theory, we show that DR-PO learns to perform at least as well as any policy covered by the offline dataset, under general function approximation with finite sample complexity. In experiments, we demonstrate that on both the TL;DR summarization and the Anthropic Helpful Harmful (HH) datasets, the generations from DR-PO are better than those from Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), under the GPT-4 win-rate metric. Code for this work can be found at this https URL.

##### URL

https://arxiv.org/abs/2404.08495

##### PDF

https://arxiv.org/pdf/2404.08495.pdf

• ## Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation

2024-04-12 14:19:16
##### Abstract

Large-scale multilingual Pretrained Language Models (mPLMs) yield impressive performance on cross-lingual tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies endeavored to narrow these disparities by supervised fine-tuning of the mPLMs with multilingual data. However, obtaining labeled multilingual data is time-consuming, and fine-tuning an mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data. Therefore, we introduce ALSACE to leverage the knowledge learned by the well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparities across various mPLMs while showing competitive performance on different multilingual NLU tasks, ranging from full-resource to limited-resource settings. The code for our approach is available at this https URL.
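
The cross-lingual self-distillation idea, letting well-performing "teacher" languages guide an under-performing one on the same input without extra labels, might be sketched as a KL-divergence loss toward a teacher consensus. The consensus-by-averaging rule below is an assumption for illustration, not necessarily ALSACE's exact objective:

```python
import numpy as np

# Hypothetical sketch of cross-lingual self-distillation: predictions from
# well-performing ("teacher") languages on the same input guide an
# under-performing language via a KL-divergence loss, with no extra labels.
# Averaging the teacher distributions is an illustrative assumption.

def kl(p, q, eps=1e-9):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def self_distillation_loss(teacher_probs, student_probs):
    """KL(teacher consensus || student) over class distributions."""
    consensus = np.mean(teacher_probs, axis=0)   # average teacher languages
    return kl(consensus, student_probs)

teachers = np.array([[0.7, 0.2, 0.1],            # e.g. English head
                     [0.6, 0.3, 0.1]])           # e.g. French head
student = np.array([0.4, 0.3, 0.3])              # under-performing language
loss = self_distillation_loss(teachers, student)
print(loss > 0.0)  # True: the distributions differ
```

Minimizing such a loss pulls the under-performing language's output distribution toward the consensus of the stronger languages on unlabeled inputs.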

##### Abstract (translated)

Large-scale multilingual pretrained language models (mPLMs) deliver impressive performance on cross-lingual tasks, yet significant performance disparities exist across different languages within the same mPLM. Previous studies tried to narrow these disparities by fine-tuning the mPLMs with multilingual data. However, obtaining labeled multilingual data is time-consuming, and fine-tuning an mPLM with limited labeled multilingual data merely encapsulates the knowledge specific to the labeled data. Therefore, we introduce ALSACE, which leverages the knowledge learned by well-performing languages to guide under-performing ones within the same mPLM, eliminating the need for additional labeled multilingual data. Experiments show that ALSACE effectively mitigates language-level performance disparities across various mPLMs while showing competitive performance on different multilingual NLU tasks, ranging from full-resource to limited-resource settings. The code for our approach is available at this https URL.

##### URL

https://arxiv.org/abs/2404.08491

##### PDF

https://arxiv.org/pdf/2404.08491.pdf

• ## SpectralMamba: Efficient Mamba for Hyperspectral Image Classification

2024-04-12 14:12:03
##### Abstract

Recurrent neural networks and Transformers have recently dominated most applications in hyperspectral (HS) imaging, owing to their capability to capture long-range dependencies from spectrum sequences. However, despite the success of these sequential architectures, the non-negligible inefficiency caused by either difficulty in parallelization or computationally prohibitive attention still hinders their practicality, especially for large-scale observation in remote sensing scenarios. To address this issue, we herein propose SpectralMamba -- a novel, efficient deep learning framework incorporating a state space model for HS image classification. SpectralMamba features simplified but adequate modeling of HS data dynamics at two levels. First, in spatial-spectral space, a dynamical mask is learned by efficient convolutions to simultaneously encode spatial regularity and spectral peculiarity, thus attenuating the spectral variability and confusion in discriminative representation learning. Second, the merged spectrum can then be efficiently operated on in the hidden state space with all parameters learned input-dependently, yielding selectively focused responses without reliance on redundant attention or unparallelizable recurrence. To explore the room for further computational downsizing, a piece-wise scanning mechanism is employed in between, transferring the approximately continuous spectrum into sequences of squeezed length while maintaining short- and long-term contextual profiles among hundreds of bands. Through extensive experiments on four benchmark HS datasets acquired by satellite-, aircraft-, and UAV-borne imagers, SpectralMamba creates promising win-wins from both the performance and efficiency perspectives.
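
The piece-wise scanning mechanism, which squeezes a long, approximately continuous spectrum into a shorter sequence of band groups while keeping local context, can be illustrated roughly as a fold of the band axis. The piece size below is an arbitrary assumption, not the paper's setting:

```python
import numpy as np

# Sketch of a piece-wise scanning idea: fold a long spectrum of B bands
# into a shorter sequence of multi-band "pieces", shrinking the sequence
# length the state space model must scan. The piece size is an assumption.

def piecewise_scan(spectrum, piece):
    """Fold a (B,) spectrum into a (B // piece, piece) sequence of pieces."""
    B = spectrum.shape[0]
    usable = (B // piece) * piece   # drop trailing bands if not divisible
    return spectrum[:usable].reshape(-1, piece)

spectrum = np.arange(200.0)         # e.g. 200 hyperspectral bands
seq = piecewise_scan(spectrum, piece=8)
print(seq.shape)  # (25, 8): sequence length squeezed from 200 to 25
```

A sequence model scanning 25 pieces instead of 200 individual bands does proportionally less sequential work, which is the computational-downsizing intuition in the abstract.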

##### Abstract (translated)

Recurrent neural networks and Transformers have recently dominated most applications in hyperspectral (HS) imaging, owing to their ability to capture long-range dependencies from spectrum sequences. However, despite the success of these sequential architectures, the non-negligible inefficiency caused by either difficulty in parallelization or computationally prohibitive attention still hinders their practicality, especially for large-scale observation in remote sensing scenarios. To address this issue, this paper proposes SpectralMamba, a novel, efficient deep learning framework incorporating a state space model for HS image classification. SpectralMamba features simplified but adequate modeling of HS data dynamics at two levels. First, in spatial-spectral space, a dynamical mask is learned by efficient convolutions to simultaneously encode spatial regularity and spectral peculiarity, thus attenuating spectral variability and confusion in discriminative representation learning. Second, the merged spectrum can then be efficiently operated on in the hidden state space, with all parameters learned input-dependently, yielding selectively focused responses without relying on redundant attention or unparallelizable recurrence. To explore room for further computational downsizing, a piece-wise scanning mechanism is employed in between, transferring the approximately continuous spectrum into sequences of squeezed length while maintaining short- and long-term contextual profiles among hundreds of bands. Through extensive experiments on four benchmark HS datasets acquired by satellite-, aircraft-, and UAV-borne imagers, SpectralMamba achieves promising win-wins from both the performance and efficiency perspectives.

##### URL

https://arxiv.org/abs/2404.08489

##### PDF

https://arxiv.org/pdf/2404.08489.pdf

• ## Thematic Analysis with Large Language Models: does it work with languages other than English? A targeted test in Italian

2024-04-12 14:10:09
##### Abstract

This paper proposes a test of performing Thematic Analysis (TA) with a Large Language Model (LLM) on data in a language other than English. While there has been initial promising work on using pre-trained LLMs for TA on data in English, we lack any tests of whether these models can reasonably perform the same analysis with good quality in other languages. In this paper, a test is proposed using an open-access dataset of semi-structured interviews in Italian. The test shows that a pre-trained model can perform such a TA on the data, also using prompts in Italian. A comparative test shows the model's capacity to produce themes that bear a good resemblance to those produced independently by human researchers. The main implication of this study is that pre-trained LLMs may be suitable to support analysis in multilingual situations, so long as the language is supported by the model used.

##### Abstract (translated)

This paper proposes a test of performing Thematic Analysis (TA) with a Large Language Model (LLM) on data in a language other than English. While there has been some initial work on using pre-trained LLMs for TA on English data, we lack any tests of whether these models can perform the same analysis with good quality in other languages. In this paper, a test is proposed using an open-access dataset of semi-structured interviews in Italian. The test shows that a pre-trained model can perform such a TA on the data, also using prompts in Italian. A comparative test shows the model's capacity to produce themes that bear a good resemblance to those produced independently by human researchers. The main conclusion of this study is that pre-trained LLMs may be suitable to support analysis in multilingual situations, as long as the language is supported by the model used.

##### URL

https://arxiv.org/abs/2404.08488

##### PDF

https://arxiv.org/pdf/2404.08488.pdf

• ## Decoding AI: The inside story of data analysis in ChatGPT

2024-04-12 13:57:30
##### Abstract

As a result of recent advancements in generative AI, the field of Data Science is subject to various changes. This review critically examines the Data Analysis (DA) capabilities of ChatGPT, assessing its performance across a wide range of tasks. While DA provides researchers and practitioners with unprecedented analytical capabilities, it is far from perfect, and it is important to recognize and address its limitations.

##### Abstract (translated)

Due to recent advances in generative AI, the field of data science is subject to various changes. This review critically examines the Data Analysis (DA) capabilities of ChatGPT, assessing its performance on a variety of tasks. While DA provides researchers and practitioners with unprecedented analytical capabilities, it is still far from perfect, and it is important to recognize and address its limitations.

##### URL

https://arxiv.org/abs/2404.08480

##### PDF

https://arxiv.org/pdf/2404.08480.pdf

• ## Swing-Up of a Weakly Actuated Double Pendulum via Nonlinear Normal Modes

2024-04-12 13:55:29
##### Abstract

We identify the nonlinear normal modes spawning from the stable equilibrium of a double pendulum under gravity, and we establish their connection to homoclinic orbits through the unstable upright position as energy increases. This result is exploited to devise an efficient swing-up strategy for a double pendulum with weak, saturating actuators. Our approach involves stabilizing the system onto periodic orbits associated with the nonlinear modes while gradually injecting energy. Since these modes are autonomous system evolutions, the required control effort for stabilization is minimal. Even with actuator limitations of less than 1% of the maximum gravitational torque, the proposed method accomplishes the swing-up of the double pendulum by allowing sufficient time.

##### Abstract (translated)

We identify the nonlinear normal modes spawning from the stable equilibrium of a double pendulum under gravity, and we establish their connection to homoclinic orbits through the unstable upright position as energy increases. This result is exploited to devise an efficient swing-up strategy for a double pendulum with weak, saturating actuators. Our approach involves stabilizing the system onto periodic orbits associated with the nonlinear modes while gradually injecting energy. Since these modes are autonomous system evolutions, the control effort required for stabilization is minimal. Even with actuator limits below 1% of the maximum gravitational torque, the proposed method accomplishes the swing-up of the double pendulum, given sufficient time.

##### URL

https://arxiv.org/abs/2404.08478

##### PDF

https://arxiv.org/pdf/2404.08478.pdf

• ## New Efficient Visual OILU Markers

2024-04-12 13:55:05
##### Abstract

Basic patterns are the source of a wide range of more or less complex geometric structures. We exploit such patterns to develop new, efficient visual markers. Besides being projective invariants, the proposed markers allow producing a rich panel of unique identifiers, which is highly required for resource-intensive navigation and augmented reality applications. The spiral topology of our markers permits the validation of an accurate identification scheme based on level set methods. The robustness of the markers against acquisition and geometric distortions is validated by extensive experimental tests.

##### Abstract (translated)

Basic patterns are the source of a wide range of more or less complex geometric structures. We exploit such patterns to develop new, efficient visual markers. Besides being projective invariants, the proposed markers allow generating a rich panel of unique identifiers, which is highly required for resource-intensive navigation and augmented reality applications. The spiral topology of our markers permits the validation of an accurate identification scheme based on level set methods. The robustness of the markers against acquisition and geometric distortions is validated by extensive experimental tests.

##### URL

https://arxiv.org/abs/2404.08477

##### PDF

https://arxiv.org/pdf/2404.08477.pdf

• ## Combining Statistical Depth and Fermat Distance for Uncertainty Quantification

2024-04-12 13:54:21
##### Abstract

We measure the out-of-domain uncertainty in the predictions of neural networks using a statistical notion called "Lens Depth" (LD) combined with Fermat Distance, which is able to capture precisely the "depth" of a point with respect to a distribution in feature space, without any assumption about the form of the distribution. Our method has no trainable parameters. The method is applicable to any classification model, as it is applied directly in feature space at test time and does not intervene in the training process. As such, it does not impact the performance of the original model. The proposed method gives excellent qualitative results on toy datasets and can give competitive or better uncertainty estimates on standard deep learning datasets compared to strong baseline methods.
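
Empirical Lens Depth itself is straightforward to compute: LD(x) is the fraction of sample pairs (a, b) whose "lens", the intersection of the two balls of radius d(a, b) centered at a and b, contains x. A minimal sketch with plain Euclidean distance follows; the paper pairs LD with Fermat distance, which is omitted here as a simplifying assumption:

```python
import numpy as np
from itertools import combinations

# Minimal empirical Lens Depth with Euclidean distance (the paper combines
# LD with Fermat distance learned from the data; plain Euclidean here is a
# simplifying assumption). A point x is inside the lens of a pair (a, b)
# iff max(d(x, a), d(x, b)) <= d(a, b).

def lens_depth(x, sample):
    n = len(sample)
    inside = 0
    for i, j in combinations(range(n), 2):
        a, b = sample[i], sample[j]
        dab = np.linalg.norm(a - b)
        if max(np.linalg.norm(x - a), np.linalg.norm(x - b)) <= dab:
            inside += 1
    return inside / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
sample = rng.normal(size=(200, 2))            # in-distribution feature cloud
center = lens_depth(np.zeros(2), sample)      # deep, in-distribution point
far = lens_depth(np.array([10.0, 10.0]), sample)  # out-of-domain point
print(center > far)  # True: deep points score higher than outliers
```

Low depth in feature space then flags a test point as out-of-domain, with no retraining and no trainable parameters, matching the abstract's claims.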

##### Abstract (translated)

We measure the out-of-domain uncertainty in neural network predictions using a statistical notion called "Lens Depth" (LD) combined with Fermat distance. LD can accurately capture the "depth" of a point with respect to a distribution in feature space, without any assumption about the form of the distribution. Our method has no trainable parameters. It can be applied directly in feature space at test time and does not intervene in the training process; as a result, it does not affect the performance of the original model. The proposed method gives excellent qualitative results on toy datasets and provides competitive or better uncertainty estimates on standard deep learning datasets compared with strong baseline methods.

##### URL

https://arxiv.org/abs/2404.08476

##### PDF

https://arxiv.org/pdf/2404.08476.pdf

• ## OTTER: Improving Zero-Shot Classification via Optimal Transport

2024-04-12 13:18:47
##### Abstract

Popular zero-shot models suffer due to artifacts inherited from pretraining. A particularly detrimental artifact, caused by unbalanced web-scale pretraining data, is mismatched label distribution. Existing approaches that seek to repair the label distribution are not suitable in zero-shot settings, as they have incompatible requirements such as access to labeled downstream task data or knowledge of the true label balance in the pretraining distribution. We sidestep these challenges and introduce a simple and lightweight approach to adjust pretrained model predictions via optimal transport. Our technique requires only an estimate of the label distribution of a downstream task. Theoretically, we characterize the improvement produced by our procedure under certain mild conditions and provide bounds on the error caused by misspecification. Empirically, we validate our method in a wide array of zero-shot image and text classification tasks, improving accuracy by 4.8% and 15.9% on average, and beating baselines like Prior Matching -- often by significant margins -- in 17 out of 21 datasets.
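
The core adjustment, rebalancing predictions so that their aggregate matches an estimated downstream label distribution, can be sketched with Sinkhorn scaling: rows are examples carrying uniform mass, columns must match the target label marginal. The kernel construction and hyperparameters below are illustrative assumptions, not necessarily OTTER's exact formulation:

```python
import numpy as np

# Sketch of rebalancing zero-shot predictions with optimal transport: find
# a transport plan whose rows are examples (uniform mass) and whose column
# sums match an estimated downstream label distribution, via Sinkhorn
# scaling on the model's predicted probabilities. Hyperparameters are
# illustrative assumptions.

def sinkhorn_rebalance(probs, label_dist, iters=200, eps=1.0):
    K = np.exp(np.log(probs + 1e-12) / eps)   # kernel from log-probs
    n = probs.shape[0]
    r = np.full(n, 1.0 / n)                   # uniform mass per example
    u = np.ones(n)
    for _ in range(iters):
        v = label_dist / (K.T @ u)            # match column marginals
        u = r / (K @ v)                       # match row marginals
    return u[:, None] * K * v[None, :]        # transport plan

probs = np.array([[0.9, 0.1],                 # model biased toward class 0
                  [0.8, 0.2],
                  [0.6, 0.4],
                  [0.7, 0.3]])
plan = sinkhorn_rebalance(probs, np.array([0.5, 0.5]))
print(np.round(plan.sum(axis=0), 3))  # column sums match [0.5 0.5]
```

Predicting the argmax of each row of the plan then yields labels whose overall distribution matches the estimate, instead of inheriting the pretraining bias.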

##### Abstract (translated)

Popular zero-shot models suffer from artifacts inherited from pretraining. A particularly detrimental artifact, caused by unbalanced web-scale pretraining data, is a mismatched label distribution. Existing approaches that seek to repair the label distribution are not suitable in zero-shot settings, as they have incompatible requirements such as access to labeled downstream task data or knowledge of the true label balance in the pretraining distribution. We sidestep these challenges and introduce a simple, lightweight approach that adjusts pretrained model predictions via optimal transport. Our technique requires only an estimate of the label distribution of the downstream task. Theoretically, we characterize the improvement produced by our procedure under certain mild conditions and provide bounds on the error caused by misspecification. Empirically, we validate our method on a wide array of zero-shot image and text classification tasks, improving accuracy by 4.8% and 15.9% on average, and beating baselines such as Prior Matching, often by significant margins, in 17 out of 21 datasets.

##### URL

https://arxiv.org/abs/2404.08461

##### PDF

https://arxiv.org/pdf/2404.08461.pdf

• ## On the Independence Assumption in Neurosymbolic Learning

2024-04-12 13:09:48
##### Abstract

State-of-the-art neurosymbolic learning systems use probabilistic reasoning to guide neural networks towards predictions that conform to logical constraints over symbols. Many such systems assume that the probabilities of the considered symbols are conditionally independent given the input to simplify learning and reasoning. We study and criticise this assumption, highlighting how it can hinder optimisation and prevent uncertainty quantification. We prove that loss functions bias conditionally independent neural networks to become overconfident in their predictions. As a result, they are unable to represent uncertainty over multiple valid options. Furthermore, we prove that these loss functions are difficult to optimise: they are non-convex, and their minima are usually highly disconnected. Our theoretical analysis gives the foundation for replacing the conditional independence assumption and designing more expressive neurosymbolic probabilistic models.
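
The inability to represent uncertainty over multiple valid options can be seen in a two-symbol toy example. Under the constraint "exactly one of A, B is true", a conditionally independent model with marginals P(A)=p, P(B)=q can only put all its mass on the two valid worlds by becoming deterministic:

```python
import numpy as np

# Toy illustration of the paper's point: with the constraint "exactly one
# of A, B is true", an independent model P(A)=p, P(B)=q cannot put all
# probability mass on the two valid worlds while staying uncertain.

def prob_valid(p, q):
    """Mass on the valid worlds {(1,0), (0,1)} under independence."""
    return p * (1 - q) + (1 - p) * q

# Staying uncertain between the two options leaks mass to invalid worlds:
print(prob_valid(0.5, 0.5))  # 0.5 -- half the mass sits on (0,0) and (1,1)

# Scanning the unit square, prob_valid == 1 only at deterministic corners:
grid = np.linspace(0, 1, 101)
P, Q = np.meshgrid(grid, grid)
V = prob_valid(P, Q)
best = np.argwhere(np.isclose(V, 1.0))
print([(float(grid[i]), float(grid[j])) for i, j in best])
# Only two solutions, the corners (0, 1) and (1, 0) of the unit square.
```

Any loss that rewards mass on valid worlds therefore pushes the independent model to one of the two corners, i.e., to an overconfident prediction, which is exactly the bias the abstract describes.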

##### Abstract (translated)

State-of-the-art neurosymbolic learning systems use probabilistic reasoning to guide neural networks toward predictions that conform to logical constraints over symbols. Many such systems assume that the probabilities of the considered symbols are conditionally independent given the input, in order to simplify learning and reasoning. We study and criticise this assumption, highlighting how it can hinder optimisation and prevent uncertainty quantification. We prove that loss functions bias conditionally independent neural networks toward becoming overconfident in their predictions. As a result, they are unable to represent uncertainty over multiple valid options. Furthermore, we prove that these loss functions are difficult to optimise: they are non-convex, and their minima are usually highly disconnected. Our theoretical analysis lays the foundation for replacing the conditional independence assumption and designing more expressive neurosymbolic probabilistic models.

##### URL

https://arxiv.org/abs/2404.08458

##### PDF

https://arxiv.org/pdf/2404.08458.pdf

• ## MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection

2024-04-12 13:02:08
##### Abstract

Deepfakes have recently raised significant trust issues and security concerns among the public. Compared to CNN face forgery detectors, ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance. However, these approaches still exhibit the following limitations: (1). Fully fine-tuning ViT-based models from ImageNet weights demands substantial computational and storage resources; (2). ViT-based methods struggle to capture local forgery clues, leading to model bias and limited generalizability. To tackle these challenges, this work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach. MoE-FFD only updates lightweight Low-Rank Adaptation (LoRA) and Adapter layers while keeping the ViT backbone frozen, thereby achieving parameter-efficient training. Moreover, MoE-FFD leverages the expressivity of transformers and local priors of CNNs to simultaneously extract global and local forgery clues. Additionally, novel MoE modules are designed to scale the model's capacity and select optimal forgery experts, further enhancing forgery detection performance. The proposed MoE learning scheme can be seamlessly adapted to various transformer backbones in a plug-and-play manner. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art face forgery detection performance with reduced parameter overhead. The code will be released upon acceptance.
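
The parameter-efficient part, keeping the pretrained weights frozen and training only a low-rank update, can be sketched in a few lines. The dimensions and zero-initialization of the up-projection follow common LoRA practice and are assumptions here, not necessarily MoE-FFD's exact configuration:

```python
import numpy as np

# Sketch of the parameter-efficient idea in MoE-FFD: the pretrained weight
# W stays frozen and only a low-rank update B @ A (LoRA) is trained, so
# the effective layer computes h = W x + B A x. Dimensions are illustrative.

rng = np.random.default_rng(0)
d, r = 8, 2                          # hidden size and LoRA rank, r << d
W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, init to zero

def lora_forward(x):
    return W @ x + B @ (A @ x)       # frozen path + low-rank adaptation

x = rng.normal(size=d)
# With B initialized to zero, the adapted layer equals the frozen layer:
print(np.allclose(lora_forward(x), W @ x))  # True

trainable = A.size + B.size
print(trainable, W.size)  # 32 trainable vs. 64 frozen parameters
```

Only A and B (and, in MoE-FFD, the expert-routing modules) receive gradients, which is why the approach trains with a fraction of the backbone's parameter count.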

##### Abstract (translated)

Deepfakes have recently raised significant trust and security concerns among the public. Compared with CNN-based face forgery detectors, ViT-based methods exploit the expressivity of Transformers and achieve superior detection performance. However, these approaches still exhibit the following limitations: (1) fully fine-tuning ViT-based models from ImageNet weights demands substantial computational and storage resources; (2) ViT-based methods struggle to capture local forgery clues, leading to model bias and limited generalizability. To tackle these challenges, this work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach. MoE-FFD updates only lightweight Low-Rank Adaptation (LoRA) and Adapter layers while keeping the ViT backbone frozen, thereby achieving parameter-efficient training. Moreover, MoE-FFD leverages the expressivity of Transformers and the local priors of CNNs to simultaneously extract global and local forgery clues. Additionally, novel MoE modules are designed to scale the model's capacity and select optimal forgery experts, further enhancing detection performance. The proposed MoE learning scheme can be seamlessly adapted to various Transformer backbones in a plug-and-play manner. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art face forgery detection performance with reduced parameter overhead. The code will be released upon acceptance.

##### URL

https://arxiv.org/abs/2404.08452

##### PDF

https://arxiv.org/pdf/2404.08452.pdf

• ## Joint Physical-Digital Facial Attack Detection Via Simulating Spoofing Clues

2024-04-12 13:01:22
##### Abstract

Face recognition systems are frequently subjected to a variety of physical and digital attacks of different types. Previous methods have achieved satisfactory performance in scenarios that address physical attacks and digital attacks, respectively. However, few methods are considered to integrate a model that simultaneously addresses both physical and digital attacks, implying the necessity to develop and maintain multiple models. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating spoofing clues of physical and digital attacks, respectively, which significantly improve the capability of the model to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC can achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in "Unified Physical-Digital Face Attack Detection" of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at this https URL.

##### Abstract (translated)

Face recognition systems are frequently subjected to a variety of physical and digital attacks. Previous methods have achieved satisfactory performance in scenarios addressing physical attacks and digital attacks separately. However, few methods integrate a model that simultaneously addresses both physical and digital attacks, implying the need to develop and maintain multiple models. To jointly detect physical and digital attacks within a single model, we propose an innovative approach that can adapt to any network architecture. Our approach mainly contains two types of data augmentation, which we call Simulated Physical Spoofing Clues augmentation (SPSC) and Simulated Digital Spoofing Clues augmentation (SDSC). SPSC and SDSC augment live samples into simulated attack samples by simulating the spoofing clues of physical and digital attacks, respectively, significantly improving the model's ability to detect "unseen" attack types. Extensive experiments show that SPSC and SDSC achieve state-of-the-art generalization in Protocols 2.1 and 2.2 of the UniAttackData dataset, respectively. Our method won first place in the "Unified Physical-Digital Face Attack Detection" track of the 5th Face Anti-spoofing Challenge@CVPR2024. Our final submission obtains 3.75% APCER, 0.93% BPCER, and 2.34% ACER, respectively. Our code is available at this https URL.

##### URL

https://arxiv.org/abs/2404.08450

##### PDF

https://arxiv.org/pdf/2404.08450.pdf

• ## OccGaussian: 3D Gaussian Splatting for Occluded Human Rendering

2024-04-12 13:00:06
##### Abstract

Rendering dynamic 3D humans from monocular videos is crucial for various applications such as virtual reality and digital entertainment. Most methods assume the person is in an unobstructed scene, while various objects may occlude body parts in real-life scenarios. A previous method utilizes NeRF-based surface rendering to recover the occluded areas, but it requires more than one day to train and several seconds to render, failing to meet the requirements of real-time interactive applications. To address these issues, we propose OccGaussian, based on 3D Gaussian Splatting, which can be trained within 6 minutes and produces high-quality human renderings at up to 160 FPS from occluded input. OccGaussian initializes 3D Gaussian distributions in the canonical space, and we perform occlusion feature queries in occluded regions; the aggregated pixel-aligned feature is extracted to compensate for the missing information. We then use a Gaussian Feature MLP to further process the feature, along with occlusion-aware loss functions, to better perceive the occluded area. Extensive experiments on both simulated and real-world occlusions demonstrate that our method achieves performance comparable or even superior to the state-of-the-art method, while improving training and inference speeds by 250x and 800x, respectively. Our code will be available for research purposes.

##### Abstract (translated)

Rendering dynamic 3D humans from monocular videos is crucial for applications such as virtual reality and digital entertainment. Most methods assume the person is in an unobstructed scene, whereas in real-life scenarios various objects may occlude body parts. A previous method uses NeRF surface rendering to recover the occluded areas, but it requires more than one day to train and several seconds to render, failing to meet the requirements of real-time interactive applications. To address these problems, we propose OccGaussian, based on 3D Gaussian Splatting, which can be trained within 6 minutes and produces high-quality human renderings at up to 160 FPS from occluded input. OccGaussian initializes 3D Gaussian distributions in the canonical space; we perform occlusion feature queries in occluded regions and extract an aggregated pixel-aligned feature to compensate for the missing information. We then use a Gaussian Feature MLP to further process the feature, together with occlusion-aware loss functions, to better perceive the occluded area. Experiments on both simulated and real-world occlusions show that our method achieves performance comparable to or even better than the state-of-the-art method, while improving training and inference speeds by 250x and 800x, respectively. Our code will be made available for research purposes.

##### URL

https://arxiv.org/abs/2404.08449

##### PDF

https://arxiv.org/pdf/2404.08449.pdf

• ## An improved tabular data generator with VAE-GMM integration

2024-04-12 12:31:06
##### Abstract

The rising use of machine learning in various fields requires robust methods to create synthetic tabular data. Data should preserve key characteristics while addressing data scarcity challenges. Current approaches based on Generative Adversarial Networks, such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data. These data often contain both continuous and discrete features with non-Gaussian distributions. Therefore, we propose a novel Variational Autoencoder (VAE)-based model that addresses these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space, allowing for a more accurate representation of the underlying data distribution during data generation. Furthermore, our model offers enhanced flexibility by allowing the use of various differentiable distributions for individual features, making it possible to handle both continuous and discrete data types. We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones, based on their resemblance and utility. This evaluation demonstrates significant outperformance against CTGAN and TVAE, establishing its potential as a valuable tool for generating synthetic tabular data in various domains, particularly in healthcare.

##### Abstract (translated)

The rising use of machine learning in various fields requires robust methods for creating synthetic tabular data. The data should preserve key characteristics while addressing data-scarcity challenges. Current approaches based on Generative Adversarial Networks (GANs), such as the state-of-the-art CTGAN model, struggle with the complex structures inherent in tabular data, which often contains both continuous and discrete features with non-Gaussian distributions. Therefore, we propose a novel Variational Autoencoder (VAE)-based model that overcomes these limitations. Inspired by the TVAE model, our approach incorporates a Bayesian Gaussian Mixture model (BGM) within the VAE architecture. This avoids the limitations imposed by assuming a strictly Gaussian latent space, allowing a more accurate representation of the underlying data distribution during data generation. Furthermore, our model offers enhanced flexibility by allowing various differentiable distributions for individual features, making it possible to handle both continuous and discrete data types. We thoroughly validate our model on three real-world datasets with mixed data types, including two medically relevant ones, based on their resemblance and utility. This evaluation demonstrates significant outperformance over CTGAN and TVAE, establishing the model's potential as a valuable tool for generating synthetic tabular data in various domains, particularly healthcare.

##### URL

https://arxiv.org/abs/2404.08434

##### PDF

https://arxiv.org/pdf/2404.08434.pdf

• ## MSSTNet: A Multi-Scale Spatio-Temporal CNN-Transformer Network for Dynamic Facial Expression Recognition

2024-04-12 12:30:48
##### Abstract

Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. Addressing this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach takes spatial features of different scales extracted by CNN and feeds them into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former simultaneously extracts temporal information while continually integrating multi-scale spatial information. This process culminates in the generation of multi-scale spatio-temporal features that are utilized for the final classification. Our method achieves state-of-the-art results on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations provide further validation of our approach's proficiency in leveraging spatio-temporal information within DFER.

##### Abstract (translated)

Unlike typical video action recognition, Dynamic Facial Expression Recognition (DFER) does not involve distinct moving targets but relies on localized changes in facial muscles. To address this distinctive attribute, we propose a Multi-Scale Spatio-temporal CNN-Transformer network (MSSTNet). Our approach feeds spatial features of different scales extracted by a CNN into a Multi-scale Embedding Layer (MELayer). The MELayer extracts multi-scale spatial information and encodes these features before sending them into a Temporal Transformer (T-Former). The T-Former extracts temporal information while continually integrating multi-scale spatial information. This process culminates in multi-scale spatio-temporal features that are used for the final classification. Our method achieves state-of-the-art performance on two in-the-wild datasets. Furthermore, a series of ablation experiments and visualizations further validate our approach's ability to exploit spatio-temporal information in DFER.

##### URL

https://arxiv.org/abs/2404.08433

##### PDF

https://arxiv.org/pdf/2404.08433.pdf

• ## Comparing Apples to Oranges: LLM-powered Multimodal Intention Prediction in an Object Categorization Task

2024-04-12 12:15:14
##### Abstract

Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions to proactively interact with humans and adapt to their behavior. Therefore, intention prediction is pivotal in creating a natural interactive collaboration between humans and robots. In this paper, we examine the use of Large Language Models (LLMs) for inferring human intention during a collaborative object categorization task with a physical robot. We introduce a hierarchical approach for interpreting user non-verbal cues, like hand gestures, body poses, and facial expressions and combining them with environment states and user verbal cues captured using an existing Automatic Speech Recognition (ASR) system. Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities and real-world knowledge to support intention prediction during human-robot interaction.

##### Abstract (translated)

Intention-based Human-Robot Interaction (HRI) systems allow robots to perceive and interpret user actions in order to interact with humans proactively and adapt to their behavior. Intention prediction is therefore pivotal in creating natural, interactive collaboration between humans and robots. In this paper, we examine the use of Large Language Models (LLMs) for inferring human intention during a collaborative object categorization task with a physical robot. We introduce a hierarchical approach for interpreting user non-verbal cues, such as hand gestures, body poses, and facial expressions, and combining them with environment states and user verbal cues captured by an existing Automatic Speech Recognition (ASR) system. Our evaluation demonstrates the potential of LLMs to interpret non-verbal cues and to combine them with their context-understanding capabilities and real-world knowledge to support intention prediction during human-robot interaction.

##### URL

https://arxiv.org/abs/2404.08424

##### PDF

https://arxiv.org/pdf/2404.08424.pdf

• ## Adapting the Segment Anything Model During Usage in Novel Situations

2024-04-12 12:10:53
##### Abstract

The interactive segmentation task consists in the creation of object segmentation masks based on user interactions. The most common way to guide a model towards producing a correct segmentation consists in clicks on the object and background. The recently published Segment Anything Model (SAM) supports a generalized version of the interactive segmentation problem and has been trained on an object segmentation dataset which contains 1.1B masks. Though trained extensively and with the explicit purpose of serving as a foundation model, we show significant limitations of SAM when it is applied to interactive segmentation on novel domains or object types. On the used datasets, SAM displays a failure rate $\text{FR}_{30}@90$ of up to $72.6 \%$. Since we still want such foundation models to be immediately applicable, we present a framework that can adapt SAM during immediate usage. For this we leverage the user interactions and masks, which are constructed during the interactive segmentation process. We use this information to generate pseudo-labels, which we use to compute a loss function and optimize a part of the SAM model. The presented method causes a relative reduction of up to $48.1 \%$ in the $\text{FR}_{20}@85$ and $46.6 \%$ in the $\text{FR}_{30}@90$ metrics.
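
The $\text{FR}_k@q$ failure-rate metric reported above can be computed directly: it is the fraction of samples that never reach IoU threshold $q$ within $k$ user clicks. A small sketch, with toy IoU-per-click curves fabricated purely for illustration:

```python
import numpy as np

# Sketch of the FR_k@q failure-rate metric: the fraction of samples that
# never reach IoU threshold q within k user clicks. The toy IoU-per-click
# curves below are fabricated for illustration.

def failure_rate(ious_per_click, k, q):
    """ious_per_click: (n_samples, max_clicks) best IoU after each click."""
    reached = (ious_per_click[:, :k] >= q).any(axis=1)
    return 1.0 - reached.mean()

ious = np.array([[0.50, 0.70, 0.92],   # reaches IoU 0.90 at click 3
                 [0.40, 0.60, 0.80],   # never reaches 0.90
                 [0.91, 0.95, 0.97]])  # reaches 0.90 immediately
print(failure_rate(ious, k=3, q=0.90))  # ~0.333: 1 of 3 samples fails
```

Tracking this metric before and after the on-the-fly adaptation is how the relative reductions quoted in the abstract would be measured.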

##### Abstract (translated)

The interactive segmentation task consists in creating object segmentation masks based on user interactions. The most common way to guide a model toward a correct segmentation is through clicks on the object and background. The recently published Segment Anything Model (SAM) supports a generalized version of the interactive segmentation problem and has been trained on an object segmentation dataset containing 1.1B masks. Although trained extensively and with the explicit purpose of serving as a foundation model, SAM shows significant limitations when applied to interactive segmentation on novel domains or object types. On the datasets used, SAM displays a failure rate $\text{FR}_{30}@90$ of up to 72.6%. Since we still want such foundation models to be immediately applicable, we present a framework that can adapt SAM during immediate usage. For this, we leverage the user interactions and masks constructed during the interactive segmentation process. We use this information to generate pseudo-labels, which we use to compute a loss function and optimize part of the SAM model. The presented method yields a relative reduction of up to 48.1% in the $\text{FR}_{20}@85$ metric and 46.6% in the $\text{FR}_{30}@90$ metric.

##### URL

https://arxiv.org/abs/2404.08421

##### PDF

https://arxiv.org/pdf/2404.08421.pdf

• ## Direct May Not Be the Best: An Incremental Evolution View of Pose Generation

2024-04-12 12:08:06
##### Abstract

Pose diversity is an inherent representative characteristic of 2D images. Due to the 3D-to-2D projection mechanism, there is evident content discrepancy among distinct pose images. This is the main obstacle hindering pose-transformation-related research. To deal with this challenge, we propose a fine-grained, incremental-evolution-centered pose generation framework, rather than the traditional direct one-to-one generation. Since the proposed approach bypasses the theoretical difficulty of directly modeling the dramatic non-linear variation, the incurred content distortion and blurring can be effectively constrained, while at the same time the various individual pose details, especially clothes texture, can be precisely maintained. In order to systematically guide the evolution course, both global and incremental evolution constraints are elaborately designed and merged into the overall framework. A novel triple-path knowledge fusion structure is worked out to take full advantage of all available valuable knowledge to conduct high-quality pose synthesis. In addition, our framework can generate a series of valuable byproducts, namely the various intermediate poses. Extensive experiments have been conducted to verify the effectiveness of the proposed approach. Code is available at this https URL.

##### Abstract (translated)

Pose diversity is an inherent, representative characteristic of 2D images. Due to the 3D-to-2D projection mechanism, there is evident content discrepancy among distinct pose images. This is the main obstacle for pose-transformation-related research. To deal with this challenge, we propose a fine-grained, incremental-evolution-centered pose generation framework, rather than the traditional direct one-to-one generation. Since the proposed approach bypasses the theoretical difficulty of directly modeling dramatic non-linear variation, the incurred content distortion and blurring can be effectively constrained, while the various individual pose details, especially clothes texture, are precisely maintained. To systematically guide the evolution course, global and incremental evolution constraints are elaborately designed and merged into the overall framework. A novel triple-path knowledge fusion structure is also devised to take full advantage of all available knowledge for high-quality pose synthesis. In addition, our framework can generate a series of valuable byproducts, namely the various intermediate poses. Extensive experiments have been conducted to verify the effectiveness of the proposed approach. Code is available at this https URL.

##### URL

https://arxiv.org/abs/2404.08419

##### PDF

https://arxiv.org/pdf/2404.08419.pdf

• ## AdapterSwap: Continuous Training of LLMs with Data Removal and Access-Control Guarantees

2024-04-12 12:06:02
##### Abstract

Large language models (LLMs) are increasingly capable of completing knowledge intensive tasks by recalling information from a static pretraining corpus. Here we are concerned with LLMs in the context of evolving data requirements. For instance: batches of new data that are introduced periodically; subsets of data with user-based access controls; or requirements on dynamic removal of documents with guarantees that associated knowledge cannot be recalled. We wish to satisfy these requirements while at the same time ensuring a model does not forget old information when new data becomes available. To address these issues, we introduce AdapterSwap, a training and inference scheme that organizes knowledge from a data collection into a set of low-rank adapters, which are dynamically composed during inference. Our experiments demonstrate AdapterSwap's ability to support efficient continual learning, while also enabling organizations to have fine-grained control over data access and deletion.
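
Organizing knowledge as swappable low-rank adapters over a frozen base can be sketched in a few lines. The composition-by-summation rule and the access-control interface below are illustrative assumptions, not AdapterSwap's exact design:

```python
import numpy as np

# Sketch of composing low-rank adapters at inference, in the spirit of
# AdapterSwap: each knowledge partition gets its own adapter, and only the
# adapters the caller may access are added to the frozen base weight. The
# summation-based composition rule is an illustrative assumption.

rng = np.random.default_rng(0)
d, r = 6, 2
W = rng.normal(size=(d, d))   # frozen base model weight

adapters = {                  # one low-rank (B, A) pair per data partition
    "public":  (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
    "finance": (rng.normal(size=(d, r)), rng.normal(size=(r, d))),
}

def effective_weight(allowed):
    """Base weight plus the adapters the caller is allowed to access."""
    W_eff = W.copy()
    for name in allowed:
        B, A = adapters[name]
        W_eff += B @ A
    return W_eff

# Removing a partition's data amounts to dropping its adapter:
with_finance = effective_weight(["public", "finance"])
without = effective_weight(["public"])
print(np.allclose(with_finance, without))  # False: that knowledge is gone
```

Because each partition's knowledge lives only in its own adapter, deleting the adapter gives a hard guarantee that the associated knowledge cannot be recalled, while the frozen base and remaining adapters are untouched.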

##### Abstract (translated)

Large language models (LLMs) are increasingly capable of completing knowledge-intensive tasks by recalling information from a static pretraining corpus. Here we are concerned with LLMs in the context of evolving data requirements, for instance: batches of new data introduced periodically; subsets of data with user-based access controls; or requirements for the dynamic removal of documents with guarantees that the associated knowledge cannot be recalled. We wish to satisfy these requirements while ensuring that the model does not forget old information when new data becomes available. To address these issues, we introduce AdapterSwap, a training and inference scheme that organizes the knowledge in a data collection into a set of low-rank adapters, which are dynamically composed during inference. Our experiments demonstrate AdapterSwap's ability to support efficient continual learning while enabling organizations to exercise fine-grained control over data access and deletion.

##### URL

https://arxiv.org/abs/2404.08417

##### PDF

https://arxiv.org/pdf/2404.08417.pdf
