Despite remarkable advancements, mainstream gaze estimation techniques, particularly appearance-based methods, often suffer from performance degradation in uncontrolled environments due to variations in illumination and individual facial attributes. Existing domain adaptation strategies, limited by their need for target domain samples, may fall short in real-world applications. This letter introduces Branch-out Auxiliary Regularization (BAR), an innovative method designed to boost gaze estimation's generalization capabilities without requiring direct access to target domain data. Specifically, BAR integrates two auxiliary consistency regularization branches: one that uses augmented samples to counteract environmental variations, and another that aligns gaze directions with positive source domain samples to encourage the learning of consistent gaze features. These auxiliary pathways strengthen the core network and are integrated in a smooth, plug-and-play manner, facilitating easy adaptation to various other models. Comprehensive experimental evaluations on four cross-dataset tasks demonstrate the superiority of our approach.
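As a rough illustration of the branch-out idea, the sketch below combines a main gaze-regression loss with two auxiliary consistency branches. The L1 losses, the detached targets, and the weights lambda_aug and lambda_pos are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def bar_style_loss(model, images, gaze_labels, aug_images, pos_images,
                   lambda_aug=1.0, lambda_pos=1.0):
    """Hedged sketch of a BAR-style objective: main regression loss plus two
    auxiliary consistency regularization branches."""
    pred = model(images)                      # main-branch gaze prediction
    loss_main = F.l1_loss(pred, gaze_labels)  # standard gaze regression loss

    # Branch 1: predictions should stay stable under appearance augmentations
    # (illumination, color jitter, etc.), countering environmental variation.
    pred_aug = model(aug_images)
    loss_aug = F.l1_loss(pred_aug, pred.detach())

    # Branch 2: align with a "positive" source-domain sample that shares
    # (approximately) the same gaze direction, encouraging consistent features.
    pred_pos = model(pos_images)
    loss_pos = F.l1_loss(pred, pred_pos.detach())

    return loss_main + lambda_aug * loss_aug + lambda_pos * loss_pos
```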
https://arxiv.org/abs/2405.01439
Training deep neural networks (DNNs) from noisy labels is an important and challenging task. However, most existing approaches focus on the corrupted labels and ignore the importance of the inherent data structure. To bridge the gap between noisy labels and data, inspired by the concept of potential energy in physics, we propose a novel Potential Energy based Mixture Model (PEMM) for noisy-label learning. We introduce a distance-based classifier with potential-energy regularization on its class centers. Embedding the proposed classifier into existing deep learning backbones yields robust networks with better feature representations that preserve the intrinsic structure of the data, resulting in superior noise tolerance. We conducted extensive experiments to analyze the efficiency of our proposed model on several real-world datasets. Quantitative results show that it can achieve state-of-the-art performance.
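A minimal sketch of the distance-based classifier idea follows, with a Coulomb-like repulsive energy between class centers standing in for the potential-energy regularization; the exact energy form and the weight 0.1 in the usage comment are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistanceClassifier(nn.Module):
    """Distance-based classifier whose logits are negative squared distances
    to learnable class centers; a potential-energy style penalty keeps the
    centers well separated (hedged sketch, not PEMM's exact formulation)."""
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats):
        return -torch.cdist(feats, self.centers).pow(2)

    def center_energy(self, eps=1e-6):
        # Coulomb-like repulsion between centers: large when centers collapse.
        d = torch.cdist(self.centers, self.centers)
        off_diag = ~torch.eye(len(self.centers), dtype=torch.bool, device=d.device)
        return (1.0 / (d[off_diag] + eps)).mean()

# Illustrative use with any feature backbone:
#   logits = clf(backbone(x))
#   loss = F.cross_entropy(logits, noisy_labels) + 0.1 * clf.center_energy()
```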
https://arxiv.org/abs/2405.01186
Accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, there is a need for high robustness, i.e., no failures during tracking. To achieve that, one needs to efficiently tackle challenges such as device obscuration by contrast agent or other external devices or wires, changes in field-of-view or acquisition angle, as well as continuous movement due to cardiac and respiratory motion. To overcome the aforementioned challenges, we propose a novel approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance and, in particular, robustness compared to highly optimized reference solutions (that use multi-stage feature fusion, multi-task and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), achieving a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
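The masked frame-interpolation objective can be sketched roughly as below; the patch size, mask ratio, channel-stacked conditioning, and the encoder/decoder interfaces are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def interpolation_pretrain_step(encoder, decoder, clip, mask_ratio=0.75, patch=16):
    """Hedged sketch of masked-image-modeling pretraining via frame interpolation.
    clip: (B, T, C, H, W) short interventional X-ray sequence."""
    b, t, c, h, w = clip.shape
    target = clip[:, t // 2]                                # middle frame to reconstruct
    context = torch.cat([clip[:, 0], clip[:, -1]], dim=1)   # first and last frames

    # Randomly mask patches of the target frame; neighbour frames stay fully visible.
    mask = (torch.rand(b, 1, h // patch, w // patch, device=clip.device) < mask_ratio).float()
    mask = F.interpolate(mask, size=(h, w), mode="nearest")
    masked_target = target * (1 - mask)

    recon = decoder(encoder(torch.cat([context, masked_target], dim=1)))
    # Reconstruction loss only on masked regions forces the model to learn
    # fine inter-frame temporal correspondences.
    return ((recon - target) ** 2 * mask).sum() / mask.sum().clamp(min=1)
```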
https://arxiv.org/abs/2405.01156
Communication bottlenecks hinder the scalability of distributed neural network training, particularly on distributed-memory computing clusters. To significantly reduce this communication overhead, we introduce AB-training, a novel data-parallel training method that decomposes weight matrices into low-rank representations and utilizes independent group-based training. This approach consistently reduces network traffic by 50% across multiple scaling scenarios, increasing the training potential on communication-constrained systems. Our method exhibits regularization effects at smaller scales, leading to improved generalization for models like VGG16, while achieving a remarkable 44.14 : 1 compression ratio during training on CIFAR-10 and maintaining competitive accuracy. Albeit promising, our experiments reveal that large batch effects remain a challenge even in low-rank training regimes.
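The core weight decomposition can be sketched as a drop-in low-rank linear layer; the rank, the initialization, and how the A/B factors are grouped and averaged across workers are assumptions here.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Hedged sketch of the low-rank factorization behind AB-training:
    the dense weight W (out x in) is replaced by A @ B with a small rank r,
    so only the two thin factors need to be synchronized between workers."""
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.A = nn.Parameter(torch.randn(out_features, rank) * 0.02)
        self.B = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        return x @ self.B.t() @ self.A.t() + self.bias

# Per-layer communication drops from out*in values to roughly rank*(out + in).
```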
https://arxiv.org/abs/2405.01067
Surgical scene simulation plays a crucial role in surgical education and simulator-based robot learning. Traditional approaches to creating these surgical scene environments involve a labor-intensive process in which designers hand-craft tissue models with textures and geometries for soft-body simulation. This manual approach is not only time-consuming but also limited in scalability and realism. In contrast, data-driven simulation offers a compelling alternative. It has the potential to automatically reconstruct 3D surgical scenes from real-world surgical video data, followed by the application of soft-body physics. This area, however, is relatively uncharted. In our research, we introduce 3D Gaussians as a learnable representation of the surgical scene, learned from stereo endoscopic video. To prevent over-fitting and ensure the geometrical correctness of these scenes, we incorporate depth supervision and anisotropy regularization into the Gaussian learning process. Furthermore, we apply the Material Point Method, integrated with physical properties, to the 3D Gaussians to achieve realistic scene deformations. Our method was evaluated on our collected in-house and public surgical video datasets. Results show that it can reconstruct and simulate surgical scenes from endoscopic videos efficiently, taking only a few minutes to reconstruct the surgical scene, and produce both visually and physically plausible deformations at a speed approaching real time. The results demonstrate the great potential of our proposed method to enhance the efficiency and variety of simulations available for surgical education and robot learning.
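One plausible form of the anisotropy regularization is to penalize Gaussians whose axis scales become too elongated; the ratio threshold and the hinge form below are assumptions, not the paper's exact term.

```python
import torch

def anisotropy_regularization(log_scales, max_ratio=3.0):
    """Hedged sketch: penalize overly elongated 3D Gaussians.
    log_scales: (N, 3) log axis scales of each Gaussian."""
    scales = log_scales.exp()
    ratio = scales.max(dim=1).values / scales.min(dim=1).values.clamp(min=1e-8)
    # Only "needle-like" Gaussians beyond the allowed ratio are penalized,
    # which helps keep the reconstructed tissue geometry well behaved.
    return torch.relu(ratio - max_ratio).mean()
```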
https://arxiv.org/abs/2405.00956
Relational database management systems (RDBMS) are widely used for the storage and retrieval of structured data. To derive insights beyond statistical aggregation, we typically have to extract specific subdatasets from the database using conventional database operations, and then apply deep neural network (DNN) training and inference on these respective subdatasets in a separate machine learning system. The process can be prohibitively expensive, especially when there are a combinatorial number of subdatasets extracted for different analytical purposes. This calls for efficient in-database support of advanced analytical methods. In this paper, we introduce LEADS, a novel SQL-aware dynamic model slicing technique to customize models for subdatasets specified by SQL queries. LEADS improves the predictive modeling of structured data via the mixture-of-experts (MoE) technique and maintains inference efficiency by a SQL-aware gating network. At the core of LEADS is the construction of a general model with multiple expert sub-models via MoE trained over the entire database. This SQL-aware MoE technique scales up the modeling capacity, enhances effectiveness, and preserves efficiency by activating only the necessary experts via the gating network during inference. Additionally, we introduce two regularization terms during the training process of LEADS to strike a balance between effectiveness and efficiency. We also design and build an in-database inference system, called INDICES, to support end-to-end advanced structured data analytics by non-intrusively incorporating LEADS onto PostgreSQL. Our extensive experiments on real-world datasets demonstrate that LEADS consistently outperforms baseline models, and INDICES delivers effective in-database analytics with a considerable reduction in inference latency compared to traditional solutions.
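A minimal sketch of a SQL-aware gating layer over a mixture of experts follows; the gate architecture, the top-k routing, and the idea of embedding the SQL predicate into a fixed-size vector are illustrative assumptions rather than LEADS's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SQLAwareMoE(nn.Module):
    """Hedged sketch: experts are trained over the whole database, and a gate
    conditioned on an embedding of the SQL query activates only a few of them
    at inference time."""
    def __init__(self, in_dim, hidden, num_experts=8, sql_dim=32, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, in_dim))
             for _ in range(num_experts)])
        self.gate = nn.Linear(sql_dim, num_experts)
        self.top_k = top_k

    def forward(self, x, sql_embedding):
        scores = self.gate(sql_embedding)             # one score per expert
        top_scores, top_idx = scores.topk(self.top_k)
        weights = F.softmax(top_scores, dim=-1)
        # Only the selected experts run, preserving inference efficiency.
        return sum(w * self.experts[i](x) for w, i in zip(weights, top_idx.tolist()))
```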
https://arxiv.org/abs/2405.00568
Data augmentation serves as a popular regularization technique to combat overfitting challenges in neural networks. While automatic augmentation has demonstrated success in image classification tasks, its application to time-series problems, particularly in long-term forecasting, has received comparatively less attention. To address this gap, we introduce a time-series automatic augmentation approach named TSAA, which is both efficient and easy to implement. The solution involves tackling the associated bilevel optimization problem through a two-step process: initially training a non-augmented model for a limited number of epochs, followed by an iterative split procedure. During this iterative process, we alternate between identifying a robust augmentation policy through Bayesian optimization and refining the model while discarding suboptimal runs. Extensive evaluations on challenging univariate and multivariate forecasting benchmark problems demonstrate that TSAA consistently outperforms several robust baselines, suggesting its potential integration into prediction pipelines.
https://arxiv.org/abs/2405.00319
We propose Soft Preference Optimization (SPO), a method for aligning generative models, such as Large Language Models (LLMs), with human preferences, without the need for a reward model. SPO optimizes model outputs directly over a preference dataset through a natural loss function that integrates preference loss with a regularization term across the model's entire output distribution rather than limiting it to the preference dataset. Although SPO does not require the assumption of an existing underlying reward model, we demonstrate that, under the Bradley-Terry (BT) model assumption, it converges to a softmax of scaled rewards, with the distribution's "softness" adjustable via the softmax exponent, an algorithm parameter. We showcase SPO's methodology, its theoretical foundation, and its comparative advantages in simplicity, computational efficiency, and alignment precision.
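A rough sketch of the objective shape described above; the logistic preference loss, the use of model-generated samples to estimate the regularizer against a reference policy, and the coefficients beta and lam are assumptions about the general form, not SPO's exact loss.

```python
import torch.nn.functional as F

def spo_style_loss(logp_chosen, logp_rejected, logp_model_on_samples,
                   logp_ref_on_samples, beta=2.0, lam=0.1):
    """Hedged sketch: preference loss on the pairwise data plus a regularizer
    evaluated on samples from the model's own output distribution.
    All inputs are summed sequence log-probabilities, shape (B,)."""
    # Prefer the chosen completion; beta plays the role of the softmax exponent
    # controlling how "soft" the resulting distribution over rewards is.
    pref = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
    # Monte-Carlo estimate of KL(model || reference) on the model's own samples,
    # i.e., regularization over the entire output distribution rather than
    # only over the preference dataset.
    reg = (logp_model_on_samples - logp_ref_on_samples).mean()
    return pref + lam * reg
```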
https://arxiv.org/abs/2405.00747
3D visual grounding is a challenging task that often requires direct and dense supervision, notably the semantic label for each object in the scene. In this paper, we instead study the naturally supervised setting that learns from only 3D scene and QA pairs, where prior works underperform. We propose the Language-Regularized Concept Learner (LARC), which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: the first is that language constraints (e.g., a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; the second is that we can query large language models to distill such constraints from language properties. We show that LARC improves performance of prior works in naturally supervised 3D visual grounding, and demonstrates a wide range of 3D visual reasoning capabilities, from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors, for learning in settings without dense supervision.
https://arxiv.org/abs/2404.19696
While Reinforcement Learning (RL) has been proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, requiring computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we here introduce the Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations' and LLM's rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation. We show the effectiveness of RCfD on three language tasks, which achieves comparable performance to carefully tuned baselines while mitigating ROO.
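The objective shift is easy to state in code; the squared-error form and the reward_model interface below are hypothetical.

```python
def rcfd_style_loss(reward_model, prompts, llm_completions, demo_completions):
    """Hedged sketch of Reward Calibration from Demonstration: match the
    demonstrations' reward level instead of maximizing reward outright."""
    r_llm = reward_model(prompts, llm_completions)             # rewards of LLM outputs
    r_demo = reward_model(prompts, demo_completions).detach()  # rewards of human demos
    # Pulling the LLM's reward toward the demonstration's reward removes the
    # incentive to exploit the reward model (reward over-optimization).
    return (r_llm - r_demo).pow(2).mean()
```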
https://arxiv.org/abs/2404.19409
This paper presents new and effective algorithms for learning kernels. In particular, as shown by our empirical results, these algorithms consistently outperform the so-called uniform combination solution that has proven to be difficult to improve upon in the past, as well as other algorithms for learning kernels based on convex combinations of base kernels in both classification and regression. Our algorithms are based on the notion of centered alignment which is used as a similarity measure between kernels or kernel matrices. We present a number of novel algorithmic, theoretical, and empirical results for learning kernels based on our notion of centered alignment. In particular, we describe efficient algorithms for learning a maximum alignment kernel by showing that the problem can be reduced to a simple QP and discuss a one-stage algorithm for learning both a kernel and a hypothesis based on that kernel using an alignment-based regularization. Our theoretical results include a novel concentration bound for centered alignment between kernel matrices, the proof of the existence of effective predictors for kernels with high alignment, both for classification and for regression, and the proof of stability-based generalization bounds for a broad family of algorithms for learning kernels based on centered alignment. We also report the results of experiments with our centered alignment-based algorithms in both classification and regression.
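For reference, centered alignment between two kernel matrices has a simple closed form; the NumPy sketch below computes it (the QP for learning the mixture weights is not shown).

```python
import numpy as np

def center_kernel(K):
    """Kc = H K H with the centering matrix H = I - (1/m) 11^T."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return H @ K @ H

def centered_alignment(K1, K2, eps=1e-12):
    """A(K1, K2) = <K1c, K2c>_F / (||K1c||_F ||K2c||_F)."""
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    return float(np.sum(K1c * K2c) /
                 (np.linalg.norm(K1c, "fro") * np.linalg.norm(K2c, "fro") + eps))

# Alignment maximization over convex combinations K = sum_k mu_k K_k of base
# kernels reduces, as the paper shows, to a simple QP in the weights mu_k.
```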
https://arxiv.org/abs/1203.0550
Neural Radiance Fields (NeRF) show impressive performance in photo-realistic free-view rendering of scenes. Recent improvements on NeRF, such as TensoRF and ZipNeRF, employ explicit models for faster optimization and rendering, compared to the implicit representation employed by the original NeRF. However, both implicit and explicit radiance fields require dense sampling of images in the given scene. Their performance degrades significantly when only a sparse set of views is available. Researchers find that supervising the depth estimated by a radiance field helps train it effectively with fewer views. The depth supervision is obtained either using classical approaches or neural networks pre-trained on a large dataset. While the former may provide only sparse supervision, the latter may suffer from generalization issues. As opposed to the earlier approaches, we seek to learn the depth supervision by designing augmented models and training them along with the main radiance field. Further, we aim to design a framework of regularizations that can work across different implicit and explicit radiance fields. We observe that certain features of these radiance field models overfit to the observed images in the sparse-input scenario. Our key finding is that reducing the capability of the radiance fields with respect to positional encoding, the number of decomposed tensor components or the size of the hash table, constrains the model to learn simpler solutions, which estimate better depth in certain regions. By designing augmented models based on such reduced capabilities, we obtain better depth supervision for the main radiance field. We achieve state-of-the-art view-synthesis performance with sparse input views on popular datasets containing forward-facing and 360$^\circ$ scenes by employing the above regularizations.
https://arxiv.org/abs/2404.19015
The scarcity of labeled data in real-world scenarios is a critical bottleneck of deep learning's effectiveness. Semi-supervised semantic segmentation has been a typical solution to achieve a desirable tradeoff between annotation cost and segmentation performance. However, previous approaches, whether based on consistency regularization or self-training, tend to neglect the contextual knowledge embedded within inter-pixel relations. This negligence leads to suboptimal performance and limited generalization. In this paper, we propose a novel approach, IPixMatch, designed to mine the neglected but valuable inter-pixel information for semi-supervised learning. Specifically, IPixMatch is constructed as an extension of the standard teacher-student network, incorporating additional loss terms to capture inter-pixel relations. It shines in low-data regimes by efficiently leveraging the limited labeled data and extracting maximum utility from the available unlabeled data. Furthermore, IPixMatch can be integrated seamlessly into most teacher-student frameworks without the need for model modification or additional components. Our straightforward IPixMatch method demonstrates consistent performance improvements across various benchmark datasets under different partitioning protocols.
https://arxiv.org/abs/2404.18891
In this paper, we present a different way to use two modalities, in which either one modality or the other is seen by a single model. This can be useful when adapting a unimodal model to leverage more information while respecting a limited computational budget. This would mean having a single model that is able to deal with any modality. To describe this, we coined the term anymodal learning. An example is a surveillance use case: when the lights are off in a room, the infrared modality is much more valuable, while the visible modality provides more discriminative information when the lights are on. This work investigates how to efficiently leverage visible and infrared/thermal modalities for a transformer-based object detection backbone to create an anymodal architecture. Our work does not create any inference overhead during testing while exploring an effective way to exploit the two modalities during training. To accomplish such a task, we introduce a novel anymodal training technique, Mixed Patches (MiPa), in conjunction with a patch-wise domain-agnostic module, which is responsible for learning the best way to find a common representation of both modalities. This approach proves able to balance the modalities, reaching results on individual-modality benchmarks that are competitive with the alternative of using a unimodal architecture, on three different visible-infrared object detection datasets. Finally, our proposed method, when used as a regularization for the strongest modality, can beat the performance of multimodal fusion methods while only requiring a single modality during inference. Notably, MiPa became the state-of-the-art on the LLVIP visible/infrared benchmark. Code: this https URL
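The patch-mixing step itself is simple to sketch; the patch size, the Bernoulli sampling of which modality each patch comes from, and the assumption of spatially aligned image pairs are illustrative choices here.

```python
import torch

def mix_patches(visible, infrared, patch=16, p_infrared=0.5):
    """Hedged sketch of a MiPa-style mixed-patch input: each non-overlapping
    patch is taken from either the visible or the infrared image, so a single
    backbone sees both modalities within one training sample.
    visible, infrared: (B, C, H, W) aligned pairs, H and W divisible by patch."""
    b, c, h, w = visible.shape
    take_ir = (torch.rand(b, 1, h // patch, w // patch, device=visible.device)
               < p_infrared).float()
    mask = take_ir.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return infrared * mask + visible * (1 - mask)
```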
https://arxiv.org/abs/2404.18849
Fine-tuning large language models (LLMs) in one or more phases has become a necessary step to unlock various capabilities, enabling LLMs to follow natural language instructions or align with human preferences. However, it carries the risk of catastrophic forgetting during sequential training: the parametric knowledge or abilities learned in previous stages may be overwhelmed by incoming training data. In this paper, we find that by regularly resetting partial parameters, LLMs can restore some of the original knowledge. Inspired by this, we introduce Half Fine-Tuning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate forgetting, where half of the parameters are selected to learn new tasks while the other half are frozen to retain previous knowledge. We provide a feasibility analysis from the perspective of optimization and interpret the parameter selection operation as a regularization term. Without changing the model architecture, HFT can be seamlessly integrated into existing fine-tuning frameworks. Extensive experiments and analysis on supervised fine-tuning, direct preference optimization, and continual learning consistently demonstrate the effectiveness, robustness, and efficiency of HFT. Compared with FFT, HFT not only significantly alleviates the forgetting problem, but also achieves the best performance in a series of downstream benchmarks, with an approximately 30% reduction in training time.
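A minimal sketch of the half-freezing step follows; selecting parameter tensors uniformly at random is an assumption, and the paper's actual selection strategy may differ.

```python
import torch

def freeze_half(model, seed=0):
    """Hedged sketch of Half Fine-Tuning: freeze roughly half of the parameter
    tensors so they retain earlier-stage knowledge, and train the rest."""
    params = list(model.named_parameters())
    order = torch.randperm(len(params), generator=torch.Generator().manual_seed(seed))
    frozen = set(order[: len(params) // 2].tolist())
    for idx, (name, p) in enumerate(params):
        p.requires_grad_(idx not in frozen)   # frozen half keeps previous knowledge
    return [params[i][0] for i in sorted(frozen)]  # names of the frozen tensors
```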
https://arxiv.org/abs/2404.18466
Accurately simulating diverse behaviors of heterogeneous agents in various scenarios is fundamental to autonomous driving simulation. This task is challenging due to the multi-modality of behavior distribution, the high-dimensionality of driving scenarios, distribution shift, and incomplete information. Our first insight is to leverage state-matching through differentiable simulation to provide meaningful learning signals and achieve efficient credit assignment for the policy. This is demonstrated by revealing the existence of gradient highways and inter-agent gradient pathways. However, the issues of gradient explosion and weak supervision in low-density regions are discovered. Our second insight is that these issues can be addressed by applying dual policy regularizations to narrow the function space. Further considering diversity, our third insight is that the behaviors of heterogeneous agents in the dataset can be effectively compressed as a series of prototype vectors for retrieval. These lead to our model-based reinforcement-imitation learning framework with temporally abstracted mixture-of-codebooks (MRIC). MRIC introduces the open-loop model-based imitation learning regularization to stabilize training, and model-based reinforcement learning (RL) regularization to inject domain knowledge. The RL regularization involves differentiable Minkowski-difference-based collision avoidance and projection-based on-road and traffic rule compliance rewards. A dynamic multiplier mechanism is further proposed to eliminate the interference from the regularizations while ensuring their effectiveness. Experimental results using the large-scale Waymo open motion dataset show that MRIC outperforms state-of-the-art baselines on diversity, behavioral realism, and distributional realism, with large margins on some key metrics (e.g., collision rate, minSADE, and time-to-collision JSD).
https://arxiv.org/abs/2404.18464
MTL is a learning paradigm that effectively leverages both task-specific and shared information to address multiple related tasks simultaneously. In contrast to STL, MTL offers a suite of benefits that enhance both the training process and the inference efficiency. MTL's key advantages encompass streamlined model architecture, performance enhancement, and cross-domain generalizability. Over the past twenty years, MTL has become widely recognized as a flexible and effective approach in various fields, including CV, NLP, recommendation systems, disease prognosis and diagnosis, and robotics. This survey provides a comprehensive overview of the evolution of MTL, encompassing the technical aspects of cutting-edge methods from traditional approaches to deep learning and the latest trend of pretrained foundation models. Our survey methodically categorizes MTL techniques into five key areas: regularization, relationship learning, feature propagation, optimization, and pre-training. This categorization not only chronologically outlines the development of MTL but also dives into various specialized strategies within each category. Furthermore, the survey reveals how the MTL evolves from handling a fixed set of tasks to embracing a more flexible approach free from task or modality constraints. It explores the concepts of task-promptable and -agnostic training, along with the capacity for ZSL, which unleashes the untapped potential of this historically coveted learning paradigm. Overall, we hope this survey provides the research community with a comprehensive overview of the advancements in MTL from its inception in 1997 to the present in 2023. We address present challenges and look ahead to future possibilities, shedding light on the opportunities and potential avenues for MTL research in a broad manner. This project is publicly available at this https URL.
https://arxiv.org/abs/2404.18961
Multi-robot simultaneous localization and mapping (SLAM) enables a robot team to achieve coordinated tasks relying on a common map. However, centralized processing of robot observations is undesirable because it creates a single point of failure and requires pre-existing infrastructure and significant multi-hop communication throughput. This paper formulates multi-robot object SLAM as a variational inference problem over a communication graph. We impose a consensus constraint on the objects maintained by different nodes to ensure agreement on a common map. To solve the problem, we develop a distributed mirror descent algorithm with a regularization term enforcing consensus. Using Gaussian distributions in the algorithm, we derive a distributed multi-state constraint Kalman filter (MSCKF) for multi-robot object SLAM. Experiments on real and simulated data show that our method improves the trajectory and object estimates, compared to individual-robot SLAM, while achieving better scaling to large robot teams, compared to centralized multi-robot SLAM. Code is available at this https URL.
https://arxiv.org/abs/2404.18331
This paper serves to introduce the Align, Minimize and Diversify (AMD) method, a Source-Free Unsupervised Domain Adaptation approach for Handwritten Text Recognition (HTR). This framework decouples the adaptation process from the source data, thus not only sidestepping the resource-intensive retraining process but also making it possible to leverage the wealth of pre-trained knowledge encoded in modern Deep Learning architectures. Our method explicitly eliminates the need to revisit the source data during adaptation by incorporating three distinct regularization terms: the Align term, which reduces the feature distribution discrepancy between source and target data, ensuring the transferability of the pre-trained representation; the Minimize term, which encourages the model to make assertive predictions, pushing the outputs towards one-hot-like distributions in order to minimize prediction uncertainty; and finally, the Diversify term, which safeguards against degeneracy in predictions by promoting varied and distinctive sequences throughout the target data, preventing informational collapse. Experimental results on several benchmarks demonstrate the effectiveness and robustness of AMD, showing it to be competitive with, and often outperform, DA methods in HTR.
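The three terms can be sketched roughly as follows; using stored source feature statistics for the Align term, per-frame entropy for Minimize, and negative entropy of the batch-averaged distribution for Diversify are assumptions about the general shape of each regularizer, not the paper's exact definitions.

```python
import torch

def amd_style_terms(target_feats, src_feat_mean, src_feat_std, logits):
    """Hedged sketch of Align / Minimize / Diversify regularizers for source-free
    adaptation of an HTR model. logits: (B, T, V) per-timestep character logits."""
    # Align: match target feature statistics to pre-computed source statistics,
    # so the source data itself never needs to be revisited.
    align = (target_feats.mean(0) - src_feat_mean).pow(2).mean() \
          + (target_feats.std(0) - src_feat_std).pow(2).mean()

    probs = logits.softmax(dim=-1)
    # Minimize: per-frame entropy, pushing outputs toward one-hot predictions.
    minimize = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()

    # Diversify: keep the batch-averaged distribution spread out (high entropy),
    # preventing collapse onto a few characters.
    mean_probs = probs.mean(dim=(0, 1))
    diversify = (mean_probs * mean_probs.clamp_min(1e-8).log()).sum()

    return align, minimize, diversify
```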
https://arxiv.org/abs/2404.18260
Continual learning (CL) remains one of the long-standing challenges for deep neural networks due to catastrophic forgetting of previously acquired knowledge. Although rehearsal-based approaches have been fairly successful in mitigating catastrophic forgetting, they suffer from overfitting on buffered samples and prior information loss, hindering generalization under low-buffer regimes. Inspired by how humans learn using strong inductive biases, we propose IMEX-Reg to improve the generalization performance of experience rehearsal in CL under low buffer regimes. Specifically, we employ a two-pronged implicit-explicit regularization approach using contrastive representation learning (CRL) and consistency regularization. To further leverage the global relationship between representations learned using CRL, we propose a regularization strategy to guide the classifier toward the activation correlations in the unit hypersphere of the CRL. Our results show that IMEX-Reg significantly improves generalization performance and outperforms rehearsal-based approaches in several CL scenarios. It is also robust to natural and adversarial corruptions with less task-recency bias. Additionally, we provide theoretical insights to support our design decisions further.
https://arxiv.org/abs/2404.18161