Event-based data are commonly encountered in edge computing environments where efficiency and low latency are critical. To interface with such data and leverage their rich temporal features, we propose a causal spatiotemporal convolutional network. The solution targets efficient implementation on resource-limited, edge-appropriate hardware in three ways: 1) it deliberately adopts a simple architecture and set of operations (convolutions, ReLU activations); 2) it can be configured to perform online inference efficiently via buffering of layer outputs; 3) it can achieve more than 90% activation sparsity through regularization during training, enabling very significant efficiency gains on event-based processors. In addition, we propose a general affine augmentation strategy acting directly on the events, which alleviates the problem of dataset scarcity for event-based systems. We apply our model to the AIS 2024 event-based eye tracking challenge, reaching a p10 accuracy of 0.9916 on the Kaggle private test set.
https://arxiv.org/abs/2404.08858
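The buffering idea in point 2 above can be pictured as a small ring buffer of recent inputs per layer, so that each new time step costs a single kernel application. The sketch below is illustrative only, not the authors' implementation; the depthwise weight layout and single-channel usage are assumptions.

```python
import numpy as np

class BufferedCausalConv1d:
    """Streaming causal temporal convolution: each call consumes one
    time step and reuses buffered past inputs, so online inference
    costs one kernel application per step (illustrative sketch)."""

    def __init__(self, weights):
        # weights: (kernel_size, channels), depthwise for simplicity;
        # row 0 multiplies the oldest buffered frame
        self.weights = np.asarray(weights, dtype=float)
        k, c = self.weights.shape
        self.buffer = np.zeros((k, c))  # holds the last k input frames

    def step(self, x):
        # shift the buffer one step back in time, append the newest frame
        self.buffer = np.roll(self.buffer, -1, axis=0)
        self.buffer[-1] = x
        # causal convolution over the buffered history, then ReLU
        return np.maximum((self.buffer * self.weights).sum(axis=0), 0.0)
```

Feeding the stream one frame at a time reproduces the output an offline causal convolution would give over the full sequence, which is what makes the buffered form attractive for online inference.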
To make sense of their surroundings, intelligent systems must transform complex sensory inputs into structured codes that are reduced to task-relevant information such as object category. Biological agents achieve this in a largely autonomous manner, presumably via self-supervised learning. Whereas previous attempts to model the underlying mechanisms were largely discriminative in nature, there is ample evidence that the brain employs a generative model of the world. Here, we propose that eye movements, in combination with the focused nature of primate vision, constitute a generative, self-supervised task of predicting and revealing visual information. We construct a proof-of-principle model starting from the framework of masked image modeling (MIM), a common approach in deep representation learning. To do so, we analyze how core components of MIM, such as the masking technique and data augmentation, influence the formation of category-specific representations. This allows us not only to better understand the principles behind MIM, but also to reassemble a MIM more in line with the focused nature of biological perception. From a theoretical angle, we find that MIM disentangles neurons in latent space without explicit regularization, a property that has been suggested to structure visual representations in primates. Together with previous findings on invariance learning, this highlights an interesting connection between MIM and latent regularization approaches for self-supervised learning. The source code is available under this https URL
https://arxiv.org/abs/2404.08526
We study the problem of self-supervised 3D scene flow estimation from real, large-scale raw point cloud sequences, which is crucial to tasks such as trajectory prediction and instance segmentation. In the absence of ground-truth scene flow labels, contemporary approaches concentrate on optimizing flow across sequential pairs of point clouds by incorporating structure-based regularization on flow and object rigidity. The rigid objects are estimated by a variety of 3D spatial clustering methods. While state-of-the-art methods successfully capture overall scene motion using the Neural Prior structure, they encounter challenges in discerning multi-object motions. We identified the structural constraints and the use of large, strict rigid clusters as the main pitfalls of current approaches, and we propose a novel clustering approach that allows for a combination of overlapping soft clusters and non-overlapping rigid clusters. Flow is then estimated jointly with progressively growing non-overlapping rigid clusters together with fixed-size overlapping soft clusters. We evaluate our method on multiple datasets with LiDAR point clouds, demonstrating superior performance over self-supervised baselines and reaching new state-of-the-art results. Our method especially excels at resolving flow in complicated dynamic scenes with multiple independently moving objects close to each other, including pedestrians, cyclists, and other vulnerable road users. Our codes will be publicly available.
https://arxiv.org/abs/2404.08363
Neural implicit k-space representations have shown promising results for dynamic MRI at high temporal resolutions. Yet, their exclusive training in k-space limits the application of common image regularization methods to improve the final reconstruction. In this work, we introduce the concept of parallel imaging-inspired self-consistency (PISCO), which we incorporate as novel self-supervised k-space regularization enforcing a consistent neighborhood relationship. At no additional data cost, the proposed regularization significantly improves neural implicit k-space reconstructions on simulated data. Abdominal in-vivo reconstructions using PISCO result in enhanced spatio-temporal image quality compared to state-of-the-art methods. Code is available at this https URL.
https://arxiv.org/abs/2404.08350
The latest regularized Neural Radiance Field (NeRF) approaches produce poor geometry and view extrapolation for multiview stereo (MVS) benchmarks such as ETH3D. In this paper, we aim to create 3D models that provide accurate geometry and view synthesis, partially closing the large geometric performance gap between NeRF and traditional MVS methods. We propose a patch-based approach that effectively leverages monocular surface normal and relative depth predictions. The patch-based ray sampling also enables the appearance regularization of normalized cross-correlation (NCC) and structural similarity (SSIM) between randomly sampled virtual and training views. We further show that "density restrictions" based on sparse structure-from-motion points can help greatly improve geometric accuracy with a slight drop in novel view synthesis metrics. Our experiments show 4x the performance of RegNeRF and 8x that of FreeNeRF on average F1@2cm for the ETH3D MVS benchmark, suggesting a fruitful research direction to improve the geometric accuracy of NeRF-based models, and shedding light on a potential future approach to enable NeRF-based optimization to eventually outperform traditional MVS.
https://arxiv.org/abs/2404.08252
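The NCC term used above for appearance regularization between virtual and training views is the correlation of mean-centered patch intensities; a loss of the form 1 - NCC rewards photometric agreement up to affine intensity changes. A minimal sketch, where the patch shapes and the `eps` stabilizer are assumptions rather than details from the paper:

```python
import numpy as np

def ncc(patch_a, patch_b, eps=1e-8):
    """Normalized cross-correlation between two image patches:
    +1 for identical (up to gain/offset), -1 for inverted."""
    a = np.asarray(patch_a, dtype=float)
    b = np.asarray(patch_b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + eps))

def ncc_loss(patch_a, patch_b):
    """Appearance-regularization style loss: zero when patches match."""
    return 1.0 - ncc(patch_a, patch_b)
```

Because the patches are mean-centered and norm-divided, the term is invariant to per-patch brightness and contrast shifts, which is why NCC is a common photometric consistency measure across views.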
Learning-based neural network (NN) control policies have shown impressive empirical performance in a wide range of tasks in robotics and control. However, formal (Lyapunov) stability guarantees over the region-of-attraction (ROA) for NN controllers with nonlinear dynamical systems are challenging to obtain, and most existing approaches rely on expensive solvers such as sums-of-squares (SOS), mixed-integer programming (MIP), or satisfiability modulo theories (SMT). In this paper, we demonstrate a new framework for learning NN controllers together with Lyapunov certificates using fast empirical falsification and strategic regularizations. We propose a novel formulation that defines a larger verifiable region-of-attraction (ROA) than shown in the literature, and refines the conventional restrictive constraints on Lyapunov derivatives to focus only on certifiable ROAs. The Lyapunov condition is rigorously verified post-hoc using branch-and-bound with scalable linear bound propagation-based NN verification techniques. The approach is efficient and flexible, and the full training and verification procedure is accelerated on GPUs without relying on expensive solvers for SOS, MIP, nor SMT. The flexibility and efficiency of our framework allow us to demonstrate Lyapunov-stable output feedback control with synthesized NN-based controllers and NN-based observers with formal stability guarantees, for the first time in literature. Source code at this https URL.
https://arxiv.org/abs/2404.07956
The emergence of various adapters, including Low-Rank Adaptation (LoRA) applied from the field of natural language processing, has allowed diffusion models to personalize image generation at a low cost. However, due to various challenges, including limited datasets and a shortage of regularization and computation resources, adapter training often results in unsatisfactory outcomes, leading to the corruption of the backbone model's prior knowledge. One well-known phenomenon is the loss of diversity in object generation, especially within the same class, which leads to generating almost identical objects with minor variations. This poses challenges for generation capabilities. To solve this issue, we present Contrastive Adapter Training (CAT), a simple yet effective strategy to enhance adapter training through the application of the CAT loss. Our approach facilitates the preservation of the base model's original knowledge when the model initiates adapters. Furthermore, we introduce the Knowledge Preservation Score (KPS) to evaluate CAT's ability to keep the former information. We qualitatively and quantitatively compare CAT's improvement. Finally, we mention the possibility of CAT in the aspects of multi-concept adapters and optimization.
https://arxiv.org/abs/2404.07554
Downsampling operators break the shift invariance of convolutional neural networks (CNNs), and this affects the robustness of features learned by CNNs when dealing with even small pixel-level shifts. Through a large-scale correlation analysis framework, we study shift invariance of CNNs by inspecting existing downsampling operators in terms of their maximum-sampling bias (MSB), and find that MSB is negatively correlated with shift invariance. Based on this crucial insight, we propose a learnable pooling operator called Translation Invariant Polyphase Sampling (TIPS) and two regularizations on the intermediate feature maps of TIPS to reduce MSB and learn translation-invariant representations. TIPS can be integrated into any CNN and can be trained end-to-end with marginal computational overhead. Our experiments demonstrate that TIPS results in consistent performance gains in terms of accuracy, shift consistency, and shift fidelity on multiple benchmarks for image classification and semantic segmentation compared to previous methods, and also leads to improvements in adversarial and distributional robustness. TIPS results in the lowest MSB compared to all previous methods, thus explaining our strong empirical results.
https://arxiv.org/abs/2404.07410
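The polyphase view behind MSB and TIPS can be sketched concretely: downsampling with stride s amounts to picking one of the s x s candidate sub-grids (phases), and operators that always keep the maximum-norm phase introduce maximum-sampling bias. One way to reduce that bias is to blend phases with learned weights. This is a simplified single-channel sketch under those assumptions, not the TIPS operator itself, which learns richer spatially varying weights.

```python
import numpy as np

def polyphase_components(x, stride=2):
    """Split a 2D feature map into its stride x stride polyphase
    components: the candidate downsampled grids."""
    return [x[i::stride, j::stride]
            for i in range(stride) for j in range(stride)]

def soft_polyphase_pool(x, logits, stride=2):
    """Illustrative translation-tolerant pooling: blend polyphase
    components with learned softmax weights instead of always taking
    the max-norm phase (the source of maximum-sampling bias)."""
    phases = polyphase_components(x, stride)
    w = np.exp(logits - logits.max())
    w = w / w.sum()  # softmax over the stride*stride phases
    return sum(wi * p for wi, p in zip(w, phases))
```

With uniform logits the operator averages all phases, the least phase-biased choice; during training the logits could move toward whichever mixture best preserves accuracy.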
The increasing demand for virtual reality applications has highlighted the significance of crafting immersive 3D assets. We present a text-to-3D 360$^{\circ}$ scene generation pipeline that facilitates the creation of comprehensive 360$^{\circ}$ scenes for in-the-wild environments in a matter of minutes. Our approach utilizes the generative power of a 2D diffusion model and prompt self-refinement to create a high-quality and globally coherent panoramic image. This image acts as a preliminary "flat" (2D) scene representation. Subsequently, it is lifted into 3D Gaussians, employing splatting techniques to enable real-time exploration. To produce consistent 3D geometry, our pipeline constructs a spatially coherent structure by aligning the 2D monocular depth into a globally optimized point cloud. This point cloud serves as the initial state for the centroids of the 3D Gaussians. To address the invisible-region issues inherent in single-view inputs, we impose semantic and geometric constraints on both synthesized and input camera views as regularizations. These guide the optimization of the Gaussians, aiding in the reconstruction of unseen regions. In summary, our method offers a globally consistent 3D scene within a 360$^{\circ}$ perspective, providing an enhanced immersive experience over existing techniques. Project website at: this http URL
https://arxiv.org/abs/2404.06903
Previous work in Neural Loss Function Search (NLFS) has shown a lack of correlation between smaller surrogate functions and large convolutional neural networks with massive regularization. We expand upon this research by revealing another disparity that exists: correlation between different types of image augmentation techniques. We show that different loss functions can perform well on certain image augmentation techniques while performing poorly on others. We exploit this disparity by performing an evolutionary search on five types of image augmentation techniques in the hope of finding image-augmentation-specific loss functions. The best loss functions from each evolution were then transferred to WideResNet-28-10 on CIFAR-10 and CIFAR-100 across each of the five image augmentation techniques. The best of those were then evaluated by fine-tuning EfficientNetV2Small on the CARS, Oxford-Flowers, and Caltech datasets across each of the five image augmentation techniques. Multiple loss functions were found that outperformed cross-entropy across multiple experiments. In the end, we found a single loss function, which we call the inverse Bessel logarithm loss, that was able to outperform cross-entropy across the majority of experiments.
https://arxiv.org/abs/2404.06633
Offline Reinforcement Learning (RL) faces distributional shift and unreliable value estimation, especially for out-of-distribution (OOD) actions. To address this, existing uncertainty-based methods penalize the value function with uncertainty quantification and demand numerous ensemble networks, posing computational challenges and suboptimal outcomes. In this paper, we introduce a novel strategy employing diverse randomized value functions to estimate the posterior distribution of $Q$-values. It provides robust uncertainty quantification and estimates lower confidence bounds (LCB) of $Q$-values. By applying moderate value penalties for OOD actions, our method fosters a provably pessimistic approach. We also emphasize diversity within the randomized value functions and enhance efficiency by introducing a diversity regularization method, reducing the requisite number of networks. These modules lead to reliable value estimation and efficient policy learning from offline data. Theoretical analysis shows that our method recovers the provably efficient LCB penalty under linear MDP assumptions. Extensive empirical results also demonstrate that our proposed method significantly outperforms baseline methods in terms of performance and parametric efficiency.
https://arxiv.org/abs/2404.06188
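The LCB estimate from an ensemble of randomized value functions can be sketched as mean-minus-std over the ensemble's $Q$-predictions; actions with high ensemble disagreement (typically OOD ones) then receive a lower, pessimistic value. The `beta` pessimism coefficient and the plain mean/std form are assumptions of this sketch, not the paper's exact estimator.

```python
import numpy as np

def lcb_q_value(q_ensemble, beta=1.0):
    """Lower confidence bound of Q from an ensemble of randomized
    value functions: ensemble mean minus beta times the ensemble
    standard deviation, per action."""
    q = np.asarray(q_ensemble, dtype=float)  # shape: (n_networks, n_actions)
    return q.mean(axis=0) - beta * q.std(axis=0)
```

A policy trained against this LCB prefers actions the ensemble agrees on, which is the pessimism mechanism the abstract describes.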
Within colorectal cancer diagnostics, conventional colonoscopy techniques face critical limitations, including a limited field of view and a lack of depth information, which can impede the detection of precancerous lesions. Current methods struggle to provide comprehensive and accurate 3D reconstructions of the colonic surface, which could help minimize missed regions and reinspection for pre-cancerous polyps. Addressing this, we introduce 'Gaussian Pancakes', a method that leverages 3D Gaussian Splatting (3D GS) combined with a Recurrent Neural Network-based Simultaneous Localization and Mapping (RNNSLAM) system. By introducing geometric and depth regularization into the 3D GS framework, our approach ensures more accurate alignment of Gaussians with the colon surface, resulting in smoother 3D reconstructions with novel viewing of detailed textures and structures. Evaluations across three diverse datasets show that Gaussian Pancakes enhances novel view synthesis quality, surpassing current leading methods with an 18% boost in PSNR and a 16% improvement in SSIM. It also delivers over 100X faster rendering and more than 10X shorter training times, making it a practical tool for real-time applications. Hence, this holds promise for achieving clinical translation for better detection and diagnosis of colorectal cancer.
https://arxiv.org/abs/2404.06128
In recent years, Graph Neural Networks (GNNs) have made significant advancements, particularly in tasks such as node classification, link prediction, and graph representation. However, challenges arise from biases that can be hidden not only in the node attributes but also in the connections between entities. Therefore, ensuring fairness in graph neural network learning has become a critical problem. To address this issue, we propose a novel model for training fairness-aware GNN, which enhances the Counterfactual Augmented Fair Graph Neural Network Framework (CAF). Our approach integrates Supervised Contrastive Loss and Environmental Loss to enhance both accuracy and fairness. Experimental validation on three real datasets demonstrates the superiority of our proposed model over CAF and several other existing graph-based learning methods.
https://arxiv.org/abs/2404.06090
This study aims to establish a computer-aided diagnosis system for endobronchial ultrasound (EBUS) surgery to assist physicians in the preliminary diagnosis of metastatic cancer. This involves arranging immediate examinations for other sites of metastatic cancer after EBUS surgery, eliminating the need to wait for reports, thereby shortening the waiting time by more than half and enabling patients to detect other cancers earlier, allowing for early planning and implementation of treatment plans. Unlike previous studies on cell image classification, which have abundant datasets for training, this study must also be able to make effective classifications despite the limited amount of case data for lung metastatic cancer. In the realm of small-dataset classification methods, Few-shot learning (FSL) has become mainstream in recent years. Through its ability to train on small datasets and its strong generalization capabilities, FSL shows potential in this task of lung metastatic cell image classification. This study adopts the approach of Few-shot learning, referencing existing proposed models and designing a model architecture for classifying lung metastasis cell images. Batch Spectral Regularization (BSR) is incorporated as a loss update parameter, and the Finetune method of PMF is modified. In terms of test results, the addition of BSR and the modified Finetune method further increases the accuracy by 8.89%, to 65.60%, outperforming other FSL methods. This study confirms that FSL is superior to supervised and transfer learning in classifying metastatic cancer and demonstrates that using BSR as a loss function and modifying Finetune can enhance the model's capabilities.
https://arxiv.org/abs/2404.06080
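Batch Spectral Regularization, as commonly described, penalizes the squared singular values of the batch feature matrix during fine-tuning, suppressing directions the features over-concentrate in. A minimal sketch of that penalty; the optional `k` truncation and the use of raw features (rather than a particular layer's output) are assumptions of this sketch.

```python
import numpy as np

def batch_spectral_regularization(features, k=None):
    """BSR-style penalty: the sum of squared singular values of the
    batch feature matrix (optionally only the largest k), to be
    added to the task loss with a small weight during fine-tuning."""
    s = np.linalg.svd(np.asarray(features, dtype=float), compute_uv=False)
    if k is not None:
        s = s[:k]  # singular values are returned in descending order
    return float((s ** 2).sum())
```

Note that with `k=None` the penalty equals the squared Frobenius norm of the feature matrix; restricting to the top-k singular values targets only the dominant spectral directions.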
With the rapid development of XR, 3D generation and editing are becoming more and more important, among which stylization is an important tool for 3D appearance editing. Given a single reference style image, it can achieve consistent 3D artistic stylization and is thus a user-friendly editing approach. However, recent NeRF-based 3D stylization methods face efficiency issues that affect the actual user experience, and their implicit nature limits their ability to transfer geometric pattern styles. Additionally, the ability for artists to exert flexible control over stylized scenes is considered highly desirable, fostering an environment conducive to creative exploration. In this paper, we introduce StylizedGS, a 3D neural style transfer framework with adaptable control over perceptual factors, based on the 3D Gaussian Splatting (3DGS) representation. The 3DGS brings the benefit of high efficiency. We propose a GS filter to eliminate, prior to stylization, floaters in the reconstruction that would otherwise degrade the stylization effects. A nearest-neighbor-based style loss is then introduced to achieve stylization by fine-tuning the geometry and color parameters of the 3DGS, while a depth preservation loss with other regularizations is proposed to prevent tampering with the geometric content. Moreover, facilitated by specially designed losses, StylizedGS enables users to control color, stylized scale, and regions during stylization, providing customized capabilities. Our method can attain high-quality stylization results characterized by faithful brushstrokes and geometric consistency with flexible controls. Extensive experiments across various scenes and styles demonstrate the effectiveness and efficiency of our method concerning both stylization quality and inference FPS.
https://arxiv.org/abs/2404.05220
The advancement of deep learning has led to the emergence of Mixture-of-Experts (MoEs) models, known for their dynamic allocation of computational resources based on input. Despite their promise, MoEs face challenges, particularly in terms of memory requirements. To address this, our work introduces SEER-MoE, a novel two-stage framework for reducing both the memory footprint and compute requirements of pre-trained MoE models. The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss and reduce the number of activated experts during inference. Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoEs model optimized for inference efficiency with minimal accuracy trade-offs.
https://arxiv.org/abs/2404.05089
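The first SEER-MoE stage can be pictured as counting how often the router activates each expert over a calibration set and keeping only the heavy hitters. This is a heavily simplified sketch under that reading; the paper's exact counting guidance and the `keep` parameter are assumptions.

```python
from collections import Counter

def prune_experts_by_usage(routing_history, keep):
    """Heavy-hitters-style pruning sketch: count router selections per
    expert over a calibration pass and retain the `keep` most
    frequently activated experts (returned as sorted expert ids)."""
    counts = Counter(routing_history)
    return sorted(expert for expert, _ in counts.most_common(keep))
```

The surviving expert ids would then index into the pruned model, with the second-stage regularized fine-tuning recovering accuracy and reducing how many of the remaining experts fire at inference time.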
Adapting a medical image segmentation model to a new domain is important for improving its cross-domain transferability, and due to the expensive annotation process, Unsupervised Domain Adaptation (UDA) is appealing, as only unlabeled images are needed for the adaptation. Existing UDA methods are mainly based on image or feature alignment with adversarial training for regularization, and they are limited by insufficient supervision in the target domain. In this paper, we propose an enhanced Filtered Pseudo Label (FPL+)-based UDA method for 3D medical image segmentation. It first uses cross-domain data augmentation to translate labeled images in the source domain into a dual-domain training set consisting of a pseudo source-domain set and a pseudo target-domain set. To leverage the dual-domain augmented images to train a pseudo label generator, domain-specific batch normalization layers are used to deal with the domain shift while learning the domain-invariant structure features, generating high-quality pseudo labels for target-domain images. We then combine labeled source-domain images and target-domain images with pseudo labels to train a final segmentor, where image-level weighting based on uncertainty estimation and pixel-level weighting based on dual-domain consensus are proposed to mitigate the adverse effect of noisy pseudo labels. Experiments on three public multi-modal datasets for Vestibular Schwannoma, brain tumor, and whole heart segmentation show that our method surpassed ten state-of-the-art UDA methods, and it even achieved better results than fully supervised learning in the target domain in some cases.
https://arxiv.org/abs/2404.04971
To align mobile robot navigation policies with user preferences through reinforcement learning from human feedback (RLHF), reliable and behavior-diverse user queries are required. However, deterministic policies fail to generate a variety of navigation trajectory suggestions for a given navigation task configuration. We introduce EnQuery, a query generation approach using an ensemble of policies that achieve behavioral diversity through a regularization term. For a given navigation task, EnQuery produces multiple navigation trajectory suggestions, thereby optimizing the efficiency of preference data collection with fewer queries. Our methodology demonstrates superior performance in aligning navigation policies with user preferences in low-query regimes, offering enhanced policy convergence from sparse preference queries. The evaluation is complemented with a novel explainability representation, capturing full scene navigation behavior of the mobile robot in a single plot.
https://arxiv.org/abs/2404.04852
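The abstract does not specify EnQuery's regularization term; one hypothetical form of an ensemble-diversity regularizer is the negative mean pairwise distance between the actions the ensemble members propose for the same state, so that minimizing the total loss pushes the policies apart. Everything in this sketch (the L2 metric, the mean aggregation) is an illustrative assumption, not the paper's formulation.

```python
import numpy as np

def diversity_regularizer(actions):
    """Hypothetical ensemble-diversity term: negative mean pairwise
    L2 distance between the actions proposed by ensemble members
    for the same state; lower values mean more behavioral diversity."""
    a = np.asarray(actions, dtype=float)  # shape: (n_policies, action_dim)
    n = len(a)
    dists = [np.linalg.norm(a[i] - a[j])
             for i in range(n) for j in range(i + 1, n)]
    return -float(np.mean(dists))
```

Added to each policy's training objective with a small weight, such a term would yield the behaviorally diverse trajectory suggestions the query-generation step needs.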
Gradient-based saliency maps have been widely used to explain the decisions of deep neural network classifiers. However, standard gradient-based interpretation maps, including the simple gradient and integrated gradient algorithms, often lack desired structures such as sparsity and connectedness in their application to real-world computer vision models. A frequently used approach to inducing sparsity structures in gradient-based saliency maps is to alter the simple gradient scheme using sparsification or norm-based regularization. A drawback of such post-processing methods is a frequently observed, significant loss of fidelity to the original simple gradient map. In this work, we propose to apply adversarial training as an in-processing scheme to train neural networks with structured simple gradient maps. We show a duality relation between the regularized norms of the adversarial perturbations and gradient-based maps, based on which we design adversarial training loss functions promoting sparsity and group-sparsity properties in simple gradient maps. We present several numerical results to show the influence of our proposed norm-based adversarial training methods on the standard gradient-based maps of standard neural network architectures on benchmark image datasets.
https://arxiv.org/abs/2404.04647
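The baseline object of study, the simple gradient map, is just the (normalized) absolute input gradient of the class score; this is the map the proposed norm-based adversarial training aims to leave sparse and connected without a post-processing step. A minimal sketch, taking a precomputed gradient array as input; the normalization for display and the sparsity measure are conventions assumed here, not the paper's definitions.

```python
import numpy as np

def simple_gradient_map(grad):
    """Simple-gradient saliency: the absolute input gradient of the
    class score, normalized to [0, 1] for visualization."""
    g = np.abs(np.asarray(grad, dtype=float))
    return g / (g.max() + 1e-12)

def sparsity_fraction(saliency, threshold=0.05):
    """Fraction of near-zero entries, a crude proxy for how sparse
    (and thus how structured) a saliency map is."""
    return float((saliency < threshold).mean())
```

Post-processing methods sparsify this map after the fact and lose fidelity to it; the in-processing adversarial training instead shapes the network so that the map above is already structured.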
3D semantic occupancy prediction is a pivotal task in the field of autonomous driving. Recent approaches have made great advances in 3D semantic occupancy prediction on a single modality. However, multi-modal semantic occupancy prediction approaches have encountered difficulties in dealing with the modality heterogeneity, modality misalignment, and insufficient modality interactions that arise when fusing data from different modalities, which may result in the loss of important geometric and semantic information. This letter presents a novel multi-modal, i.e., LiDAR-camera 3D semantic occupancy prediction framework, dubbed Co-Occ, which couples explicit LiDAR-camera feature fusion with implicit volume rendering regularization. The key insight is that volume rendering in the feature space can proficiently bridge the gap between 3D LiDAR sweeps and 2D images while serving as a physical regularization to enhance the LiDAR-camera fused volumetric representation. Specifically, we first propose a Geometric- and Semantic-aware Fusion (GSFusion) module to explicitly enhance LiDAR features by incorporating neighboring camera features through a K-nearest neighbors (KNN) search. Then, we employ volume rendering to project the fused feature back to the image planes for reconstructing color and depth maps. These maps are then supervised by input images from the camera and depth estimations derived from LiDAR, respectively. Extensive experiments on the popular nuScenes and SemanticKITTI benchmarks verify the effectiveness of our Co-Occ for 3D semantic occupancy prediction. The project page is available at this https URL.
https://arxiv.org/abs/2404.04561
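The KNN gathering step in GSFusion can be pictured as: for each LiDAR point, find its k nearest camera-feature locations and fold their aggregated feature into the LiDAR feature. This is a heavily simplified sketch; the actual module fuses learned features in a shared volumetric space, and the brute-force search, mean aggregation, and additive fusion here are assumptions.

```python
import numpy as np

def knn_fuse(lidar_feats, lidar_xyz, cam_feats, cam_xyz, k=2):
    """GSFusion-style sketch: enhance each LiDAR feature with the mean
    of the features at its k nearest camera-feature locations."""
    fused = []
    for feat, point in zip(lidar_feats, lidar_xyz):
        dists = np.linalg.norm(cam_xyz - point, axis=1)
        nearest = np.argsort(dists)[:k]          # indices of k closest
        fused.append(feat + cam_feats[nearest].mean(axis=0))
    return np.asarray(fused)
```

In practice one would replace the brute-force distance loop with a spatial index, but the sketch conveys how neighboring camera features explicitly enrich the LiDAR branch before the volume-rendering regularization is applied.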