Early diagnosis of Alzheimer's disease (AD) is critical for timely medical treatment, and eye movements under specially designed visual stimuli may serve as a potential non-invasive biomarker for detecting cognitive abnormalities in AD patients. In this paper, we propose a Depth-Induced Saliency Comparison Network (DISCN) for eye movement analysis, which may be used to diagnose AD. In DISCN, a salient attention module fuses normal eye movements with the RGB and depth maps of the visual stimuli using hierarchical salient attention (SAA) to evaluate comprehensive saliency maps, which contain information from both the visual stimuli and normal eye movement behaviors. In addition, we introduce a serial attention module (SEA) to emphasize the most abnormal eye movement behaviors, reducing personal bias for a more robust result. In our experiments, DISCN consistently distinguishes the eye movements of AD patients from those of normal controls.
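As a rough illustration of the comparison idea (not the paper's actual SAA/SEA modules; all names, shapes, and the scoring rule below are hypothetical), one can fuse RGB- and depth-derived saliency into a single map and score a subject's fixations against it:

```python
# Hypothetical sketch: fuse a stimulus saliency map (from RGB + depth)
# and score how far a subject's fixations deviate from it.
import torch

def fused_saliency(rgb_feat, depth_feat):
    """Average RGB- and depth-derived saliency logits into one map.
    rgb_feat, depth_feat: (H, W) tensors of per-pixel saliency logits."""
    return torch.sigmoid((rgb_feat + depth_feat) / 2)

def fixation_deviation(saliency, fixations):
    """Mean negative log-saliency at fixated pixels; higher = more atypical.
    fixations: (N, 2) integer tensor of (row, col) gaze positions."""
    vals = saliency[fixations[:, 0], fixations[:, 1]].clamp_min(1e-6)
    return (-vals.log()).mean()

saliency = fused_saliency(torch.randn(64, 64), torch.randn(64, 64))
fix = torch.randint(0, 64, (20, 2))          # toy scanpath
print(fixation_deviation(saliency, fix))      # scalar abnormality score
```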
https://arxiv.org/abs/2403.10124
While we enjoy the richness and informativeness of multimodal data, it also introduces interference and redundancy of information. To achieve optimal domain interpretation with limited resources, we propose CSDNet, a lightweight \textbf{C}ross \textbf{S}hallow and \textbf{D}eep Perception \textbf{Net}work designed to integrate two modalities with less coherence, thereby discarding redundant information or even an entire modality. We implement CSDNet for the Salient Object Detection (SOD) task in robotic perception. The proposed method capitalises on spatial information prescreening and implicit coherence navigation across the shallow and deep layers of the depth-thermal (D-T) modality, prioritising integration over fusion to maximise scene interpretation. To further refine the descriptive capability of the encoder for the less-studied D-T modalities, we also propose SAMAEP to guide an effective feature mapping to the generalised feature space. Our approach is tested on the VDT-2048 dataset; leveraging the D-T modality, it outperforms SOTA methods using RGB-T or RGB-D modalities for the first time, and achieves performance comparable to the RGB-D-T triple-modality benchmark method while running 5.97 times faster and requiring only 0.0036 times the FLOPs. This demonstrates that the proposed CSDNet effectively integrates the information from the D-T modality. The code will be released upon acceptance.
https://arxiv.org/abs/2403.10104
Neural network quantization is an essential technique for deploying models on resource-constrained devices. However, its impact on model perceptual fields, particularly regarding class activation maps (CAMs), remains a significant area of investigation. In this study, we explore how quantization alters the spatial recognition ability of the perceptual field of vision models, shedding light on the alignment between CAMs and visual saliency maps across various architectures. Leveraging a dataset of 10,000 images from ImageNet, we rigorously evaluate six diverse foundational CNNs: VGG16, ResNet50, EfficientNet, MobileNet, SqueezeNet, and DenseNet. We uncover nuanced changes in CAMs and their alignment with human visual saliency maps through systematic quantization techniques applied to these models. Our findings reveal the varying sensitivities of different architectures to quantization and underscore its implications for real-world applications in terms of model performance and interpretability. The primary contribution of this work revolves around deepening our understanding of neural network quantization, providing insights crucial for deploying efficient and interpretable models in practical settings.
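The abstract does not state which alignment measure is used; a minimal sketch of one standard choice, the Pearson correlation coefficient (the "CC" metric common in saliency evaluation) between a CAM and a human saliency map, might look like this:

```python
# Assumed alignment measure between a class activation map and a human
# visual saliency map: Pearson correlation on matching grids.
import numpy as np

def cam_saliency_cc(cam: np.ndarray, saliency: np.ndarray) -> float:
    cam = (cam - cam.mean()) / (cam.std() + 1e-8)     # standardize both maps
    sal = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    return float((cam * sal).mean())

cam_fp32 = np.random.rand(14, 14)   # CAM from the float model (toy data)
cam_int8 = np.random.rand(14, 14)   # CAM from the quantized model
human = np.random.rand(14, 14)      # human saliency map on the same grid
print(cam_saliency_cc(cam_fp32, human), cam_saliency_cc(cam_int8, human))
```

Comparing the two scores per architecture would quantify how much quantization shifts the CAM away from human saliency.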
https://arxiv.org/abs/2403.09939
The event camera, a novel bio-inspired vision sensor, has drawn a lot of attention for its low latency, low power consumption, and high dynamic range. Currently, overfitting remains a critical problem in event-based classification tasks for Spiking Neural Networks (SNNs) due to their relatively weak spatial representation capability. Data augmentation is a simple but efficient method to alleviate overfitting and improve the generalization ability of neural networks, and saliency-based augmentation methods have proven effective in the image processing field. However, no approach is available for extracting saliency maps from SNNs. Therefore, for the first time, we present the Spiking Layer-Time-wise Relevance Propagation rule (SLTRP) and the Spiking Layer-wise Relevance Propagation rule (SLRP), which enable SNNs to generate stable and accurate CAMs and saliency maps. Based on this, we propose EventRPG, which leverages relevance propagation on the spiking neural network for more efficient augmentation. Our proposed method has been evaluated on several SNN structures, achieving state-of-the-art performance on object recognition tasks including N-Caltech101 and CIFAR10-DVS, with accuracies of 85.62% and 85.55% respectively, as well as on the action recognition task SL-Animals with an accuracy of 91.59%. Our code is available at this https URL.
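For intuition, here is the standard epsilon-rule layer-wise relevance propagation on a toy two-layer ReLU network; SLRP/SLTRP adapt this kind of rule to spiking dynamics, so this is only a conventional-ANN sketch, not the paper's rules:

```python
# Toy epsilon-rule LRP through a two-layer ReLU network (numpy only).
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(4, 3))
x = rng.normal(size=8)
h = np.maximum(W1.T @ x, 0)             # hidden activations
y = W2.T @ h                            # output logits

def lrp_layer(a, W, R_out, eps=1e-6):
    """Redistribute relevance R_out from a layer's outputs to its inputs a."""
    z = W.T @ a                          # pre-activations (denominator)
    s = R_out / (z + eps * np.sign(z))   # stabilized relevance ratios
    return a * (W @ s)                   # input relevances

R_y = np.zeros(3); R_y[y.argmax()] = y.max()    # start from the top class
R_x = lrp_layer(x, W1, lrp_layer(h, W2, R_y))   # propagate to the input
print(R_x)                                       # per-input relevance scores
```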
https://arxiv.org/abs/2403.09274
Templates serve as a good starting point for implementing a design (e.g., a banner or slide), but creating them manually takes designers great effort. In this paper, we present Desigen, an automatic template creation pipeline that generates background images as well as harmonious layout elements over the background. Unlike natural images, a background image should preserve enough non-salient space for the overlaid layout elements. To equip existing advanced diffusion-based models with stronger spatial control, we propose two simple but effective techniques that constrain the saliency distribution and reduce the attention weight in desired regions during the background generation process. Then, conditioned on the background, we synthesize the layout with a Transformer-based autoregressive generator. To achieve a more harmonious composition, we propose an iterative inference strategy that adjusts the synthesized background and layout over multiple rounds. We construct a design dataset of more than 40k advertisement banners to verify our approach. Extensive experiments demonstrate that the proposed pipeline generates high-quality templates comparable to those of human designers. Beyond single-page designs, we further show an application to presentation generation that outputs a set of theme-consistent slides. The data and code are available at this https URL.
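A hedged sketch of the attention-weight reduction idea: subtract a penalty from cross-attention logits at positions that should remain non-salient, before the softmax. The penalty scale and tensor names are illustrative, not Desigen's implementation:

```python
# Suppress cross-attention at positions reserved for layout elements.
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, keep_clear, penalty=4.0):
    """q: (L, d) text tokens; k, v: (N, d) image positions;
    keep_clear: (N,) bool, True where layout elements will be overlaid."""
    logits = q @ k.T / q.shape[-1] ** 0.5
    logits = logits - penalty * keep_clear.float()   # downweight these regions
    return F.softmax(logits, dim=-1) @ v

q, k, v = torch.randn(5, 32), torch.randn(49, 32), torch.randn(49, 32)
mask = torch.zeros(49, dtype=torch.bool); mask[:20] = True
out = masked_attention(q, k, v, mask)    # (5, 32) attended features
```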
https://arxiv.org/abs/2403.09093
Convolutional Neural Networks (CNNs) are nowadays the model of choice in Computer Vision, thanks to their ability to automate the feature extraction process in visual tasks. However, the knowledge acquired during training is fully subsymbolic, and hence difficult to understand and explain to end users. In this paper, we propose a new technique called HOLMES (HOLonym-MEronym based Semantic inspection) that decomposes a label into a set of related concepts and provides component-level explanations for an image classification model. Specifically, HOLMES leverages ontologies, web scraping, and transfer learning to automatically construct meronym (part)-based detectors for a given holonym (class). It then produces heatmaps at the meronym level and finally, by probing the holonym CNN with occluded images, highlights the importance of each part for the classification output. Compared to state-of-the-art saliency methods, HOLMES takes a step further and provides information about both where and what the holonym CNN is looking at, without relying on densely annotated datasets and without forcing concepts to be associated with single computational units. Extensive experimental evaluation on different categories of objects (animals, tools, and vehicles) shows the feasibility of our approach. On average, HOLMES explanations include at least two meronyms, and the ablation of a single meronym roughly halves the holonym model's confidence. The resulting heatmaps were quantitatively evaluated using deletion/insertion/preservation curves. All metrics were comparable to those achieved by GradCAM, while offering the advantage of further decomposing the heatmap into human-understandable concepts, thus highlighting both the relevance of meronyms to object classification and HOLMES's ability to capture it. The code is available at this https URL.
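The occlusion probing step can be sketched as follows; `model` and the meronym bounding box are placeholders, and the grey-fill choice is an assumption:

```python
# Occlude one detected part (meronym) and record the drop in the class
# (holonym) confidence.
import torch

def part_importance(model, image, box, target_class):
    """image: (1, 3, H, W); box: (y0, y1, x0, x1) of the meronym region."""
    with torch.no_grad():
        base = torch.softmax(model(image), dim=1)[0, target_class]
        occluded = image.clone()
        y0, y1, x0, x1 = box
        occluded[:, :, y0:y1, x0:x1] = image.mean()   # grey out the part
        drop = base - torch.softmax(model(occluded), dim=1)[0, target_class]
    return drop.item()  # per the abstract, ablating one meronym roughly
                        # halves the holonym confidence on average

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3*64*64, 10))
print(part_importance(model, torch.rand(1, 3, 64, 64), (10, 30, 10, 30), 3))
```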
https://arxiv.org/abs/2403.08536
In this work we focus on learning facial representations that can be adapted to train effective face recognition models, particularly in the absence of labels. Firstly, compared with existing labelled face datasets, a vastly larger magnitude of unlabeled faces exists in the real world. We explore a learning strategy for these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. Moreover, motivated by the recent finding that the face saliency area is critical for face recognition, instead of utilizing randomly cropped blocks of images to construct augmentations in pretraining, we utilize patches localized by extracted facial landmarks. This enables our method, namely LAndmark-based Facial Self-supervised learning (LAFS), to learn key representations that are more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With the learned landmark-based facial representations, we further adapt the representation for face recognition with regularization mitigating variations in landmark positions. Our method achieves significant improvement over the state-of-the-art on multiple face recognition benchmarks, especially in more challenging few-shot scenarios.
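A minimal sketch of landmark-localized patch extraction (as opposed to random crops); the patch size and the landmark source are assumptions:

```python
# Crop fixed-size patches centred on detected facial landmarks.
import numpy as np

def landmark_patches(image, landmarks, size=32):
    """image: (H, W, 3); landmarks: (K, 2) array of (x, y) facial points."""
    H, W = image.shape[:2]
    half = size // 2
    patches = []
    for x, y in landmarks.astype(int):
        x = np.clip(x, half, W - half)   # keep the crop inside the image
        y = np.clip(y, half, H - half)
        patches.append(image[y - half:y + half, x - half:x + half])
    return np.stack(patches)             # (K, size, size, 3)

img = np.random.rand(112, 112, 3)
pts = np.array([[30, 40], [80, 40], [55, 70], [40, 90], [70, 90]])  # 5 points
print(landmark_patches(img, pts).shape)  # (5, 32, 32, 3)
```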
https://arxiv.org/abs/2403.08161
Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation. However, most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects -- not necessarily relevant to navigation and potentially misleading. Alternative approaches train specialized navigation models from scratch, requiring significant computation. On the other hand, self-supervised learning has revolutionized computer vision and natural language processing, but its application to robotic navigation remains underexplored due to the difficulty of defining effective self-supervision signals. Motivated by these observations, in this work, we propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP). Instead of detecting salient objects that are beneficial for tasks such as classification or detection, VANP learns to focus only on specific visual regions that are relevant to the navigation task. To achieve this, VANP uses a history of visual observations, future actions, and a goal image for self-supervision, and embeds them using two small Transformer Encoders. Then, VANP maximizes the information between the embeddings by using a mutual information maximization objective function. We demonstrate that most VANP-extracted features match human navigation intuition. VANP achieves performance comparable to models learned end-to-end with half the training time, and to models trained on a large-scale, fully supervised dataset, i.e., ImageNet, with only 0.08% of the data.
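The abstract does not name the exact objective; one common instantiation of mutual-information maximization between two embedding batches is the InfoNCE loss, sketched here with matched rows treated as positive pairs:

```python
# InfoNCE lower bound on mutual information between paired embeddings.
import torch
import torch.nn.functional as F

def info_nce(z_vision, z_action, temperature=0.1):
    """z_vision, z_action: (B, d) embeddings from the two encoders."""
    z_v = F.normalize(z_vision, dim=1)
    z_a = F.normalize(z_action, dim=1)
    logits = z_v @ z_a.T / temperature        # (B, B) similarity matrix
    labels = torch.arange(z_v.shape[0])       # diagonal entries are positives
    return F.cross_entropy(logits, labels)

print(info_nce(torch.randn(16, 128), torch.randn(16, 128)))  # scalar loss
```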
https://arxiv.org/abs/2403.08109
While previous studies have demonstrated successful 3D object shape completion given a sufficient number of points, they often fail in scenarios where only a few points, e.g. tens of points, are observed. Surprisingly, via entropy analysis, we find that even a few points, e.g. 64 points, can retain substantial information to help recover the 3D shape of an object. To address the challenge of shape completion with very sparse point clouds, we propose the Few-point Shape Completion (FSC) model, which contains a novel dual-branch feature extractor for handling extremely sparse inputs, coupling an extensive branch for maximal point utilization with a saliency branch for dynamic importance assignment. This model is further bolstered by a two-stage revision network that refines both the extracted features and the decoder output, enhancing the detail and authenticity of the completed point cloud. Our experiments demonstrate the feasibility of recovering 3D shapes from a few points. The proposed FSC model outperforms previous methods on both few-point and many-point inputs, and shows good generalizability to different object categories.
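A hedged sketch of the dual-branch idea (layer sizes and names are illustrative, not the paper's architecture): an extensive branch max-pools every point's features while a saliency branch learns per-point importance weights for a weighted sum:

```python
# Toy dual-branch encoder for very sparse point clouds.
import torch
import torch.nn as nn

class DualBranchEncoder(nn.Module):
    def __init__(self, d_in=3, d_feat=64):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(d_in, d_feat), nn.ReLU(),
                                       nn.Linear(d_feat, d_feat))
        self.saliency = nn.Linear(d_feat, 1)   # per-point importance logit

    def forward(self, pts):                    # pts: (B, N, 3), N can be ~64
        f = self.point_mlp(pts)                # (B, N, d_feat)
        extensive = f.max(dim=1).values        # use every point
        w = self.saliency(f).softmax(dim=1)    # (B, N, 1) dynamic importance
        salient = (w * f).sum(dim=1)           # importance-weighted code
        return torch.cat([extensive, salient], dim=-1)

enc = DualBranchEncoder()
print(enc(torch.rand(2, 64, 3)).shape)         # (2, 128) global code
```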
https://arxiv.org/abs/2403.07359
The use of generative AI to create text descriptions from graphs has mostly focused on knowledge graphs, which connect concepts using facts. In this work we explore the capability of large pretrained language models to generate text from causal graphs, where salient concepts are represented as nodes and causality is represented via directed, typed edges. The causal reasoning encoded in these graphs can support applications as diverse as healthcare or marketing. Using two publicly available causal graph datasets, we empirically investigate the performance of four GPT-3 models under various settings. Our results indicate that while causal text descriptions improve with training data, compared to fact-based graphs, they are harder to generate under zero-shot settings. Results further suggest that users of generative AI can deploy future applications faster since similar performances are obtained when training a model with only a few examples as compared to fine-tuning via a large curated dataset.
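As an illustration of how a causal graph might be linearized into an LLM prompt (the edge types and phrasing here are invented, not taken from the paper's datasets):

```python
# Linearize directed, typed causal edges into a text-generation prompt.
edges = [
    ("smoking", "promotes", "lung damage"),
    ("lung damage", "causes", "shortness of breath"),
]

def graph_to_prompt(edges):
    facts = "; ".join(f"'{s}' {rel} '{t}'" for s, rel, t in edges)
    return (f"Causal relations: {facts}. "
            "Describe this causal chain in fluent prose:")

print(graph_to_prompt(edges))
```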
https://arxiv.org/abs/2403.07118
Preference-based learning aims to align robot task objectives with human values. One of the most common methods to infer human preferences is by pairwise comparisons of robot task trajectories. Traditional comparison-based preference labeling systems seldom support labelers in digesting and identifying critical differences between complex trajectories recorded in videos. Our formative study (N = 12) suggests that individuals may overlook non-salient task features and establish biased preference criteria during their preference elicitation process because of partial observations. In addition, they may experience mental fatigue when given many pairs to compare, causing their label quality to deteriorate. To mitigate these issues, we propose FARPLS, a Feature-Augmented Robot trajectory Preference Labeling System. FARPLS highlights potential outliers in a wide variety of task features that matter to humans and extracts the corresponding video keyframes for easy review and comparison. It also dynamically adjusts the labeling order according to users' familiarity, the difficulty of the trajectory pair, and the level of disagreement. At the same time, the system monitors labelers' consistency and provides feedback on labeling progress to keep labelers engaged. A between-subjects study (N = 42, 105 pairs of robot pick-and-place trajectories per person) shows that FARPLS can help users establish preference criteria more easily and notice more relevant details in the presented trajectories than the conventional interface. FARPLS also improves labeling consistency and engagement, mitigating challenges in preference elicitation without significantly raising cognitive loads.
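The outlier-highlighting component might be sketched as simple per-feature z-scoring across trajectories (the feature names and threshold are hypothetical):

```python
# Flag unusual task-feature values so the interface can surface them.
import numpy as np

features = np.array([  # rows: trajectories; cols: e.g. speed, jerk, clearance
    [0.9, 0.2, 0.15],
    [1.0, 0.3, 0.14],
    [2.8, 0.2, 0.02],   # unusually fast and close to obstacles
])
z = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-8)
outliers = np.abs(z) > 1.0        # boolean mask of values to highlight
print(np.argwhere(outliers))      # (trajectory, feature) pairs to show
```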
https://arxiv.org/abs/2403.06267
As a bio-inspired vision sensor, the spike camera emulates the operational principles of the fovea, a compact retinal region, by employing spike discharges to encode the accumulation of per-pixel luminance intensity. Leveraging its high temporal resolution and bio-inspired neuromorphic design, the spike camera holds significant promise for advancing computer vision applications. Saliency detection mimics the behavior of human beings and captures the most salient regions from scenes. In this paper, we investigate visual saliency in the continuous spike stream for the first time. To effectively process the binary spike stream, we propose a Recurrent Spiking Transformer (RST) framework, which is based on a full spiking neural network. Our framework enables the extraction of spatio-temporal features from the continuous spike stream while maintaining low power consumption. To facilitate the training and validation of our proposed model, we build a comprehensive real-world spike-based visual saliency dataset covering numerous lighting conditions. Extensive experiments demonstrate the superior performance of our Recurrent Spiking Transformer framework in comparison to other spiking neural network-based methods, with a substantial margin of improvement in capturing and highlighting visual saliency in the spike stream. This not only provides a new perspective for spike-based saliency segmentation but also shows a new paradigm for full SNN-based transformer models. The code and dataset are available at \url{this https URL}.
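The integrate-and-fire luminance encoding described above can be sketched in a few lines; the threshold and input are arbitrary toy values:

```python
# Toy integrate-and-fire encoding as performed by a spike camera: each
# pixel accumulates luminance and emits a spike (reset-by-subtraction)
# whenever the accumulator crosses a threshold.
import numpy as np

rng = np.random.default_rng(1)
frames = rng.random((100, 4, 4))          # (T, H, W) luminance over time
acc = np.zeros((4, 4))
spikes = np.zeros_like(frames, dtype=bool)
theta = 5.0                               # firing threshold

for t, frame in enumerate(frames):
    acc += frame                          # per-pixel accumulation
    fired = acc >= theta
    spikes[t] = fired
    acc[fired] -= theta                   # reset-by-subtraction

print(spikes.sum(axis=0))                 # spike counts encode brightness
```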
https://arxiv.org/abs/2403.06233
The proliferation of mobile devices and social media has revolutionized content dissemination, with short-form video becoming increasingly prevalent. This shift has introduced the challenge of video reframing to fit various screen aspect ratios, a process that highlights the most compelling parts of a video. Traditionally, video reframing is a manual, time-consuming task requiring professional expertise, which incurs high production costs. A potential solution is to adopt machine learning models, such as video salient object detection, to automate the process. However, these methods often lack generalizability due to their reliance on specific training data. The advent of powerful large language models (LLMs) opens new avenues for AI capabilities. Building on this, we introduce Reframe Any Video Agent (RAVA), an LLM-based agent that leverages visual foundation models and human instructions to restructure visual content for video reframing. RAVA operates in three stages: perception, where it interprets user instructions and video content; planning, where it determines aspect ratios and reframing strategies; and execution, where it invokes editing tools to produce the final video. Our experiments validate the effectiveness of RAVA in video salient object detection and real-world reframing tasks, demonstrating its potential as a tool for AI-powered video editing.
https://arxiv.org/abs/2403.06070
Accurate state estimation plays a critical role in ensuring the robust control of humanoid robots, particularly in the context of learning-based control policies for legged robots. However, there is a notable gap in analytical research concerning these estimations. Therefore, we endeavor to further understand how various types of estimations influence the decision-making processes of policies. In this paper, we provide quantitative insight into the effectiveness of learned state estimations, employing saliency analysis to identify key estimation variables and optimize their combination for humanoid locomotion tasks. Evaluations assessing tracking precision and robustness are conducted on comparative groups of policies with varying estimation combinations in both simulated and real-world environments. The results validate that the proposed policy is capable of crossing the sim-to-real gap and demonstrates superior performance relative to alternative policy configurations.
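A minimal sketch of gradient-based saliency over estimated state variables, using a stand-in MLP policy rather than the paper's trained controller:

```python
# The magnitude of the policy output's gradient w.r.t. each input
# dimension indicates which estimation variables the policy relies on.
import torch

policy = torch.nn.Sequential(torch.nn.Linear(12, 64), torch.nn.Tanh(),
                             torch.nn.Linear(64, 6))   # 12 states -> 6 joints
state = torch.randn(1, 12, requires_grad=True)         # estimated state
action = policy(state)
action.abs().sum().backward()                          # aggregate sensitivity
saliency = state.grad.abs().squeeze()                  # (12,) per-variable
print(saliency / saliency.sum())                       # normalized importance
```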
https://arxiv.org/abs/2403.05868
Deep Neural Networks have often been called black boxes because of their complex, deep architectures and the opacity of their inner layers. There is a lack of trust in using Artificial Intelligence in critical and high-precision fields such as security, finance, health, and manufacturing. Much focused work has been done to provide interpretable models, intending to deliver meaningful insights into the inner workings and behavior of neural networks. In our research, we compare state-of-the-art Activation-Based Methods (ABM) for interpreting predictions of CNN models, specifically in the application of image classification. We then extend the comparison to eight CNN-based architectures to examine the differences in visualization and thus interpretability. We introduce a novel technique, Feature CAM, which falls in the perturbation-activation combination, to create fine-grained, class-discriminative visualizations. The resulting saliency maps from our experiments proved to be 3-4 times more human-interpretable than the state-of-the-art in ABM, while preserving machine interpretability, measured as the average confidence score in classification.
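The abstract does not detail the perturbation-activation combination; one plausible reading, sketched here purely as an assumption, weights each activation map by the confidence drop caused by masking the region it highlights:

```python
# Hypothetical perturbation-weighted CAM (a guess at the mechanism, not
# the paper's verified Feature CAM algorithm).
import numpy as np

def perturbation_weighted_cam(acts, score_fn, image):
    """acts: (C, h, w) last-conv activation maps; score_fn(img) -> float."""
    base = score_fn(image)
    weights = np.empty(len(acts))
    for c, a in enumerate(acts):
        m = (a - a.min()) / (a.max() - a.min() + 1e-8)  # normalize to [0, 1]
        mask = np.kron(m, np.ones((8, 8)))              # upsample to 64x64
        weights[c] = base - score_fn(image * (1 - mask[..., None]))
    cam = np.maximum((weights[:, None, None] * acts).sum(0), 0)
    return cam / (cam.max() + 1e-8)

img = np.random.rand(64, 64, 3)
acts = np.random.rand(4, 8, 8)
fake_score = lambda x: float(x.mean())                  # stand-in classifier
print(perturbation_weighted_cam(acts, fake_score, img).shape)  # (8, 8)
```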
https://arxiv.org/abs/2403.05658
Biophilia is an innate love for living things and nature itself that has been associated with a positive impact on mental health and well-being. This study explores the application of deep learning methods to the classification of Biophilic artwork, in order to learn and explain the different Biophilic characteristics present in a visual representation of a painting. Building on the concept of Biophilia, which postulates the deep connection of human beings with nature, we use an artificially intelligent algorithm to recognise the different patterns underlying the Biophilic features in an artwork. Our proposed method uses a lower-dimensional representation of an image and a decoder model to extract salient features of each Biophilic trait in the image, such as plants, water bodies, seasons, and animals, based on learnt factors such as shape, texture, and illumination. The proposed classification model not only helps artists, collectors, and researchers interpret and exploit the effects of exposure to nature-inspired visual aesthetics on mental well-being, but also enables a methodical exploration of Biophilia and Biophilic artwork in relation to aesthetic preferences. Using the proposed algorithms, we have also created a gallery of Biophilic collections comprising famous artworks from different European and American art galleries, which will soon be published on the Vieunite@ online community.
https://arxiv.org/abs/2403.05394
Adversarial attack methods based on point manipulation for 3D point cloud classification have revealed the fragility of 3D models, yet the adversarial examples they produce are easily perceived or defended against. The trade-off between imperceptibility and adversarial strength leads most point attack methods to inevitably introduce easily detectable outlier points upon a successful attack. Another promising strategy, shape-based attack, can effectively eliminate outliers, but existing methods often suffer significant reductions in imperceptibility due to irrational deformations. We find that concealing deformation perturbations in areas insensitive to human eyes can achieve a better trade-off between imperceptibility and adversarial strength, specifically in parts of the object surface that are complex and exhibit drastic curvature changes. Therefore, we propose a novel shape-based adversarial attack method, HiT-ADV, which initially conducts a two-stage search for attack regions based on saliency and imperceptibility scores, and then adds deformation perturbations in each attack region using Gaussian kernel functions. Additionally, HiT-ADV is extendable to physical attack. We propose that by employing benign resampling and benign rigid transformations, we can further enhance physical adversarial strength with little sacrifice to imperceptibility. Extensive experiments have validated the superiority of our method in terms of adversarial and imperceptible properties in both digital and physical spaces. Our code is available at: this https URL.
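The Gaussian-kernel deformation can be sketched as follows; the anchor choice, kernel width, and strength are illustrative:

```python
# Displace points near an anchor with Gaussian-decaying weights, so the
# deformation is smooth and creates no isolated outlier points.
import numpy as np

def gaussian_deform(points, anchor, direction, strength=0.05, sigma=0.1):
    """points: (N, 3); anchor: (3,); direction: unit (3,) displacement."""
    d2 = ((points - anchor) ** 2).sum(axis=1)   # squared distances to anchor
    w = np.exp(-d2 / (2 * sigma ** 2))          # Gaussian falloff
    return points + strength * w[:, None] * direction

pts = np.random.rand(1024, 3)
out = gaussian_deform(pts, pts[0], np.array([0.0, 0.0, 1.0]))
print(np.abs(out - pts).max())                  # bounded, smooth shift
```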
https://arxiv.org/abs/2403.05247
Recent years have witnessed significant advancement in face recognition (FR) techniques, with their applications widely spread in people's lives and security-sensitive areas. There is a growing need for reliable interpretations of the decisions of such systems. Existing studies relying on various mechanisms have investigated the usage of saliency maps as an explanation approach, but suffer from different limitations. This paper first explores the spatial relationship between a face image and its deep representation via gradient backpropagation. Then a new explanation approach, FGGB, is conceived, which provides precise and insightful similarity and dissimilarity saliency maps to explain the "Accept" and "Reject" decisions of an FR system. Extensive visual presentation and quantitative measurement have shown that FGGB achieves superior performance in both similarity and dissimilarity maps when compared to current state-of-the-art explainable face verification approaches.
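A minimal sketch of gradient-backpropagation saliency for verification: backpropagate the cosine similarity between two face embeddings to one input image; large-magnitude pixels drive the accept/reject decision. The embedding network below is a stand-in, not an actual FR model:

```python
# Gradient of the verification similarity score w.r.t. the probe pixels.
import torch
import torch.nn.functional as F

embed = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 112 * 112, 128))
probe = torch.rand(1, 3, 112, 112, requires_grad=True)
reference = torch.rand(1, 3, 112, 112)

score = F.cosine_similarity(embed(probe), embed(reference))
score.backward()
saliency = probe.grad.abs().max(dim=1).values   # (1, 112, 112) pixel map
print(saliency.shape, score.item())
```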
https://arxiv.org/abs/2403.04549
Neural Radiance Fields (NeRF) have quickly become the primary approach for 3D reconstruction and novel view synthesis in recent years due to their remarkable performance. Despite the huge interest in NeRF methods, a practical use case of NeRFs has largely been ignored: the exploration of the scene space modelled by a NeRF. In this paper, for the first time in the literature, we propose and formally define the scene exploration framework as the efficient discovery of NeRF model inputs (i.e. coordinates and viewing angles), using which one can render novel views that adhere to user-selected criteria. To remedy the lack of approaches addressing scene exploration, we first propose two baseline methods called Guided-Random Search (GRS) and Pose Interpolation-based Search (PIBS). We then cast scene exploration as an optimization problem, and propose the criteria-agnostic Evolution-Guided Pose Search (EGPS) for efficient exploration. We test all three approaches with various criteria (e.g. saliency maximization, image quality maximization, photo-composition quality improvement) and show that our EGPS performs more favourably than the other baselines. We finally highlight key points and limitations, and outline directions for future research in scene exploration.
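A compact evolution-style pose search in the spirit of EGPS (selection plus Gaussian perturbation); `render_and_score` stands in for rendering a NeRF view and evaluating the user-selected criterion, e.g. a saliency score:

```python
# Criteria-agnostic evolutionary search over camera poses: keep the
# best-scoring poses, perturb them, and repeat.
import numpy as np

rng = np.random.default_rng(0)

def render_and_score(pose):            # placeholder criterion
    return -np.linalg.norm(pose - np.array([1.0, 2.0, 0.5, 0.1, 0.3]))

pop = rng.normal(size=(32, 5))         # (x, y, z, yaw, pitch) candidates
for _ in range(50):
    scores = np.array([render_and_score(p) for p in pop])
    elite = pop[np.argsort(scores)[-8:]]              # keep the top 8
    children = elite[rng.integers(0, 8, 24)] + 0.1 * rng.normal(size=(24, 5))
    pop = np.vstack([elite, children])                # next generation

print(pop[np.argmax([render_and_score(p) for p in pop])])  # best pose found
```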
https://arxiv.org/abs/2403.04508
The advent of large vision-language models (LVLMs) represents a noteworthy advancement towards the pursuit of artificial general intelligence. However, the extent of their efficacy across both specialized and general tasks warrants further investigation. This article endeavors to evaluate the competency of popular LVLMs in specialized and general tasks, respectively, aiming to offer a comprehensive understanding of these innovative methodologies. To gauge their efficacy in specialized tasks, we tailor a comprehensive testbed comprising three distinct scenarios: natural, healthcare, and industrial, encompassing six challenging tasks. These tasks include salient, camouflaged, and transparent object detection, as well as polyp and skin lesion detection, alongside industrial anomaly detection. We examine the performance of three recent open-source LVLMs -- MiniGPT-v2, LLaVA-1.5, and Shikra -- in the realm of visual recognition and localization. Moreover, we conduct empirical investigations utilizing the aforementioned models alongside GPT-4V, assessing their multi-modal understanding capacities in general tasks such as object counting, absurd question answering, affordance reasoning, attribute recognition, and spatial relation reasoning. Our investigations reveal that these models demonstrate limited proficiency not only in specialized tasks but also in general tasks. We delve deeper into this inadequacy and suggest several potential factors, including limited cognition in specialized tasks, object hallucination, text-to-image interference, and decreased robustness in complex problems. We hope this study will provide valuable insights for the future development of LVLMs, augmenting their power in coping with both general and specialized applications.
https://arxiv.org/abs/2403.04306