In this paper, we investigate an open research task of cross-modal retrieval between 3D shapes and textual descriptions. Previous approaches mainly rely on point cloud encoders for feature extraction, which may ignore key inherent features of 3D shapes, including depth, spatial hierarchy, geometric continuity, etc. To address this issue, we propose COM3D, making the first attempt to exploit the cross-view correspondence and cross-modal mining to enhance the retrieval performance. Notably, we augment the 3D features through a scene representation transformer, to generate cross-view correspondence features of 3D shapes, which enrich the inherent features and enhance their compatibility with text matching. Furthermore, we propose to optimize the cross-modal matching process based on the semi-hard negative example mining method, in an attempt to improve the learning efficiency. Extensive quantitative and qualitative experiments demonstrate the superiority of our proposed COM3D, achieving state-of-the-art results on the Text2Shape dataset.
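The semi-hard negative mining step can be sketched as follows. This is a minimal, framework-free toy (the margin value, similarity measure, and fallback rule are illustrative assumptions, not the paper's settings): a semi-hard negative is one that is less similar to the anchor than the positive, but still within the margin.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def semi_hard_negative(anchor, positive, candidates, margin=0.2):
    """Pick the semi-hard negative: less similar to the anchor than the
    positive, but within the margin; take the hardest among those."""
    pos_sim = cosine(anchor, positive)
    semi_hard = [c for c in candidates
                 if pos_sim - margin < cosine(anchor, c) < pos_sim]
    if not semi_hard:  # fallback: hardest negative overall (illustrative choice)
        return max(candidates, key=lambda c: cosine(anchor, c))
    return max(semi_hard, key=lambda c: cosine(anchor, c))

anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [0.8, 0.3], [0.99, 0.05]]
# [0.99, 0.05] is *harder* than the positive, so it is excluded;
# [0.8, 0.3] is the hardest semi-hard candidate.
chosen = semi_hard_negative(anchor, positive, negatives)
```

Excluding negatives that beat the positive avoids collapsed training signals, which is the usual motivation for semi-hard (rather than hardest) mining.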
https://arxiv.org/abs/2405.04103
Recent video object segmentation (VOS) networks typically use memory-based methods: for each query frame, the mask is predicted by space-time matching against memory frames. Although these methods achieve superior performance, they suffer from two issues: 1) challenging data can destroy the space-time coherence between adjacent video frames; 2) pixel-level matching leads to undesired mismatches caused by noise or distractors. To address these issues, we first propose to generate an auxiliary frame between adjacent frames, serving as an implicit short-temporal reference for the query frame. Next, we learn a prototype for each video object so that prototype-level matching can be performed between the query and memory. Experiments demonstrate that our network outperforms the state-of-the-art method on DAVIS 2017, achieving a J&F score of 86.4%, and attains a competitive result of 85.0% on YouTube-VOS 2018. In addition, our network exhibits a high inference speed of 32+ FPS.
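Prototype-level matching can be illustrated with a tiny sketch (not the paper's implementation; the masked-average-pooling prototype and cosine score are standard simplifications): pixel features belonging to one object are collapsed into a single prototype, and query/memory comparison then happens between prototypes instead of noisy pixels.

```python
def prototype(features, mask):
    """Masked average pooling: collapse the pixel features of one object
    into a single prototype vector."""
    dim = len(features[0])
    total = [0.0] * dim
    count = 0
    for feat, m in zip(features, mask):
        if m:
            count += 1
            for i in range(dim):
                total[i] += feat[i]
    return [t / max(count, 1) for t in total]

def match_score(query_proto, memory_proto):
    """Cosine similarity between query and memory prototypes."""
    dot = sum(a * b for a, b in zip(query_proto, memory_proto))
    na = sum(a * a for a in query_proto) ** 0.5
    nb = sum(b * b for b in memory_proto) ** 0.5
    return dot / (na * nb + 1e-8)

pixels = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
mask = [1, 0, 1]                 # only pixels 0 and 2 belong to the object
proto = prototype(pixels, mask)  # -> [2.0, 0.0]
score = match_score(proto, [1.0, 0.0])
```

Matching one prototype per object rather than every pixel is what makes the scheme robust to distractors: a few mismatching pixels no longer dominate the score.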
https://arxiv.org/abs/2405.04042
Graph condensation, which reduces the size of a large-scale graph by synthesizing a small-scale condensed graph as its substitution, has immediately benefited various graph learning tasks. However, existing graph condensation methods rely on centralized data storage, which is unfeasible for real-world decentralized data distribution, and overlook data holders' privacy-preserving requirements. To bridge the gap, we propose and study the novel problem of federated graph condensation for graph neural networks (GNNs). Specifically, we first propose a general framework for federated graph condensation, in which we decouple the typical gradient matching process for graph condensation into client-side gradient calculation and server-side gradient matching. In this way, the burdensome computation cost on the client side is largely alleviated. Besides, our empirical studies show that under the federated setting, the condensed graph will consistently leak data membership privacy, i.e., the condensed graph during the federated training can be utilized to steal the training data under membership inference attacks (MIA). To tackle this issue, we innovatively incorporate information bottleneck principles into the federated graph condensation, which only needs to extract partial node features in one local pre-training step and utilize the features during federated training. Extensive experiments on real-world datasets demonstrate that our framework can consistently protect membership privacy during training. Meanwhile, it also achieves comparable and even superior performance against existing centralized graph condensation and federated graph learning methods.
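The decoupling described above can be sketched in a few lines (a toy, GNN-free sketch; the per-layer cosine-style matching loss and FedAvg-style averaging are common choices assumed here, not necessarily the paper's exact ones): clients only compute gradients on their local data, and the server matches the condensed graph's gradient against the aggregate.

```python
def grad_distance(g_real, g_syn):
    """Gradient-matching loss: 1 - cosine similarity per layer, summed."""
    loss = 0.0
    for gr, gs in zip(g_real, g_syn):
        dot = sum(a * b for a, b in zip(gr, gs))
        nr = sum(a * a for a in gr) ** 0.5
        ns = sum(a * a for a in gs) ** 0.5
        loss += 1.0 - dot / (nr * ns + 1e-12)
    return loss

def server_round(client_grads, g_syn):
    """Server side: average the client gradients, then compute the
    matching loss against the condensed graph's gradient."""
    n = len(client_grads)
    g_avg = [[sum(c[l][i] for c in client_grads) / n
              for i in range(len(client_grads[0][l]))]
             for l in range(len(client_grads[0]))]
    return grad_distance(g_avg, g_syn)

# two clients, one "layer" of gradients each
clients = [[[1.0, 0.0]], [[0.0, 1.0]]]
g_syn = [[0.5, 0.5]]                 # condensed-graph gradient
loss = server_round(clients, g_syn)  # averaged client grad is [0.5, 0.5]
```

Because the matching (and hence the condensed-graph optimization) runs on the server, each client's cost is just one backward pass per round.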
https://arxiv.org/abs/2405.03911
LLMs have demonstrated proficiency in contextualizing their outputs using human input, often matching or beating human-level performance on a variety of tasks. However, LLMs have not yet been used to characterize synergistic learning in students' collaborative discourse. In this exploratory work, we take a first step towards adopting a human-in-the-loop prompt engineering approach with GPT-4-Turbo to summarize and categorize students' synergistic learning during collaborative discourse. Our preliminary findings suggest GPT-4-Turbo may be able to characterize students' synergistic learning in a manner comparable to humans and that our approach warrants further investigation.
https://arxiv.org/abs/2405.03677
Automatic stress detection using heart rate variability (HRV) features has gained significant traction as it utilizes unobtrusive wearable sensors measuring signals like electrocardiogram (ECG) or blood volume pulse (BVP). However, detecting stress through such physiological signals presents a considerable challenge owing to the variations in recorded signals influenced by factors, such as perceived stress intensity and measurement devices. Consequently, stress detection models developed on one dataset may perform poorly on unseen data collected under different conditions. To address this challenge, this study explores the generalizability of machine learning models trained on HRV features for binary stress detection. Our goal extends beyond evaluating generalization performance; we aim to identify the characteristics of datasets that have the most significant influence on generalizability. We leverage four publicly available stress datasets (WESAD, SWELL-KW, ForDigitStress, VerBIO) that vary in at least one of the characteristics such as stress elicitation techniques, stress intensity, and sensor devices. Employing a cross-dataset evaluation approach, we explore which of these characteristics strongly influence model generalizability. Our findings reveal a crucial factor affecting model generalizability: stressor type. Models achieved good performance across datasets when the type of stressor (e.g., social stress in our case) remains consistent. Factors like stress intensity or brand of the measurement device had minimal impact on cross-dataset performance. Based on our findings, we recommend matching the stressor type when deploying HRV-based stress models in new environments. To the best of our knowledge, this is the first study to systematically investigate factors influencing the cross-dataset applicability of HRV-based stress models.
https://arxiv.org/abs/2405.09563
This paper explores how deep learning techniques can improve visual-based SLAM performance in challenging environments. By combining deep feature extraction and deep matching methods, we introduce a versatile hybrid visual SLAM system designed to enhance adaptability in challenging scenarios, such as low-light conditions, dynamic lighting, weak-texture areas, and severe jitter. Our system supports multiple modes, including monocular, stereo, monocular-inertial, and stereo-inertial configurations. We also analyze how to combine visual SLAM with deep learning methods, to inform other research. Through extensive experiments on both public datasets and self-sampled data, we demonstrate the superiority of the SL-SLAM system over traditional approaches. The experimental results show that SL-SLAM outperforms state-of-the-art SLAM algorithms in terms of localization accuracy and tracking robustness. For the benefit of the community, we make the source code public at this https URL.
https://arxiv.org/abs/2405.03413
Image-based retrieval in large Earth observation archives is challenging because one needs to navigate across thousands of candidate matches only with the query image as a guide. By using text as information supporting the visual query, the retrieval system gains in usability, but at the same time faces difficulties due to the diversity of visual signals that cannot be summarized by a short caption only. For this reason, as a matching-based task, cross-modal text-image retrieval often suffers from information asymmetry between texts and images. To address this challenge, we propose a Knowledge-aware Text-Image Retrieval (KTIR) method for remote sensing images. By mining relevant information from an external knowledge graph, KTIR enriches the text scope available in the search query and alleviates the information gaps between texts and images for better matching. Moreover, by integrating domain-specific knowledge, KTIR also enhances the adaptation of pre-trained vision-language models to remote sensing applications. Experimental results on three commonly used remote sensing text-image retrieval benchmarks show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.
https://arxiv.org/abs/2405.03373
The design of the query is crucial for the performance of DETR and its variants. Each query consists of two components: a content part and a positional one. Traditionally, the content query is initialized with a zero or learnable embedding, lacking essential content information and resulting in sub-optimal performance. In this paper, we introduce a novel plug-and-play module, Self-Adaptive Content Query (SACQ), to address this limitation. The SACQ module utilizes features from the transformer encoder to generate content queries via self-attention pooling. This allows candidate queries to adapt to the input image, resulting in a more comprehensive content prior and better focus on target objects. However, this improved concentration poses a challenge for the training process that utilizes the Hungarian matching, which selects only a single candidate and suppresses other similar ones. To overcome this, we propose a query aggregation strategy to cooperate with SACQ. It merges similar predicted candidates from different queries, easing the optimization. Our extensive experiments on the COCO dataset demonstrate the effectiveness of our proposed approaches across six different DETR's variants with multiple configurations, achieving an average improvement of over 1.0 AP.
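The Hungarian matching that makes query aggregation necessary can be illustrated with a tiny sketch (brute force over permutations, fine for toy sizes; real DETR implementations use `scipy.optimize.linear_sum_assignment`): each predicted query is assigned to exactly one ground-truth object, so similar runner-up queries get suppressed.

```python
from itertools import permutations

def hungarian_match(cost):
    """One-to-one assignment minimizing total cost (brute force; only
    suitable for the small query/target counts of a toy example)."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best

# rows = predicted queries, cols = ground-truth objects
cost = [[0.1, 0.9, 0.8],
        [0.7, 0.2, 0.9],
        [0.8, 0.6, 0.3]]
assign, total = hungarian_match(cost)
```

In this one-to-one regime, two near-duplicate content queries compete for the same target; merging them before matching, as the proposed query aggregation does, removes that competition.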
https://arxiv.org/abs/2405.03318
The estimation of relative motion between spacecraft increasingly relies on feature-matching computer vision, which feeds data into a recursive filtering algorithm. Kalman filters, although efficient in noise compensation, demand extensive tuning of system and noise models. This paper introduces FlexKalmanNet, a novel modular framework that bridges this gap by integrating a deep fully connected neural network with Kalman filter-based motion estimation algorithms. FlexKalmanNet's core innovation is its ability to learn any Kalman filter parameter directly from measurement data, coupled with the flexibility to utilize various Kalman filter variants. This is achieved through a notable design decision to outsource the sequential computation from the neural network to the Kalman filter variant, enabling a purely feedforward neural network architecture. This architecture, proficient at handling complex, nonlinear features without the dependency on recurrent network modules, captures global data patterns more effectively. Empirical evaluation using data from NASA's Astrobee simulation environment focuses on learning unknown parameters of an Extended Kalman filter for spacecraft pose and twist estimation. The results demonstrate FlexKalmanNet's rapid training convergence, high accuracy, and superior performance against manually tuned Extended Kalman filters.
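The split between feedforward learning and recursive filtering can be sketched on a scalar example (a toy: identity dynamics, and a stand-in lambda where FlexKalmanNet would place its trained network; none of this is the paper's actual model):

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter with process
    noise q and measurement noise r."""
    # predict (identity dynamics for the toy case)
    x_pred, p_pred = x, p + q
    # update
    k = p_pred / (p_pred + r)        # Kalman gain
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new

# stand-in for the learned network: it maps measurements to (q, r);
# the framework's point is that such parameters are learned from data
# instead of hand-tuned
learned_params = lambda z: (0.01, 0.1)

x, p = 0.0, 1.0
for z in [1.0, 1.0, 1.0]:
    q, r = learned_params(z)
    x, p = kalman_step(x, p, z, q, r)
```

Note the network is called outside the recursion: the sequential computation lives entirely in `kalman_step`, which is the design decision that lets the network stay purely feedforward.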
https://arxiv.org/abs/2405.03034
The score matching with Langevin dynamics (SMLD) method has been successfully applied to accelerated MRI. However, the hyperparameters in the sampling process require careful tuning; otherwise the results can be severely corrupted by hallucination artifacts, particularly with out-of-distribution test data. In this study, we propose a novel workflow in which SMLD results are regarded as additional priors to guide model-driven network training. First, we adopted a pretrained score network to obtain samples as preliminary guidance images (PGI) without the need for network retraining, parameter tuning, or in-distribution test data. Although PGIs are corrupted by hallucination artifacts, we believe that they can provide extra information through effective denoising steps to facilitate reconstruction. Therefore, we designed a denoising module (DM) in the second step to improve the quality of PGIs. The features are extracted from the components of Langevin dynamics and the same score network with fine-tuning; hence, we can directly learn the artifact patterns. Third, we designed a model-driven network whose training is guided by denoised PGIs (DGIs). DGIs are densely connected with intermediate reconstructions in each cascade to enrich the features and are periodically updated to provide more accurate guidance. Our experiments on different sequences revealed that despite the low average quality of PGIs, the proposed workflow can effectively extract valuable information to guide the network training, even with severely reduced training data and sampling steps. Our method outperforms other cutting-edge techniques by effectively mitigating hallucination artifacts, yielding robust and high-quality reconstruction results.
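A single Langevin dynamics sampling step has a simple form, sketched here on a 1-D toy target (a standard Gaussian, whose score is -x); the step size and iteration count are arbitrary toy values, and this is not the MRI sampler itself:

```python
import math
import random

def langevin_step(x, score, step_size, rng):
    """One Langevin update: drift along the score plus injected noise."""
    noise = rng.gauss(0.0, 1.0)
    return x + 0.5 * step_size * score(x) + math.sqrt(step_size) * noise

# toy score of a standard Gaussian target: grad log p(x) = -x
score = lambda x: -x
rng = random.Random(0)
x = 5.0                              # start far from the mode
for _ in range(2000):
    x = langevin_step(x, score, 0.1, rng)
```

The sensitivity the abstract describes is visible here: the step size trades off mixing speed against discretization error, which is exactly the kind of hyperparameter whose mis-tuning corrupts SMLD reconstructions.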
https://arxiv.org/abs/2405.02958
This work presents a drone detector with a modified backbone and a multiple pyramid feature maps enhancement structure (MDDPE). Novel feature map improvement modules that use different levels of information to produce more robust and discriminative features are proposed. These modules include the feature maps supplement function and the feature maps recombination enhancement function. To effectively handle the drone characteristics, auxiliary supervisions implemented in the early stages by employing tailored anchors are utilized. To further improve the modeling of real drone detection scenarios and the initialization of the regressor, an updated anchor matching technique is introduced to match anchors and ground-truth drones as closely as feasible. To show the proposed MDDPE's superiority over the most advanced detectors, extensive experiments are carried out using well-known drone detection benchmarks.
https://arxiv.org/abs/2405.02882
This paper aims to create a deep learning framework that can estimate the deformation vector field (DVF) for directly registering abdominal MRI-CT images. The proposed method assumes a diffeomorphic deformation. By using topology-preserved deformation features extracted from the probabilistic diffeomorphic registration model, abdominal motion can be accurately obtained and utilized for DVF estimation. The model integrated Swin transformers, which have demonstrated superior performance in motion tracking, into the convolutional neural network (CNN) for deformation feature extraction. The model was optimized using a cross-modality image similarity loss and a surface matching loss. To compute the image loss, a modality-independent neighborhood descriptor (MIND) was used between the deformed MRI and CT images. The surface matching loss was determined by measuring the distance between the warped coordinates of the surfaces of contoured structures on the MRI and CT images. The deformed MRI image was assessed against the CT image using the target registration error (TRE), Dice similarity coefficient (DSC), and mean surface distance (MSD) between the deformed contours of the MRI image and manual contours of the CT image. When compared to only rigid registration, DIR with the proposed method resulted in an increase of the mean DSC values of the liver and portal vein from 0.850 and 0.628 to 0.903 and 0.763, a decrease of the mean MSD of the liver from 7.216 mm to 3.232 mm, and a decrease of the TRE from 26.238 mm to 8.492 mm. The proposed deformable image registration method based on a diffeomorphic transformer provides an effective and efficient way to generate an accurate DVF from an MRI-CT image pair of the abdomen. It could be utilized in the current treatment planning workflow for liver radiotherapy.
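A surface matching loss of the kind described can be sketched as a symmetric nearest-neighbour (Chamfer-style) distance between the two contour point sets (a 2-D toy simplification; the paper's loss operates on warped 3-D surface coordinates):

```python
def nearest_dist(p, pts):
    """Euclidean distance from point p to its nearest neighbour in pts."""
    return min(((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5 for q in pts)

def chamfer(a_pts, b_pts):
    """Symmetric mean nearest-neighbour distance between two point sets -
    a simple stand-in for a surface matching loss."""
    d_ab = sum(nearest_dist(p, b_pts) for p in a_pts) / len(a_pts)
    d_ba = sum(nearest_dist(q, a_pts) for q in b_pts) / len(b_pts)
    return 0.5 * (d_ab + d_ba)

mri_surface = [(0.0, 0.0), (1.0, 0.0)]   # warped MRI contour points
ct_surface = [(0.0, 1.0), (1.0, 1.0)]    # CT contour points
loss = chamfer(mri_surface, ct_surface)  # every point is 1.0 away
```

Minimizing such a term pulls the warped MRI contours onto the CT contours, complementing the intensity-based MIND loss where cross-modality intensities are uninformative.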
https://arxiv.org/abs/2405.02692
Recent few-shot action recognition (FSAR) methods achieve promising performance by performing semantic matching on learned discriminative features. However, most FSAR methods focus on single-scale (e.g., frame-level, segment-level, etc.) feature alignment, which ignores that human actions with the same semantics may appear at different velocities. To this end, we develop a novel Multi-Velocity Progressive-alignment (MVP-Shot) framework to progressively learn and align semantic-related action features at multi-velocity levels. Concretely, a Multi-Velocity Feature Alignment (MVFA) module is designed to measure the similarity between features from support and query videos with different velocity scales and then merge all similarity scores in a residual fashion. To avoid the multiple velocity features deviating from the underlying motion semantics, our proposed Progressive Semantic-Tailored Interaction (PSTI) module injects velocity-tailored text information into the video feature via feature interaction on channel and temporal domains at different velocities. The above two modules compensate for each other to predict query categories more accurately under the few-shot settings. Experimental results show our method outperforms current state-of-the-art methods on multiple standard few-shot benchmarks (i.e., HMDB51, UCF101, Kinetics, and SSv2-small).
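Multi-velocity comparison can be illustrated with a 1-D toy (average-pooling a frame sequence emulates a slower "velocity" view; the averaged merge below is a simplification of the paper's residual merging, and all values are illustrative):

```python
def pool(seq, stride):
    """Temporal pooling: average consecutive frames to emulate a slower
    velocity view of the clip."""
    return [sum(seq[i:i + stride]) / stride
            for i in range(0, len(seq) - stride + 1, stride)]

def multi_velocity_similarity(query, support, strides=(1, 2)):
    """Compare clips at several velocity scales and merge the per-scale
    cosine similarities (simple average here)."""
    score = 0.0
    for s in strides:
        q, k = pool(query, s), pool(support, s)
        dot = sum(a * b for a, b in zip(q, k))
        nq = sum(a * a for a in q) ** 0.5
        nk = sum(a * a for a in k) ** 0.5
        score += dot / (nq * nk + 1e-12)
    return score / len(strides)

query = [1.0, 2.0, 3.0, 4.0]
fast = [2.0, 4.0, 6.0, 8.0]      # same action profile, doubled magnitude
sim = multi_velocity_similarity(query, fast)
```

Matching at several temporal scales is what lets semantically identical actions performed at different speeds still align well.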
https://arxiv.org/abs/2405.02077
With the wide application of knowledge distillation between an ImageNet pre-trained teacher model and a learnable student model, industrial anomaly detection has witnessed a significant achievement in the past few years. The success of knowledge distillation mainly relies on how to keep the feature discrepancy between the teacher and student model, in which it assumes that: (1) the teacher model can jointly represent two different distributions for the normal and abnormal patterns, while (2) the student model can only reconstruct the normal distribution. However, it still remains a challenging issue to maintain these ideal assumptions in practice. In this paper, we propose a simple yet effective two-stage industrial anomaly detection framework, termed as AAND, which sequentially performs Anomaly Amplification and Normality Distillation to obtain robust feature discrepancy. In the first anomaly amplification stage, we propose a novel Residual Anomaly Amplification (RAA) module to advance the pre-trained teacher encoder. With the exposure of synthetic anomalies, it amplifies anomalies via residual generation while maintaining the integrity of pre-trained model. It mainly comprises a Matching-guided Residual Gate and an Attribute-scaling Residual Generator, which can determine the residuals' proportion and characteristic, respectively. In the second normality distillation stage, we further employ a reverse distillation paradigm to train a student decoder, in which a novel Hard Knowledge Distillation (HKD) loss is built to better facilitate the reconstruction of normal patterns. Comprehensive experiments on the MvTecAD, VisA, and MvTec3D-RGB datasets show that our method achieves state-of-the-art performance.
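The feature-discrepancy scoring this line of work relies on can be sketched in a few lines (a generic sketch of teacher-student anomaly scoring, not AAND's modules; 1 - cosine similarity per spatial location is the standard choice):

```python
def anomaly_map(teacher_feats, student_feats):
    """Per-location anomaly score: 1 - cosine similarity between teacher
    and student features; high where the student fails to reconstruct."""
    scores = []
    for t, s in zip(teacher_feats, student_feats):
        dot = sum(a * b for a, b in zip(t, s))
        nt = sum(a * a for a in t) ** 0.5
        ns = sum(a * a for a in s) ** 0.5
        scores.append(1.0 - dot / (nt * ns + 1e-12))
    return scores

teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[1.0, 0.0], [1.0, 0.0]]   # fails to reconstruct location 1
scores = anomaly_map(teacher, student)
```

The two assumptions the abstract names map directly onto this score: it stays near zero only where the student reproduces the teacher (normal patterns), and spikes where it cannot (anomalies), which is why amplifying the teacher's anomaly response and hardening the student's normality distillation both widen the gap.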
https://arxiv.org/abs/2405.02068
Multi-agent systems (MAS) need to adaptively cope with dynamic environments, changing agent populations, and diverse tasks. However, most multi-agent systems cannot easily handle these challenges, due to the complexity of the state and task space. Social impact theory regards the complex influencing factors as forces acting on an agent, emanating from the environment, other agents, and the agent's intrinsic motivation, collectively referred to as social forces. Inspired by this concept, we propose a novel gradient-based state representation for multi-agent reinforcement learning. To non-trivially model the social forces, we further introduce a data-driven method, where we employ denoising score matching to learn the social gradient fields (SocialGFs) from offline samples, e.g., the attractive or repulsive outcomes of each force. During interactions, the agents take actions based on the multi-dimensional gradients to maximize their own rewards. In practice, we integrate SocialGFs into widely used multi-agent reinforcement learning algorithms, e.g., MAPPO. The empirical results reveal that SocialGFs offer four advantages for multi-agent systems: 1) they can be learned without requiring online interaction, 2) they demonstrate transferability across diverse tasks, 3) they facilitate credit assignment in challenging reward settings, and 4) they are scalable with the increasing number of agents.
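The denoising score matching objective used to learn such gradient fields can be sketched on a 1-D toy (Monte-Carlo estimate with a single Gaussian noise level; real SocialGFs are learned over states, and the data and noise scale here are illustrative): perturb each sample and regress the model's score onto the score of the perturbation kernel.

```python
import random

def dsm_loss(score_fn, data, sigma, rng, n_samples=1000):
    """Monte-Carlo denoising score matching loss: perturb each point with
    Gaussian noise and regress the model score onto -noise / sigma**2."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.choice(data)
        eps = rng.gauss(0.0, sigma)
        x_tilde = x + eps
        target = -eps / sigma ** 2   # score of the perturbation kernel
        total += (score_fn(x_tilde) - target) ** 2
    return total / n_samples

# data concentrated at 0; the optimal score for sigma = 1 is s(x) = -x
rng = random.Random(1)
data = [0.0]
good = dsm_loss(lambda x: -x, data, 1.0, rng)
bad = dsm_loss(lambda x: +x, data, 1.0, rng)
```

The key property, visible in the toy, is that the objective needs only offline samples: no environment interaction is required, which is advantage 1) in the abstract.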
https://arxiv.org/abs/2405.01839
Recent works in dataset distillation seek to minimize training expenses by generating a condensed synthetic dataset that encapsulates the information present in a larger real dataset. These approaches ultimately aim to attain test accuracy levels akin to those achieved by models trained on the entirety of the original dataset. Previous studies in feature and distribution matching have achieved significant results without incurring the costs of bi-level optimization in the distillation process. Despite their convincing efficiency, many of these methods suffer from marginal downstream performance improvements, limited distillation of contextual information, and subpar cross-architecture generalization. To address these challenges in dataset distillation, we propose the ATtentiOn Mixer (ATOM) module to efficiently distill large datasets using a mixture of channel and spatial-wise attention in the feature matching process. Spatial-wise attention helps guide the learning process based on consistent localization of classes in their respective images, allowing for distillation from a broader receptive field. Meanwhile, channel-wise attention captures the contextual information associated with the class itself, thus making the synthetic image more informative for training. By integrating both types of attention, our ATOM module demonstrates superior performance across various computer vision datasets, including CIFAR10/100 and TinyImagenet. Notably, our method significantly improves performance in scenarios with a low number of images per class, thereby enhancing its potential. Furthermore, the improvement holds across architectures and in applications such as neural architecture search.
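The channel/spatial attention mixture can be illustrated on a tiny C x H x W feature map (a generic sketch using softmax-over-means attention, not ATOM's learned modules; the feature values are illustrative):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention_mix(fmap):
    """Reweight a C x H x W feature map with channel attention (softmax
    over per-channel means) and spatial attention (softmax over
    per-pixel means across channels)."""
    C, H, W = len(fmap), len(fmap[0]), len(fmap[0][0])
    ch_scores = softmax([sum(sum(row) for row in ch) / (H * W) for ch in fmap])
    sp_means = [[sum(fmap[c][i][j] for c in range(C)) / C for j in range(W)]
                for i in range(H)]
    flat = softmax([sp_means[i][j] for i in range(H) for j in range(W)])
    sp_scores = [[flat[i * W + j] for j in range(W)] for i in range(H)]
    return [[[fmap[c][i][j] * ch_scores[c] * sp_scores[i][j]
              for j in range(W)] for i in range(H)] for c in range(C)]

fmap = [[[1.0, 0.0], [0.0, 0.0]],   # channel 0: active at pixel (0, 0)
        [[0.0, 0.0], [0.0, 2.0]]]   # channel 1: stronger, at (1, 1)
out = attention_mix(fmap)
```

Channel attention emphasizes class-relevant channels while spatial attention emphasizes consistently localized regions; multiplying both, as in the toy, concentrates the matching signal where the two agree.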
https://arxiv.org/abs/2405.01373
Latent fingerprint matching is a daunting task, primarily due to the poor quality of latent fingerprints. In this study, we propose a deep-learning based dense minutia descriptor (DMD) for latent fingerprint matching. A DMD is obtained by extracting the fingerprint patch aligned by its central minutia, capturing detailed minutia information and texture information. Our dense descriptor takes the form of a three-dimensional representation, with two dimensions associated with the original image plane and the other dimension representing the abstract features. Additionally, the extraction process outputs the fingerprint segmentation map, ensuring that the descriptor is only valid in the foreground region. The matching between two descriptors occurs in their overlapping regions, with a score normalization strategy to reduce the impact brought by the differences outside the valid area. Our descriptor achieves state-of-the-art performance on several latent fingerprint datasets. Overall, our DMD is more representative and interpretable compared to previous methods.
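Matching over overlapping foreground regions with score normalization can be sketched as follows (a 1-D toy over flattened descriptor cells; the dot-product similarity and square-root coverage penalty are illustrative assumptions, not the paper's exact normalization):

```python
def descriptor_score(desc_a, valid_a, desc_b, valid_b):
    """Compare two dense descriptors only where both foreground masks
    overlap; normalize by the overlap size so the score is not dominated
    by differences outside the valid area."""
    sims, overlap = 0.0, 0
    for a, va, b, vb in zip(desc_a, valid_a, desc_b, valid_b):
        if va and vb:
            overlap += 1
            sims += a * b                # per-cell similarity
    if overlap == 0:
        return 0.0
    raw = sims / overlap                 # mean similarity over the overlap
    coverage = overlap / len(desc_a)     # penalize tiny overlaps
    return raw * coverage ** 0.5

desc_a = [1.0, 1.0, 1.0, 1.0]
mask_a = [1, 1, 1, 0]
desc_b = [1.0, 1.0, 0.0, 1.0]
mask_b = [1, 1, 0, 1]
score = descriptor_score(desc_a, mask_a, desc_b, mask_b)
```

Averaging only over the overlap keeps background mismatch out of the score, while the coverage term keeps a two-cell perfect overlap from outscoring a broad, slightly noisier one.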
https://arxiv.org/abs/2405.01199
Molecular dynamics (MD) is a crucial technique for simulating biological systems, enabling the exploration of their dynamic nature and fostering an understanding of their functions and properties. To address exploration inefficiency, emerging enhanced sampling approaches like coarse-graining (CG) and generative models have been employed. In this work, we propose a \underline{Frame-to-Frame} generative model with guided \underline{Flow}-matching (F$3$low) for enhanced sampling, which (a) extends the domain of CG modeling to the SE(3) Riemannian manifold; (b) treats CGMD simulations as autoregressive sampling guided by the previous frame via flow-matching models; (c) targets the protein backbone, offering improved insights into secondary structure formation and intricate folding pathways. Compared to previous methods, F$3$low allows for broader exploration of conformational space. The ability to rapidly generate diverse conformations via a force-free generative paradigm on SE(3) paves the way toward efficient enhanced sampling methods.
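The flow-matching training target can be sketched on a 1-D toy (conditional flow matching with straight-line interpolation paths in Euclidean space; F$3$low works on SE(3), so this is only the scalar analogue): along the path $x_t = (1-t)x_0 + t x_1$ the target velocity is $x_1 - x_0$, and the model's velocity field is regressed onto it.

```python
import random

def cfm_loss(v_model, x0, x1, rng, n=500):
    """Conditional flow matching: sample t uniformly, form the
    interpolant x_t, and regress the model velocity onto x1 - x0."""
    total = 0.0
    for _ in range(n):
        t = rng.random()
        x_t = (1 - t) * x0 + t * x1
        target = x1 - x0                 # constant along a straight path
        total += (v_model(x_t, t) - target) ** 2
    return total / n

x0, x1 = 0.0, 2.0                        # prior sample -> data sample
rng = random.Random(0)
perfect = cfm_loss(lambda x, t: 2.0, x0, x1, rng)  # matches x1 - x0
wrong = cfm_loss(lambda x, t: 0.0, x0, x1, rng)
```

Because the objective is a simple regression with no force evaluation, sampling a new frame is just an ODE integration of the learned velocity field, which is the "force-free" speed advantage the abstract refers to.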
https://arxiv.org/abs/2405.00751
Table detection within document images is a crucial task in document processing, involving the identification and localization of tables. Recent strides in deep learning have substantially improved the accuracy of this task, but it still heavily relies on large labeled datasets for effective training. Several semi-supervised approaches have emerged to overcome this challenge, often employing CNN-based detectors with anchor proposals and post-processing techniques like non-maximal suppression (NMS). However, recent advancements in the field have shifted the focus towards transformer-based techniques, eliminating the need for NMS and emphasizing object queries and attention mechanisms. Previous research has focused on two key areas to improve transformer-based detectors: refining the quality of object queries and optimizing attention mechanisms. However, increasing object queries can introduce redundancy, while adjustments to the attention mechanism can increase complexity. To address these challenges, we introduce a semi-supervised approach employing SAM-DETR, a novel approach for precise alignment between object queries and target features. Our approach demonstrates remarkable reductions in false positives and substantial enhancements in table detection performance, particularly in complex documents characterized by diverse table structures. This work provides more efficient and accurate table detection in semi-supervised settings.
https://arxiv.org/abs/2405.00187
Unsupervised object discovery is becoming an essential line of research for tackling recognition problems that require decomposing an image into entities, such as semantic segmentation and object detection. Recently, object-centric methods that leverage self-supervision have gained popularity, due to their simplicity and adaptability to different settings and conditions. However, those methods do not exploit effective techniques already employed in modern self-supervised approaches. In this work, we consider an object-centric approach in which DINO ViT features are reconstructed via a set of queried representations called slots. Based on that, we propose a masking scheme on input features that selectively disregards the background regions, inducing our model to focus more on salient objects during the reconstruction phase. Moreover, we extend the slot attention to a multi-query approach, allowing the model to learn multiple sets of slots, producing more stable masks. During training, these multiple sets of slots are learned independently while, at test time, these sets are merged through Hungarian matching to obtain the final slots. Our experimental results and ablations on the PASCAL-VOC 2012 dataset show the importance of each component and highlight how their combination consistently improves object localization. Our source code is available at: this https URL
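The test-time merging of multiple slot sets can be sketched as follows (a greedy most-similar pairing plus averaging; the paper uses Hungarian matching, so this is a simplification that coincides with it only in easy cases like the toy below, and the slot values are illustrative):

```python
def merge_slot_sets(set_a, set_b):
    """Pair each slot in set_a with its most similar unused slot in set_b
    (by dot product) and average the pair - a greedy simplification of
    Hungarian merging for two small slot sets."""
    used = set()
    merged = []
    for a in set_a:
        best_j, best_sim = None, float("-inf")
        for j, b in enumerate(set_b):
            if j in used:
                continue
            sim = sum(x * y for x, y in zip(a, b))
            if sim > best_sim:
                best_j, best_sim = j, sim
        used.add(best_j)
        merged.append([(x + y) / 2 for x, y in zip(a, set_b[best_j])])
    return merged

set_a = [[1.0, 0.0], [0.0, 1.0]]
set_b = [[0.0, 0.8], [0.9, 0.1]]   # same slots, learned in a different order
merged = merge_slot_sets(set_a, set_b)
```

Since each query set learns its slots independently, slot order is arbitrary; matching before averaging aligns corresponding slots across sets, which is what stabilizes the final masks.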
https://arxiv.org/abs/2404.19654