Sports analytics benefits from recent advances in machine learning that provide a competitive advantage for teams or individuals. One important task in this context is measuring the performance of individual players in order to provide reports and log files for subsequent analysis. In sports such as basketball, this involves re-identifying players during a match, either across multiple camera viewpoints or from a single camera viewpoint at different times. In this work, we investigate whether it is possible to transfer the outstanding zero-shot performance of pre-trained CLIP models to the domain of player re-identification. For this purpose, we reformulate the contrastive language-to-image pre-training approach of CLIP as a contrastive image-to-image training approach using the InfoNCE loss as the training objective. Unlike previous work, our approach is entirely class-agnostic and benefits from large-scale pre-training. With a fine-tuned CLIP ViT-L/14 model, we achieve 98.44% mAP on the MMSports 2022 Player Re-Identification challenge. Furthermore, we show that CLIP Vision Transformers already have strong OCR capabilities for identifying useful player features such as shirt numbers in a zero-shot manner, without any fine-tuning on the dataset. By applying the Score-CAM algorithm, we visualise the most important image regions that our fine-tuned model identifies when calculating the similarity score between two images of a player.
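The image-to-image reformulation boils down to the standard symmetric InfoNCE objective with the vision encoder used for both sides of the pair. A minimal PyTorch sketch, assuming a batch where row i of each view shows the same player (the temperature value and function names are illustrative assumptions, not taken from the paper):

```python
import torch
import torch.nn.functional as F

def info_nce(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """emb_a, emb_b: (N, D) embeddings of two views; row i of each shows the same player."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature        # (N, N) scaled cosine similarities
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    # Symmetric cross-entropy: each view must retrieve its counterpart on the diagonal.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```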
https://arxiv.org/abs/2303.11855
Video-based person re-identification (video re-ID) has lately attracted growing attention due to its broad practical applications in areas such as surveillance, smart cities, and public safety. Nevertheless, video re-ID remains quite difficult and is an ongoing research area owing to numerous challenges such as viewpoint changes, occlusion, pose variation, and uncertain video sequences. In the last couple of years, deep learning for video re-ID has continuously achieved surprising results on public datasets, with various approaches being developed to handle its diverse problems. Compared to image-based re-ID, video re-ID is much more challenging and complex. To encourage future research, this paper presents the first comprehensive review of up-to-date advancements in deep learning approaches for video re-ID. It broadly covers three important aspects: existing video re-ID methods and their limitations, major milestones with their technical challenges, and architectural design. It also offers a comparative performance analysis on the available datasets, practical guidance for improving video re-ID, and promising research directions.
https://arxiv.org/abs/2303.11332
Occluded person re-identification (Re-ID) aims to address the occlusion problem when matching occluded or holistic pedestrians across different camera views. Many methods use the background as artificial occlusion and rely on attention networks to exclude noisy interference. However, the significant discrepancy between simple background occlusion and realistic occlusion can negatively impact the generalization of the model. To address this issue, we propose a novel transformer-based Attention Disturbance and Dual-Path Constraint Network (ADP) to enhance the generalization of attention networks. Firstly, to imitate real-world obstacles, we introduce an Attention Disturbance Mask (ADM) module that generates offensive noise which, as a more complex form of occlusion, can distract attention like a realistic occluder. Secondly, to fully exploit these complex occluded images, we develop a Dual-Path Constraint Module (DPC) that can obtain preferable supervision information from holistic images through dual-path interaction. With our proposed method, the network can effectively circumvent a wide variety of occlusions using a basic ViT baseline. Comprehensive experimental evaluations conducted on person re-ID benchmarks demonstrate the superiority of ADP over state-of-the-art methods.
https://arxiv.org/abs/2303.10976
Pose transfer, which aims to transfer a given person into a specified posture, has recently attracted considerable attention. A typical pose transfer framework usually employs representative datasets to train a discriminative model, which is often violated by out-of-distribution (OOD) instances. Recently, test-time adaptation (TTA) has offered a feasible solution for OOD data by using a pre-trained model that learns essential features with self-supervision. However, those methods implicitly assume that all test distributions share a unified signal that can be learned directly. In open-world conditions, the pose transfer task raises various independent signals, OOD appearance and skeleton, which need to be extracted and handled separately. To address this point, we develop SEquential Test-time Adaption (SETA). In the test-time phase, SETA extracts and distributes external appearance textures by augmenting OOD data for self-supervised training. To make the non-Euclidean similarity among different postures explicit, SETA uses image representations derived from a person re-identification (Re-ID) model for similarity computation. By sequentially addressing implicit posture representation at test time, SETA greatly improves the generalization performance of current pose transfer models. In our experiments, we first show that pose transfer can be applied to open-world applications, including TikTok reenactment and celebrity motion synthesis.
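As context for the TTA idea above, a generic test-time adaptation loop in miniature: augment the incoming OOD sample and fine-tune with a self-supervised consistency objective before prediction. This is a hedged, generic sketch of TTA, not SETA's actual procedure; the consistency objective, step count, and learning rate are assumptions.

```python
import torch
import torch.nn.functional as F

def test_time_adapt(model, x, augment, steps: int = 5, lr: float = 1e-4):
    """x: a batch of OOD test inputs; augment: a stochastic augmentation callable."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    model.train()
    for _ in range(steps):
        v1, v2 = augment(x), augment(x)            # two stochastic views of the same input
        loss = F.mse_loss(model(v1), model(v2))    # consistency as a stand-in self-supervised loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    model.eval()
    return model
```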
https://arxiv.org/abs/2303.10945
This report addresses the technical aspects of de-identification of medical images of human subjects and biospecimens, such that re-identification risk of ethical, moral, and legal concern is sufficiently reduced to allow unrestricted public sharing for any purpose, regardless of the jurisdiction of the source and distribution sites. All medical images, regardless of the mode of acquisition, are considered, though the primary emphasis is on those with accompanying data elements, especially those encoded in formats in which the data elements are embedded, particularly Digital Imaging and Communications in Medicine (DICOM). These images include image-like objects such as Segmentations, Parametric Maps, and Radiotherapy (RT) Dose objects. The scope also includes related non-image objects, such as RT Structure Sets, Plans and Dose Volume Histograms, Structured Reports, and Presentation States. Only de-identification of publicly released data is considered, and alternative approaches to privacy preservation, such as federated learning for artificial intelligence (AI) model development, are out of scope, as are issues of privacy leakage from AI model sharing. Only technical issues of public sharing are addressed.
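For a concrete flavour of the tag-level scrubbing such de-identification involves, a minimal pydicom sketch follows; the tag list is illustrative and far from the complete confidentiality profile the report discusses.

```python
import pydicom

# Illustrative subset of identifying DICOM data elements; a real profile is much larger.
IDENTIFYING_TAGS = ["PatientName", "PatientID", "PatientBirthDate",
                    "ReferringPhysicianName", "InstitutionName", "AccessionNumber"]

def deidentify(path_in: str, path_out: str) -> None:
    ds = pydicom.dcmread(path_in)
    for tag in IDENTIFYING_TAGS:
        if tag in ds:
            ds.data_element(tag).value = ""   # blank rather than delete, preserving structure
    ds.remove_private_tags()                  # private tags frequently carry identifiers
    ds.save_as(path_out)
```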
https://arxiv.org/abs/2303.10473
Object tracking is divided into single-object tracking (SOT) and multi-object tracking (MOT). MOT aims to maintain the identities of multiple objects across a series of continuous video sequences. In recent years, MOT has made rapid progress. However, modeling the motion and appearance of objects in complex scenes still faces various challenges. In this paper, we design a novel direction-consistency method for smooth trajectory prediction (STP-DC) to strengthen the modeling of motion information and overcome the lack of robustness of previous methods in complex scenes. Existing methods use pedestrian re-identification (Re-ID) to model appearance; however, they extract a lot of background information, which lacks discriminability in occluded and crowded scenes. We propose a hyper-grain feature embedding network (HG-FEN) to enhance the modeling of appearance, thus generating robust appearance descriptors. We also propose other robustness techniques, including CF-ECM for storing robust appearance information and SK-AS for improving association accuracy. To achieve state-of-the-art performance in MOT, we propose a robust tracker named Rt-track, incorporating various tricks and techniques. It achieves 79.5 MOTA, 76.0 IDF1 and 62.1 HOTA on the test set of MOT17. Rt-track also achieves 77.9 MOTA, 78.4 IDF1 and 63.3 HOTA on MOT20, surpassing all published methods.
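The appearance-plus-motion association that such trackers perform can be illustrated with a generic cost-fusion step. This sketch blends an IoU-based motion cost with a cosine appearance cost before Hungarian matching; the fusion weight and function names are assumptions, not Rt-track's actual formulation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(iou_cost: np.ndarray, app_feats_trk: np.ndarray,
              app_feats_det: np.ndarray, w_app: float = 0.5):
    """iou_cost: (T, D) motion cost; app_feats_*: (T, F) / (D, F) appearance embeddings."""
    a = app_feats_trk / np.linalg.norm(app_feats_trk, axis=1, keepdims=True)
    b = app_feats_det / np.linalg.norm(app_feats_det, axis=1, keepdims=True)
    app_cost = 1.0 - a @ b.T                       # cosine distance between tracks and detections
    cost = (1 - w_app) * iou_cost + w_app * app_cost
    rows, cols = linear_sum_assignment(cost)       # minimum-cost bipartite matching
    return list(zip(rows, cols))
```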
https://arxiv.org/abs/2303.09668
Text-based person re-identification (ReID) aims to identify images of the targeted person from a large-scale person image database according to a given textual description. However, due to significant inter-modal gaps, text-based person ReID remains a challenging problem. Most existing methods rely heavily on the similarity contributed by matched word-region pairs, while neglecting mismatched word-region pairs, which may play a decisive role. Accordingly, we propose to mine false positive examples (MFPE) via a jointly optimized multi-branch architecture to handle this problem. MFPE contains three branches, including a false positive mining (FPM) branch that highlights the role of mismatched word-region pairs. In addition, MFPE carefully designs a cross-relu loss to increase the gap between the similarity scores of matched and mismatched word-region pairs. Extensive experiments on CUHK-PEDES demonstrate the superior effectiveness of MFPE. Our code is released at this https URL.
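The abstract does not spell out the cross-relu loss; one plausible hinge-style reading, sketched here as an assumption rather than the paper's exact definition, pushes matched word-region similarities above a positive margin and mismatched ones below a negative margin.

```python
import torch
import torch.nn.functional as F

def cross_relu_loss(sim_matched: torch.Tensor, sim_mismatched: torch.Tensor,
                    m_pos: float = 0.6, m_neg: float = 0.2) -> torch.Tensor:
    """sim_*: tensors of cosine similarities for matched / mismatched word-region pairs."""
    loss_pos = F.relu(m_pos - sim_matched).mean()      # penalise matched pairs with low similarity
    loss_neg = F.relu(sim_mismatched - m_neg).mean()   # penalise mismatched pairs with high similarity
    return loss_pos + loss_neg
```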
https://arxiv.org/abs/2303.08466
Neural Architecture Search (NAS) has become increasingly appealing to the object re-identification (ReID) community, since task-specific architectures significantly improve retrieval performance. Previous works explore new optimization targets and search spaces for NAS ReID, yet they neglect the difference in training schemes between image classification and ReID. In this work, we propose a novel Twins Contrastive Mechanism (TCM) to provide more appropriate supervision for ReID architecture search. TCM reduces the category overlap between the training and validation data, and assists NAS in simulating real-world ReID training schemes. We then design a Multi-Scale Interaction (MSI) search space to search for rational interaction operations between multi-scale features. In addition, we introduce a Spatial Alignment Module (SAM) to further enhance attention consistency when confronted with images from different sources. Under the proposed NAS scheme, a specific architecture, named MSINet, is automatically searched. Extensive experiments demonstrate that our method surpasses state-of-the-art ReID methods in both in-domain and cross-domain scenarios. Source code is available at this https URL.
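The category-overlap reduction TCM performs can be pictured as an identity-disjoint split, so that validation measures generalisation to unseen identities as in real ReID evaluation. A hedged sketch of that idea, with the split ratio as an assumption:

```python
import random

def disjoint_id_split(person_ids: list, val_ratio: float = 0.3, seed: int = 0):
    """Return (train_ids, val_ids) with no person identity shared between them."""
    ids = sorted(set(person_ids))
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * (1 - val_ratio))
    return set(ids[:cut]), set(ids[cut:])   # disjoint identity sets
```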
https://arxiv.org/abs/2303.07065
Person re-identification (re-ID) via 3D skeleton data is an emerging topic with prominent advantages. Existing methods usually design skeleton descriptors with raw body joints or perform skeleton sequence representation learning. However, they typically cannot concurrently model different body-component relations, and rarely explore useful semantics from fine-grained representations of body joints. In this paper, we propose a generic Transformer-based Skeleton Graph prototype contrastive learning (TranSG) approach with structure-trajectory prompted reconstruction to fully capture skeletal relations and valuable spatial-temporal semantics from skeleton graphs for person re-ID. Specifically, we first devise the Skeleton Graph Transformer (SGT) to simultaneously learn body and motion relations within skeleton graphs, so as to aggregate key correlative node features into graph representations. Then, we propose the Graph Prototype Contrastive learning (GPC) to mine the most typical graph features (graph prototypes) of each identity, and contrast the inherent similarity between graph representations and different prototypes from both skeleton and sequence levels to learn discriminative graph representations. Last, a graph Structure-Trajectory Prompted Reconstruction (STPR) mechanism is proposed to exploit the spatial and temporal contexts of graph nodes to prompt skeleton graph reconstruction, which facilitates capturing more valuable patterns and graph semantics for person re-ID. Empirical evaluations demonstrate that TranSG significantly outperforms existing state-of-the-art methods. We further show its generality under different graph modeling, RGB-estimated skeletons, and unsupervised scenarios.
https://arxiv.org/abs/2303.06819
Unsupervised Re-ID methods aim at learning robust and discriminative features from unlabeled data. However, existing methods often ignore the relationship between the module parameters of the Re-ID framework and the feature distributions, which may lead to feature misalignment and hinder model performance. To address this problem, we propose a dynamic clustering and cluster contrastive learning (DCCC) method. Specifically, we first design a dynamic clustering parameter scheduler (DCPS) which adjusts the clustering hyper-parameters to fit the variation of intra- and inter-class distances. Then, a dynamic cluster contrastive learning (DyCL) method is designed to match the weights of the cluster representation vectors with the local feature association. Finally, a label-smoothing soft contrastive loss ($L_{ss}$) is built to keep the balance between cluster contrastive learning and self-supervised learning at low computational cost. Experiments on several widely used public datasets validate the effectiveness of the proposed DCCC, which outperforms previous state-of-the-art methods by achieving the best performance.
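One plausible reading of the $L_{ss}$ term, sketched as an assumption rather than the paper's exact definition, is a cluster-level contrastive loss whose one-hot cluster targets are smoothed:

```python
import torch
import torch.nn.functional as F

def soft_cluster_contrastive(feats: torch.Tensor, cluster_ids: torch.Tensor,
                             centroids: torch.Tensor, tau: float = 0.05,
                             eps: float = 0.1) -> torch.Tensor:
    """feats: (N, D) L2-normalised; cluster_ids: (N,) long; centroids: (K, D) L2-normalised memory."""
    logits = feats @ centroids.t() / tau                 # (N, K) similarities to cluster centroids
    K = centroids.size(0)
    hard = F.one_hot(cluster_ids, K).float()
    soft = (1 - eps) * hard + eps / K                    # label smoothing over clusters
    return -(soft * F.log_softmax(logits, dim=1)).sum(1).mean()
```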
https://arxiv.org/abs/2303.06810
In recent years, self-supervised learning has attracted widespread academic attention and addressed many of the key issues in computer vision. The present research focus is on how to construct a good pretext task that allows improved network learning of high-level semantic information in images, so that model reasoning is accelerated during pre-training for the current task. To address the problems that existing feature extraction networks are pre-trained on the ImageNet dataset and cannot extract the fine-grained information in pedestrian images well, and that the existing pretext tasks of contrastive self-supervised learning may destroy the original properties of pedestrian images, this paper designs a mask-reconstruction pretext task to obtain a pre-trained model with strong robustness and uses it for the pedestrian re-identification task. The network is optimized by improving the centroid-based triplet loss, and the mask image is added as an additional sample to the loss calculation, so that the network can better cope with pedestrian matching in practical applications after training is completed. This method achieves about 5% higher mAP on Market1501 and CUHK03 than existing self-supervised pedestrian re-identification methods, and about 1% higher Rank-1 accuracy, and ablation experiments are conducted to demonstrate the feasibility of the method. Our model code is located at this https URL.
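A centroid-based triplet with the masked view folded in as an extra positive sample could look as follows; this is a hedged sketch of the idea, with the margin and function names as assumptions rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def centroid_triplet(anchor: torch.Tensor, pos_embs: torch.Tensor,
                     neg_centroids: torch.Tensor, margin: float = 0.3) -> torch.Tensor:
    """anchor: (D,); pos_embs: (P, D) same-ID embeddings incl. masked views;
    neg_centroids: (C, D) centroids of other identities."""
    a = F.normalize(anchor, dim=0)
    pos_c = F.normalize(pos_embs.mean(0), dim=0)             # class centroid incl. masked sample
    d_pos = 1 - torch.dot(a, pos_c)                          # distance to own centroid
    d_neg = (1 - a @ F.normalize(neg_centroids, dim=1).t()).min()  # hardest negative centroid
    return F.relu(d_pos - d_neg + margin)
```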
https://arxiv.org/abs/2303.06330
Data efficiency in robotic skill acquisition is crucial for operating robots in varied small-batch assembly settings. To operate in such environments, robots must have robust obstacle avoidance and versatile goal conditioning acquired from only a few simple demonstrations. Existing approaches, however, fall short of these requirements. Deep reinforcement learning (RL) enables a robot to learn complex manipulation tasks but is often limited to small task spaces in the real world due to sample inefficiency and safety concerns. Motion planning (MP) can generate collision-free paths in obstructed environments, but cannot solve complex manipulation tasks and requires goal states often specified by a user or object-specific pose estimator. In this work, we propose a system for efficient skill acquisition that leverages an object-centric generative model (OCGM) for versatile goal identification to specify a goal for MP combined with RL to solve complex manipulation tasks in obstructed environments. Specifically, OCGM enables one-shot target object identification and re-identification in new scenes, allowing MP to guide the robot to the target object while avoiding obstacles. This is combined with a skill transition network, which bridges the gap between terminal states of MP and feasible start states of a sample-efficient RL policy. The experiments demonstrate that our OCGM-based one-shot goal identification provides competitive accuracy to other baseline approaches and that our modular framework outperforms competitive baselines, including a state-of-the-art RL algorithm, by a significant margin for complex manipulation tasks in obstructed environments.
https://arxiv.org/abs/2303.03365
Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, person re-identification, etc.) play a key role in industrial applications of visual models. While specific human-centric tasks have their own relevant semantic aspect to focus on, they also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task.
https://arxiv.org/abs/2303.02936
Occluded person re-identification (Re-ID) is a challenging problem due to the destruction caused by occluders. Most existing methods focus on visible human body parts through some prior information. However, when complementary occlusions occur, features in occluded regions can interfere with matching, which affects performance severely. In this paper, unlike most previous works that discard the occluded region, we propose a Feature Completion Transformer (FCFormer) to implicitly complement the semantic information of occluded parts in the feature space. Specifically, Occlusion Instance Augmentation (OIA) is proposed to simulate real and diverse occlusion situations on the holistic image. These augmented images not only enrich the number of occlusion samples in the training set, but also form pairs with the holistic images. Subsequently, a dual-stream architecture with a shared encoder is proposed to learn paired discriminative features from these pairs of inputs. Without additional semantic information, an occluded-holistic feature sample-label pair can be created automatically. Then, a Feature Completion Decoder (FCD) is designed to complement the features of occluded regions by using learnable tokens to aggregate possible information from self-generated occluded features. Finally, we propose the Cross Hard Triplet (CHT) loss to further bridge the gap between complemented features and extracted features under the same ID. In addition, a Feature Completion Consistency (FC$^2$) loss is introduced to bring the generated completion feature distribution closer to the real holistic feature distribution. Extensive experiments on five challenging datasets demonstrate that the proposed FCFormer achieves superior performance and outperforms state-of-the-art methods by significant margins on occluded datasets.
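The pairing mechanism behind OIA can be illustrated with a simple paste-an-occluder transform; the patch size and placement policy below are illustrative assumptions, not the paper's exact augmentation.

```python
import random
from PIL import Image

def occlude(holistic: Image.Image, occluder: Image.Image,
            scale: float = 0.4) -> Image.Image:
    """Paste a resized occluder crop at a random location on a holistic image."""
    out = holistic.copy()
    w, h = out.size
    ow, oh = int(w * scale), int(h * scale)
    patch = occluder.resize((ow, oh))
    x, y = random.randint(0, w - ow), random.randint(0, h - oh)
    out.paste(patch, (x, y))
    return out  # (out, holistic) forms an occluded-holistic training pair
```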
https://arxiv.org/abs/2303.01656
To efficiently monitor the growth and evolution of a particular wildlife population, one of the main fundamental challenges in animal ecology is not only the re-identification of individuals that have been previously encountered but also the discrimination between known and unknown individuals (the so-called "open-set problem"), which is the first step to carry out before re-identification. In particular, in this work we are interested in this discrimination within digital photos of beluga whales, which are known to be among the most challenging marine species to discriminate due to their lack of distinctive features. To tackle this problem, we propose a novel approach based on the use of Membership Inference Attacks (MIAs), which are normally used to assess the privacy risks associated with releasing a particular machine learning model. More precisely, we demonstrate that the problem of discriminating between known and unknown individuals can be solved efficiently using state-of-the-art approaches for MIAs. Extensive experiments on three benchmark datasets related to whales, two different neural network architectures, and three MIAs clearly demonstrate the performance of the approach. In addition, we have designed a novel MIA strategy, which we coin ensemble MIA, that combines the outputs of different MIAs to increase attack accuracy while diminishing the false positive rate. Overall, one of our main objectives is also to show that research on privacy attacks can be leveraged "for good" by helping to address practical challenges encountered in animal ecology.
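The ensemble idea reduces, in its simplest form, to a vote over per-attack membership decisions; a hedged sketch of one reasonable instantiation (the majority rule and per-attack thresholds are assumptions):

```python
import numpy as np

def ensemble_mia(scores: np.ndarray, thresholds: np.ndarray) -> np.ndarray:
    """scores: (num_attacks, N) per-attack membership scores for N photos;
    thresholds: (num_attacks,) per-attack decision thresholds."""
    votes = scores >= thresholds[:, None]           # (num_attacks, N) boolean votes
    # Majority vote: flag "known individual" only when most attacks agree,
    # trading some recall for a lower false-positive rate.
    return votes.sum(axis=0) > scores.shape[0] // 2
```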
https://arxiv.org/abs/2302.14769
Person re-identification plays a key role in applications where a mobile robot needs to track its users over a long period of time, even if they are partially unobserved for some time, in order to follow them or be available on demand. In this context, deep-learning based real-time feature extraction on a mobile robot is often performed on special-purpose devices whose computational resources are shared for multiple tasks. Therefore, the inference speed has to be taken into account. In contrast, person re-identification is often improved by architectural changes that come at the cost of significantly slowing down inference. Attention blocks are one such example. We will show that some well-performing attention blocks used in the state of the art are subject to inference costs that are far too high to justify their use for mobile robotic applications. As a consequence, we propose an attention block that only slightly affects the inference speed while keeping up with much deeper networks or more complex attention blocks in terms of re-identification accuracy. We perform extensive neural architecture search to derive rules at which locations this attention block should be integrated into the architecture in order to achieve the best trade-off between speed and accuracy. Finally, we confirm that the best performing configuration on a re-identification benchmark also performs well on an indoor robotic dataset.
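For intuition about what a low-overhead attention block looks like, here is a minimal squeeze-and-excitation-style channel attention module in PyTorch. It is shown purely as an example of a cheap attention block of the kind such a speed-accuracy trade-off study considers, not the block this paper proposes.

```python
import torch
import torch.nn as nn

class CheapChannelAttention(nn.Module):
    """SE-style channel reweighting: two small linear layers on pooled features."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        w = self.fc(x.mean(dim=(2, 3)))                   # global average pool -> (B, C)
        return x * w[:, :, None, None]                    # reweight channels, negligible cost
```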
https://arxiv.org/abs/2302.14574
In the person re-identification (re-ID) task, it is still challenging to learn discriminative representations with deep learning due to limited data. Generally speaking, a model performs better as the amount of data increases. The addition of similar classes strengthens the ability of the classifier to identify similar identities, thereby improving the discrimination of the representations. In this paper, we propose a Diverse and Compact Transformer (DC-Former) that can achieve a similar effect by splitting the embedding space into multiple diverse and compact subspaces. A compact embedding subspace helps the model learn more robust and discriminative embeddings to identify similar classes, and the fusion of these diverse embeddings containing more fine-grained information can further improve the effect of re-ID. Specifically, multiple class tokens are used in a vision transformer to represent multiple embedding spaces. Then, a self-diverse constraint (SDC) is applied to these spaces to push them away from each other, which makes each embedding space diverse and compact. Further, a dynamic weight controller (DWC) is designed to balance their relative importance during training. The experimental results of our method are promising: it surpasses previous state-of-the-art methods on several commonly used person re-ID benchmarks.
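One plausible form of the self-diverse constraint, sketched here as an assumption rather than the paper's exact loss, penalises pairwise cosine similarity between the class tokens so that their embedding subspaces stay apart:

```python
import torch
import torch.nn.functional as F

def self_diverse_constraint(class_tokens: torch.Tensor) -> torch.Tensor:
    """class_tokens: (K, D) embeddings of the K class tokens (K > 1)."""
    t = F.normalize(class_tokens, dim=1)
    sim = t @ t.t()                                   # (K, K) pairwise cosine similarities
    off_diag = sim - torch.diag(torch.diag(sim))      # drop self-similarity on the diagonal
    return off_diag.abs().sum() / (t.size(0) * (t.size(0) - 1))  # mean off-diagonal penalty
```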
https://arxiv.org/abs/2302.14335
Recent advances in MRI have led to the creation of large datasets. With the increase in data volume, it has become difficult to locate previous scans of the same patient within these datasets (a process known as re-identification). To address this issue, we propose an AI-powered medical imaging retrieval framework called DeepBrainPrint, which is designed to retrieve brain MRI scans of the same patient. Our framework is a semi-self-supervised contrastive deep learning approach with three main innovations. First, we use a combination of self-supervised and supervised paradigms to create an effective brain fingerprint from MRI scans that can be used for real-time image retrieval. Second, we use a special weighting function to guide the training and improve model convergence. Third, we introduce new imaging transformations to improve retrieval robustness in the presence of intensity variations (i.e. different scan contrasts), and to account for age and disease progression in patients. We tested DeepBrainPrint on a large dataset of T1-weighted brain MRIs from the Alzheimer's Disease Neuroimaging Initiative (ADNI) and on a synthetic dataset designed to evaluate retrieval performance with different image modalities. Our results show that DeepBrainPrint outperforms previous methods, including simple similarity metrics and more advanced contrastive deep learning frameworks.
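The intensity-robustness transformations mentioned above can be as simple as random gamma and scale jitter; a hedged sketch under the assumption of images normalised to [0, 1] (the jitter ranges are illustrative, not DeepBrainPrint's actual transforms):

```python
import numpy as np

def intensity_jitter(img: np.ndarray, rng: np.random.Generator,
                     gamma=(0.7, 1.4), scale=(0.9, 1.1)) -> np.ndarray:
    """img: float array normalised to [0, 1]."""
    out = img ** rng.uniform(*gamma)      # gamma shift mimics a scan-contrast change
    out = out * rng.uniform(*scale)       # global intensity scaling
    return np.clip(out, 0.0, 1.0)
```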
https://arxiv.org/abs/2302.13057
Motion-based association for Multi-Object Tracking (MOT) has recently re-achieved prominence with the rise of powerful object detectors. Despite this, little work has been done to incorporate appearance cues beyond simple heuristic models that lack robustness to feature degradation. In this paper, we propose a novel way to leverage objects' appearances to adaptively integrate appearance matching into existing high-performance motion-based methods. Building upon the pure motion-based method OC-SORT, we achieve 1st place on MOT20 and 2nd place on MOT17 with 63.9 and 64.9 HOTA, respectively. We also achieve 61.3 HOTA on the challenging DanceTrack benchmark as a new state of the art, even compared to more heavily designed methods. The code and models are available at this https URL.
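The adaptive integration can be pictured as a per-detection weighting between motion and appearance costs: trust appearance more when the detection is confident, and fall back to motion otherwise. A hedged sketch (the linear confidence schedule and weight cap are assumptions, not the paper's exact formula):

```python
import numpy as np

def adaptive_cost(motion_cost: np.ndarray, app_cost: np.ndarray,
                  det_conf: np.ndarray, w_max: float = 0.75) -> np.ndarray:
    """motion_cost, app_cost: (T, D) costs; det_conf: (D,) detector confidences in [0, 1]."""
    w = w_max * det_conf[None, :]              # higher confidence -> more appearance weight
    return (1 - w) * motion_cost + w * app_cost
```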
https://arxiv.org/abs/2302.11813
Interest in automatic person re-identification systems has grown significantly in recent years, mainly driven by surveillance and smart shop software. Due to the variability in person posture, different lighting conditions, and occluded scenarios, together with the poor quality of the images obtained by different cameras, it is currently an unsolved problem. In machine-learning-based computer vision applications with small datasets, one way to improve the performance of a re-identification system is to augment the set of images or videos available for training the neural models. Currently, one of the most robust ways to generate synthetic information for data augmentation, whether video, images, or text, is the generative adversarial network. This article reviews the most relevant recent approaches to improving the performance of person re-identification models through data augmentation using generative adversarial networks. We focus on three categories of data augmentation approaches: style transfer, pose transfer, and random generation.
https://arxiv.org/abs/2302.09119