Recent advances in learning-based approaches have led to impressive dexterous manipulation capabilities. Yet, we haven't witnessed widespread adoption of these capabilities beyond the laboratory. This is likely due to practical limitations, such as significant computational burden, inscrutable policy architectures, sensitivity to parameter initializations, and the considerable technical expertise required for implementation. In this work, we investigate the utility of Koopman operator theory in alleviating these limitations. Koopman operators are simple yet powerful control-theoretic structures that help represent complex nonlinear dynamics as linear systems in higher-dimensional spaces. Motivated by the fact that complex nonlinear dynamics underlie dexterous manipulation, we develop an imitation learning framework that leverages Koopman operators to simultaneously learn the desired behavior of both robot and object states. We demonstrate that a Koopman operator-based framework is surprisingly effective for dexterous manipulation and offers a number of unique benefits. First, the learning process is analytical, eliminating the sensitivity to parameter initializations and painstaking hyperparameter optimization. Second, the learned reference dynamics can be combined with a task-agnostic tracking controller such that task changes and variations can be handled with ease. Third, a Koopman operator-based approach can perform comparably to state-of-the-art imitation learning algorithms in terms of task success rate and imitation error, while being an order of magnitude more computationally efficient. In addition, we discuss a number of avenues for future research made available by this work.
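The analytical learning step can be illustrated with a minimal EDMD-style sketch: lift the combined robot and object state with hand-chosen observables and solve for the Koopman matrix in closed form via least squares. The lifting functions and all names below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lift(x):
    """Example observables: the state, its squares, and a constant (illustrative choice)."""
    return np.concatenate([x, x**2, [1.0]])

def fit_koopman(trajectories):
    """Analytically fit K minimizing ||K z_t - z_{t+1}|| over all demonstrations."""
    Z_now, Z_next = [], []
    for traj in trajectories:                # traj: (T, state_dim) array of robot+object states
        lifted = np.array([lift(x) for x in traj])
        Z_now.append(lifted[:-1])
        Z_next.append(lifted[1:])
    Z_now, Z_next = np.vstack(Z_now), np.vstack(Z_next)
    # Closed-form least squares: solve Z_now @ K^T = Z_next
    K_T, *_ = np.linalg.lstsq(Z_now, Z_next, rcond=None)
    return K_T.T

def rollout(K, x0, steps):
    """Generate reference dynamics by iterating the lifted linear system."""
    z = lift(x0)
    states = [x0]
    for _ in range(steps):
        z = K @ z
        states.append(z[:len(x0)])           # the first entries of the lifting are the state itself
    return np.array(states)

# Toy usage with random "demonstrations"
demos = [np.cumsum(np.random.randn(50, 4) * 0.01, axis=0) for _ in range(5)]
K = fit_koopman(demos)
reference = rollout(K, demos[0][0], steps=49)
```

Because the fit is a single least-squares solve, no iterative gradient-based training, parameter initialization, or hyperparameter sweep is required; the rollout can then serve as the reference dynamics consumed by a task-agnostic tracking controller.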
https://arxiv.org/abs/2303.13446
This work presents a novel RGB-D-inertial dynamic SLAM method that can enable accurate localisation when the majority of the camera view is occluded by multiple dynamic objects over a long period of time. Most dynamic SLAM approaches either remove dynamic objects as outliers when they account for a minor proportion of the visual input, or detect dynamic objects using semantic segmentation before camera tracking. Therefore, dynamic objects that cause large occlusions are difficult to detect without prior information. The remaining visual information from the static background is also not enough to support localisation when large occlusion lasts for a long period. To overcome these problems, our framework presents a robust visual-inertial bundle adjustment that simultaneously tracks the camera, estimates cluster-wise dense segmentation of dynamic objects, and maintains a static sparse map by combining dense and sparse features. The experimental results demonstrate that our method achieves promising localisation and object segmentation performance compared to other state-of-the-art methods in the scenario of long-term large occlusion.
https://arxiv.org/abs/2303.13316
Recent advances in machine learning and computer vision are revolutionizing the field of animal behavior by enabling researchers to track the poses and locations of freely moving animals without any marker attachment. However, large datasets of annotated images of animals for markerless pose tracking, especially high-resolution images taken from multiple angles with accurate 3D annotations, are still scant. Here, we propose a method that uses a motion capture (mo-cap) system to obtain a large amount of annotated data on animal movement and posture (2D and 3D) in a semi-automatic manner. Our method is novel in that it extracts the 3D positions of morphological keypoints (e.g., eyes, beak, tail) in reference to the positions of markers attached to the animals. Using this method, we obtained, and offer here, a new dataset, 3D-POP, with approximately 300k annotated frames (4 million instances) in the form of videos of groups of one to ten freely moving birds captured from 4 different camera views in a 3.6m x 4.2m area. 3D-POP is the first dataset of flocking birds with accurate keypoint annotations in 2D and 3D along with bounding boxes and individual identities, and will facilitate the development of solutions for problems of 2D-to-3D markerless pose, trajectory tracking, and identification in birds.
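The semi-automatic annotation idea — recovering keypoint positions from attached-marker positions — can be sketched as a rigid-body transfer: estimate the pose of the marker cluster in each mo-cap frame and apply it to keypoint offsets annotated once in a reference frame. The Kabsch-based helper below is an illustrative stand-in, not the authors' pipeline.

```python
import numpy as np

def kabsch(ref_pts, cur_pts):
    """Rigid transform (R, t) mapping ref_pts onto cur_pts (both (N, 3))."""
    ref_c, cur_c = ref_pts.mean(0), cur_pts.mean(0)
    H = (ref_pts - ref_c).T @ (cur_pts - cur_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cur_c - R @ ref_c

def keypoints_from_markers(markers_t, markers_ref, keypoints_ref):
    """Propagate once-annotated keypoints (e.g. eyes, beak, tail) to a new mo-cap frame."""
    R, t = kabsch(markers_ref, markers_t)
    return keypoints_ref @ R.T + t

# Toy usage: 4 markers on the bird's back, 3 keypoints annotated in the reference frame
markers_ref = np.random.rand(4, 3)
keypoints_ref = np.random.rand(3, 3)
ang = np.deg2rad(30)
R_true = np.array([[np.cos(ang), -np.sin(ang), 0], [np.sin(ang), np.cos(ang), 0], [0, 0, 1]])
t_true = np.array([0.5, -0.2, 0.1])
markers_t = markers_ref @ R_true.T + t_true
print(keypoints_from_markers(markers_t, markers_ref, keypoints_ref))
```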
https://arxiv.org/abs/2303.13174
This paper studies the leader-following consensus of uncertain and nonlinear multi-agent systems against composite attacks (CAs), including Denial of Service (DoS) attacks and actuation attacks (AAs). A double-layer control framework is formulated, where a digital twin layer (TL) is added beside the traditional cyber-physical layer (CPL), inspired by the recent Digital Twin technology. Consequently, the resilient control task against CAs can be divided into two parts: one is distributed estimation against DoS attacks on the TL, and the other is resilient decentralized tracking control against actuation attacks on the CPL. A data-driven scheme is used to deal with both model nonlinearity and model uncertainty, in which only the input and output data of the system are employed throughout the whole control process. First, a distributed observer based on a switching estimation law against DoS is designed on the TL. Second, a distributed model-free adaptive control (DMFAC) protocol based on attack compensation against AAs is designed on the CPL. Moreover, the uniformly ultimately bounded convergence of the consensus error of the proposed double-layer DMFAC algorithm is strictly proved. Finally, simulations verify the effectiveness of the resilient double-layer control scheme.
https://arxiv.org/abs/2303.12823
Object detection is one of the most important and fundamental computer vision tasks and is broadly utilized in pose estimation, object tracking, and instance segmentation models. To obtain training data for object detection models efficiently, many datasets opt to collect their unannotated data in video format, and the annotator needs to draw a bounding box around each object in the images. Annotating every frame from a video is costly and inefficient, since many frames contain very similar information for the model to learn from. How to select the most informative frames from a video to annotate has become a highly practical task, yet it has attracted little attention in research. In this paper, we propose a novel active learning algorithm for object detection models to tackle this problem. In the proposed active learning algorithm, both the classification and localization informativeness of unlabelled data are measured and aggregated. Utilizing the temporal information from video frames, two novel localization informativeness measurements are proposed. Furthermore, a weight curve is proposed to avoid querying adjacent frames. The proposed active learning algorithm was evaluated with multiple configurations on the MuPoTS and FootballPD datasets.
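A hedged sketch of the selection logic described above: per-frame classification and localization informativeness are aggregated, and a weight curve suppresses frames adjacent to already-queried ones. The scoring signals and the specific curve are illustrative assumptions, not the paper's exact measurements.

```python
import numpy as np

def select_frames(cls_scores, loc_scores, n_select, radius=10):
    """Greedily pick informative frames while down-weighting neighbours of picks.

    cls_scores / loc_scores: per-frame informativeness in [0, 1] (e.g. entropy of the
    class posteriors and temporal box instability); both are illustrative signals.
    """
    info = 0.5 * np.asarray(cls_scores) + 0.5 * np.asarray(loc_scores)
    n = len(info)
    weights = np.ones(n)
    picked = []
    for _ in range(n_select):
        idx = int(np.argmax(info * weights))
        picked.append(idx)
        # Weight curve: suppress frames near the picked one so adjacent,
        # nearly identical frames are not queried again.
        dist = np.abs(np.arange(n) - idx)
        weights *= np.clip(dist / radius, 0.0, 1.0)
    return sorted(picked)

# Usage with random scores for a 200-frame clip
rng = np.random.default_rng(0)
print(select_frames(rng.random(200), rng.random(200), n_select=5))
```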
https://arxiv.org/abs/2303.12760
Siamese network-based trackers have developed rapidly in the field of visual object tracking in recent years. The majority of Siamese network-based trackers now in use treat each channel in the feature maps generated by the backbone network equally, making the similarity response map sensitive to background influence and hence making it challenging to focus on the target region. Additionally, there are no structural links between the classification and regression branches in these trackers, and the two branches are optimized separately during training. Therefore, there is a misalignment between the classification and regression branches, which leads to less accurate tracking results. In this paper, a Target Highlight Module is proposed to help the generated similarity response maps focus more on the target region. To reduce the misalignment and produce more precise tracking results, we propose a corrective loss to train the model. The two branches of the model are jointly tuned with the corrective loss to produce more reliable prediction results. Experiments on 5 challenging benchmark datasets reveal that the method outperforms current models in terms of performance and runs at 38 fps, proving its effectiveness and efficiency.
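The channel-weighting idea can be illustrated with a small sketch: score each backbone channel on the template, then compute a depth-wise cross-correlation with those weights so background-dominated channels contribute less to the similarity response map. The heuristic channel score below is an assumption for illustration only; the actual Target Highlight Module is a learned component.

```python
import numpy as np

def channel_weights(template_feat):
    """Score each channel by how peaky its template activation is (illustrative heuristic)."""
    c = template_feat.shape[0]
    flat = template_feat.reshape(c, -1)
    energy = flat.max(axis=1) - flat.mean(axis=1)
    e = np.exp(energy - energy.max())
    return e / e.sum()

def weighted_correlation(search_feat, template_feat, weights):
    """Depth-wise correlation of re-weighted channels -> similarity response map."""
    c, th, tw = template_feat.shape
    _, sh, sw = search_feat.shape
    out = np.zeros((sh - th + 1, sw - tw + 1))
    for k in range(c):
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = search_feat[k, i:i + th, j:j + tw]
                out[i, j] += weights[k] * np.sum(patch * template_feat[k])
    return out

# Toy usage with 8-channel features
template = np.random.rand(8, 6, 6)
search = np.random.rand(8, 22, 22)
response = weighted_correlation(search, template, channel_weights(template))
print(response.shape)   # (17, 17)
```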
https://arxiv.org/abs/2303.12304
Object tracking (OT) aims to estimate the positions of target objects in a video sequence. Depending on whether the initial states of target objects are specified by annotations provided in the first frame or by categories, OT can be classified into instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. Combining the advantages of the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, where tracking supplements appearance priors for detection and detection provides tracking with candidate bounding boxes for association. Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline. Extensive experiments on 7 tracking datasets, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.
https://arxiv.org/abs/2303.12079
3D single object tracking in LiDAR point clouds (LiDAR SOT) plays a crucial role in autonomous driving. Current approaches all follow the Siamese paradigm based on appearance matching. However, LiDAR point clouds are usually textureless and incomplete, which hinders effective appearance matching. Besides, previous methods greatly overlook the critical motion clues among targets. In this work, beyond 3D Siamese tracking, we introduce a motion-centric paradigm to handle LiDAR SOT from a new perspective. Following this paradigm, we propose a matching-free two-stage tracker, M^2-Track. In the first stage, M^2-Track localizes the target within successive frames via motion transformation. It then refines the target box through motion-assisted shape completion in the second stage. Due to its motion-centric nature, our method shows impressive generalizability with limited training labels and provides good differentiability for end-to-end cycle training. This inspires us to explore semi-supervised LiDAR SOT by incorporating a pseudo-label-based motion augmentation and a self-supervised loss term. Under the fully-supervised setting, extensive experiments confirm that M^2-Track significantly outperforms previous state-of-the-art methods on three large-scale datasets while running at 57 FPS (~8%, ~17%, and ~22% precision gains on KITTI, NuScenes, and the Waymo Open Dataset, respectively). Under the semi-supervised setting, our method performs on par with or even surpasses its fully-supervised counterpart using fewer than half of the labels from KITTI. Further analysis verifies each component's effectiveness and shows the motion-centric paradigm's promising potential for auto-labeling and unsupervised domain adaptation.
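The first-stage idea, localizing by motion rather than appearance matching, reduces to applying a predicted inter-frame target motion to the previous box. The sketch below assumes the relative motion has already been regressed by a network and only shows the box update; the parameterisation is illustrative.

```python
import numpy as np

def apply_relative_motion(prev_box, motion):
    """Stage-1 sketch of a motion-centric tracker: move last frame's box by the
    predicted inter-frame target motion instead of matching appearance.

    prev_box: (cx, cy, cz, l, w, h, yaw); motion: (dx, dy, dz, dyaw) expressed in the
    previous box's local frame (illustrative parameterisation).
    """
    cx, cy, cz, l, w, h, yaw = prev_box
    dx, dy, dz, dyaw = motion
    # Rotate the local translation into the world frame, then shift the centre.
    cx += dx * np.cos(yaw) - dy * np.sin(yaw)
    cy += dx * np.sin(yaw) + dy * np.cos(yaw)
    cz += dz
    return np.array([cx, cy, cz, l, w, h, yaw + dyaw])

prev_box = np.array([10.0, 4.0, -1.0, 4.2, 1.8, 1.6, 0.3])
print(apply_relative_motion(prev_box, (0.8, 0.05, 0.0, 0.02)))
```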
https://arxiv.org/abs/2303.12535
Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description. Existing algorithms solve this issue in two steps, visual grounding and tracking, and accordingly deploy a separate grounding model and tracking model to implement these two steps, respectively. Such a separated framework overlooks the link between visual grounding and tracking, namely that the natural language descriptions provide global semantic cues for localizing the target in both steps. Besides, the separated framework can hardly be trained end-to-end. To handle these issues, we propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task: localizing the referred target based on the given visual-language references. Specifically, we propose a multi-source relation modeling module to effectively build the relation between the visual-language references and the test image. In addition, we design a temporal modeling module to provide a temporal clue with the guidance of the global semantic information for our model, which effectively improves the adaptability to appearance variations of the target. Extensive experimental results on TNL2K, LaSOT, OTB99, and RefCOCOg demonstrate that our method performs favorably against state-of-the-art algorithms for both tracking and grounding. Code is available at this https URL.
https://arxiv.org/abs/2303.12027
Non-line-of-sight (NLOS) tracking has drawn increasing attention in recent years, due to its ability to detect object motion out of sight. Most previous works on NLOS tracking rely on active illumination, e.g., laser, and suffer from high cost and elaborate experimental conditions. Besides, these techniques are still far from practical application due to oversimplified settings. In contrast, we propose a purely passive method to track a person walking in an invisible room by only observing a relay wall, which is more in line with real application scenarios, e.g., security. To excavate imperceptible changes in videos of the relay wall, we introduce difference frames as an essential carrier of temporal-local motion messages. In addition, we propose PAC-Net, which consists of alternating propagation and calibration, making it capable of leveraging both dynamic and static messages on a frame-level granularity. To evaluate the proposed method, we build and publish the first dynamic passive NLOS tracking dataset, NLOS-Track, which fills the vacuum of realistic NLOS datasets. NLOS-Track contains thousands of NLOS video clips and corresponding trajectories. Both real-shot and synthetic data are included.
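The difference-frame carrier is straightforward to compute: subtract consecutive relay-wall frames so the static wall cancels and only the faint, temporal-local changes remain. The amplification gain below is an illustrative choice, not a value from the paper.

```python
import numpy as np

def difference_frames(video, gain=20.0):
    """video: (T, H, W) float array of relay-wall frames in [0, 1].

    Consecutive differences remove the (static) wall appearance and keep only the
    imperceptible temporal changes caused by the hidden person; `gain` is an
    illustrative amplification factor for visualisation.
    """
    diff = np.diff(video, axis=0)                                 # (T-1, H, W)
    diff = diff - diff.mean(axis=(1, 2), keepdims=True)           # remove per-frame offset
    return np.clip(0.5 + gain * diff, 0.0, 1.0)

video = np.random.rand(16, 64, 64) * 0.01 + 0.5                   # nearly static wall
print(difference_frames(video).shape)                             # (15, 64, 64)
```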
https://arxiv.org/abs/2303.11791
With the development of deep learning technology, facial manipulation systems have become powerful and easy to use. Such systems can modify the attributes of the given facial images, such as hair color, gender, and age. Malicious applications of such systems pose a serious threat to individuals' privacy and reputation. Existing studies have proposed various approaches to protect images against facial manipulations. Passive defense methods aim to detect whether the face is real or fake, which works for posterior forensics but cannot prevent malicious manipulation. Initiative defense methods protect images upfront by injecting adversarial perturbations into images to disrupt facial manipulation systems, but they cannot identify whether the image is fake. To address the limitations of existing methods, we propose a novel two-tier protection method named Information-containing Adversarial Perturbation (IAP), which provides more comprehensive protection for facial images. We use an encoder to map a facial image and its identity message to a cross-model adversarial example which can disrupt multiple facial manipulation systems to achieve initiative protection. Recovering the message in adversarial examples with a decoder serves as passive protection, contributing to provenance tracking and fake image detection. We introduce a feature-level correlation measurement that is more suitable than the commonly used mean squared error for measuring the difference between facial images. Moreover, we propose a spectral diffusion method to spread messages to different frequency channels, thereby improving the robustness of the message against facial manipulation. Extensive experimental results demonstrate that our proposed IAP can recover the messages from the adversarial examples with high average accuracy and effectively disrupt the facial manipulation systems.
https://arxiv.org/abs/2303.11625
Localization plays a critical role in the field of distributed swarm robotics. Previous work has highlighted the potential of relative localization for position tracking in multi-robot systems. Ultra-wideband (UWB) technology provides a good estimation of the relative position between robots but suffers from some limitations. This paper proposes improving the relative localization functionality developed in our previous work, which is based on UWB technology. Our new approach merges UWB telemetry and kinematic model into an extended Kalman filter to properly track the relative position of robots. We performed a simulation and validated the improvements in relative distance and angle accuracy for the proposed approach. An additional analysis was conducted to observe the increase in performance when the robots share their control inputs.
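A minimal sketch of the fusion described above: an extended Kalman filter propagates the relative pose with a unicycle kinematic model driven by the shared control inputs and corrects it with a UWB range measurement. The state layout, motion model discretisation, and noise values are illustrative assumptions, not the authors' exact filter.

```python
import numpy as np

def ekf_step(x, P, u, z_range, dt, Q, R):
    """x = [dx, dy, dtheta]: relative pose of robot B expressed in robot A's frame.
    u = (vA, wA, vB, wB): odometry/control inputs shared by both robots.
    z_range: UWB distance measurement between the two robots."""
    dx, dy, dth = x
    vA, wA, vB, wB = u
    # --- Predict: relative-motion kinematic model (first-order discretisation) ---
    x_pred = np.array([
        dx + (vB * np.cos(dth) - vA + wA * dy) * dt,
        dy + (vB * np.sin(dth) - wA * dx) * dt,
        dth + (wB - wA) * dt,
    ])
    F = np.array([
        [1.0,      wA * dt, -vB * np.sin(dth) * dt],
        [-wA * dt, 1.0,      vB * np.cos(dth) * dt],
        [0.0,      0.0,      1.0],
    ])
    P_pred = F @ P @ F.T + Q
    # --- Update with the UWB range ---
    r = np.hypot(x_pred[0], x_pred[1])
    H = np.array([[x_pred[0] / r, x_pred[1] / r, 0.0]])
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + (K * (z_range - r)).ravel()
    P_new = (np.eye(3) - K @ H) @ P_pred
    return x_new, P_new

x, P = np.array([1.5, 0.5, 0.1]), np.eye(3) * 0.1
Q, R = np.eye(3) * 1e-3, np.array([[0.05**2]])
x, P = ekf_step(x, P, u=(0.2, 0.0, 0.25, 0.05), z_range=1.6, dt=0.1, Q=Q, R=R)
print(x)
```

In this sketch, the benefit of sharing control inputs appears in the prediction step, where both robots' velocities drive the relative-motion model before the UWB range correction is applied.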
https://arxiv.org/abs/2303.11443
3D object detectors usually rely on hand-crafted proxies, e.g., anchors or centers, and translate well-studied 2D frameworks to 3D. Thus, sparse voxel features need to be densified and processed by dense prediction heads, which inevitably costs extra computation. In this paper, we instead propose VoxelNeXt for fully sparse 3D object detection. Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects entirely through voxel features. It is an elegant and efficient framework, with no need for sparse-to-dense conversion or NMS post-processing. Our method achieves a better speed-accuracy trade-off than other mainstream detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works decently for LiDAR 3D object detection and tracking. Extensive experiments on the nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach. Without bells and whistles, our model outperforms all existing LiDAR methods on the nuScenes tracking test benchmark.
https://arxiv.org/abs/2303.11301
Although there have been considerable research efforts on controllable facial image editing, the desirable interactive setting where the users can interact with the system to adjust their requirements dynamically hasn't been well explored. This paper focuses on facial image editing via dialogue and introduces a new benchmark dataset, Multi-turn Interactive Image Editing (I2Edit), for evaluating image editing quality and interaction ability in real-world interactive facial editing scenarios. The dataset is constructed upon the CelebA-HQ dataset with images annotated with a multi-turn dialogue that corresponds to the user editing requirements. I2Edit is challenging, as it needs to 1) track the dynamically updated user requirements and edit the images accordingly, as well as 2) generate the appropriate natural language response to communicate with the user. To address these challenges, we propose a framework consisting of a dialogue module and an image editing module. The former is for user edit requirements tracking and generating the corresponding indicative responses, while the latter edits the images conditioned on the tracked user edit requirements. In contrast to previous works that simply treat multi-turn interaction as a sequence of single-turn interactions, we extract the user edit requirements from the whole dialogue history instead of the current single turn. The extracted global user edit requirements enable us to directly edit the input raw image to avoid error accumulation and attribute forgetting issues. Extensive quantitative and qualitative experiments on the I2Edit dataset demonstrate the advantage of our proposed framework over the previous single-turn methods. We believe our new dataset could serve as a valuable resource to push forward the exploration of real-world, complex interactive image editing. Code and data will be made public.
https://arxiv.org/abs/2303.11108
Most previous progress in object tracking has been realized in daytime scenes with favorable illumination. State-of-the-art trackers can hardly maintain their superiority at night, which considerably hinders the broadening of visual tracking-related unmanned aerial vehicle (UAV) applications. To realize reliable UAV tracking at night, a spatial-channel Transformer-based low-light enhancer (namely SCT), which is trained in a novel task-inspired manner, is proposed and plugged in prior to tracking approaches. To achieve semantic-level low-light enhancement targeting the high-level task, the novel spatial-channel attention module is proposed to model global information while preserving local context. In the enhancement process, SCT denoises and illuminates nighttime images simultaneously through a robust non-linear curve projection. Moreover, to provide a comprehensive evaluation, we construct a challenging nighttime tracking benchmark, namely DarkTrack2021, which contains 110 challenging sequences with over 100K frames in total. Evaluations on both the public UAVDark135 benchmark and the newly constructed DarkTrack2021 benchmark show that the task-inspired design enables SCT to achieve significant performance gains for nighttime UAV tracking compared with other top-ranked low-light enhancers. Real-world tests on a typical UAV platform further verify the practicability of the proposed approach. The DarkTrack2021 benchmark and the code of the proposed approach are publicly available at this https URL.
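The non-linear curve projection can be illustrated with a generic curve-based enhancement step in the spirit of existing curve-layer enhancers: brighten each pixel by iteratively applying a quadratic curve whose coefficient would normally be predicted per pixel by the enhancer network. This is a hedged stand-in; SCT's actual projection and its spatial-channel Transformer are not reproduced here.

```python
import numpy as np

def apply_curve(img, alpha, iterations=4):
    """Iteratively brighten img in [0, 1] with the quadratic curve
    x <- x + alpha * x * (1 - x); alpha in [-1, 1] may be a scalar or a
    per-pixel map predicted by an enhancer network (illustrative)."""
    x = img.astype(np.float64)
    for _ in range(iterations):
        x = x + alpha * x * (1.0 - x)
    return np.clip(x, 0.0, 1.0)

night = np.random.rand(64, 64, 3) * 0.15          # dark nighttime frame
enhanced = apply_curve(night, alpha=0.8)
print(night.mean(), enhanced.mean())               # mean brightness increases
```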
https://arxiv.org/abs/2303.10951
In this work, we propose a simultaneous localization and mapping (SLAM) system using a monocular camera and ultra-wideband (UWB) sensors. Our system, referred to as VRSLAM, is a multi-stage framework that leverages the strengths and compensates for the weaknesses of each sensor. Firstly, we introduce a UWB-aided 7 degree-of-freedom (scale factor, 3D position, and 3D orientation) global alignment module to initialize the visual odometry (VO) system in the world frame defined by the UWB anchors. This module loosely fuses up-to-scale VO and ranging data using either a quadratically constrained quadratic programming (QCQP) or nonlinear least squares (NLS) algorithm, depending on whether a good initial guess is available. Secondly, we provide an accompanying theoretical analysis that includes the derivation and interpretation of the Fisher Information Matrix (FIM) and its determinant. Thirdly, we present UWB-aided bundle adjustment (UBA) and UWB-aided pose graph optimization (UPGO) modules to improve short-term odometry accuracy, reduce long-term drift, and correct any alignment and scale errors. Extensive simulations and experiments show that our solution outperforms UWB/camera-only and previous approaches, can quickly recover from tracking failure without relying on visual relocalization, and can effortlessly obtain a global map even if there are no loop closures.
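The NLS branch of the 7-DoF alignment can be sketched as a small SciPy least-squares problem: find the scale, rotation, and translation that make the transformed up-to-scale VO positions consistent with UWB range measurements. The single-anchor toy setup, parameterisation, and initial guess below are assumptions for illustration, not the paper's formulation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, vo_positions, anchor, ranges):
    """params = [log_scale, yaw, pitch, roll, tx, ty, tz]: 7-DoF alignment of the
    up-to-scale VO trajectory into the UWB anchor frame (illustrative parameterisation)."""
    s = np.exp(params[0])
    R = Rotation.from_euler("zyx", params[1:4]).as_matrix()
    t = params[4:7]
    world = s * (vo_positions @ R.T) + t
    return np.linalg.norm(world - anchor, axis=1) - ranges

# Toy data: the VO trajectory is reported at twice the true scale and rotated in yaw
rng = np.random.default_rng(1)
true_traj = rng.random((40, 3)) * 5.0
anchor = np.array([2.0, -1.0, 0.5])
ranges = np.linalg.norm(true_traj - anchor, axis=1) + rng.normal(0, 0.02, 40)
vo = 2.0 * true_traj @ Rotation.from_euler("z", 0.4).as_matrix()

sol = least_squares(residuals, x0=np.zeros(7), args=(vo, anchor, ranges))
print(np.exp(sol.x[0]))   # recovered scale factor (close to 0.5 on this toy problem)
```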
https://arxiv.org/abs/2303.10903
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based parameters. Albeit effective, this manner is not optimal due to the scarcity of downstream data and poor transferability, etc. In this paper, inspired by the recent success of the prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, meanwhile only introducing a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT can achieve state-of-the-art performance while satisfying parameter efficiency. Code and models are available at this https URL.
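The parameter-efficiency argument can be sketched in a few lines of PyTorch: freeze a pre-trained RGB backbone and learn only a small prompt generator that injects modality-specific tokens into the frozen token stream. The module shapes and the stand-in backbone below are illustrative assumptions; ViPT's actual prompt blocks differ.

```python
import torch
import torch.nn as nn

class PromptedTracker(nn.Module):
    """Freeze the RGB foundation backbone; train only tiny modal-relevant prompts."""
    def __init__(self, rgb_backbone, embed_dim=256, num_prompts=8):
        super().__init__()
        self.backbone = rgb_backbone
        for p in self.backbone.parameters():
            p.requires_grad = False                       # keep the foundation model frozen
        # Prompts are produced from the auxiliary modality (depth/thermal/event map)
        self.prompt_proj = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(1, embed_dim * num_prompts))
        self.num_prompts, self.embed_dim = num_prompts, embed_dim

    def forward(self, rgb, aux):
        feat = self.backbone(rgb)                         # (B, C, H', W'), frozen features
        tokens = feat.flatten(2).transpose(1, 2)          # (B, N, C) RGB tokens
        prompts = self.prompt_proj(aux).view(-1, self.num_prompts, self.embed_dim)
        return torch.cat([prompts, tokens], dim=1)        # prompt tokens join the frozen stream

# Stand-in "foundation model": a single patch-embedding convolution
model = PromptedTracker(nn.Conv2d(3, 256, kernel_size=16, stride=16))
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1, 224, 224))
print(out.shape)                                          # (2, 204, 256): 8 prompts + 196 RGB tokens

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.3%}")     # only the prompt generator is trained
```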
https://arxiv.org/abs/2303.10826
Social ambiance describes the context in which social interactions happen, and can be measured using speech audio by counting the number of concurrent speakers. This measurement has enabled various mental health tracking and human-centric IoT applications. While on-device Social Ambiance Measure (SAM) is highly desirable to ensure user privacy and thus facilitate wide adoption of the aforementioned applications, the required computational complexity of state-of-the-art deep neural network (DNN)-powered SAM solutions stands at odds with the often constrained resources on mobile devices. Furthermore, only limited labeled data is available or practical for SAM under clinical settings due to various privacy constraints and the required human effort, further challenging the achievable accuracy of on-device SAM solutions. To this end, we propose a dedicated neural architecture search framework for Energy-efficient and Real-time SAM (ERSAM). Specifically, our ERSAM framework can automatically search for DNNs that push forward the achievable accuracy vs. hardware efficiency frontier of mobile SAM solutions. For example, ERSAM-delivered DNNs consume only 40 mW x 12 h of energy and 0.05 seconds of processing latency for a 5-second audio segment on a Pixel 3 phone, while achieving an error rate of only 14.3% on a social ambiance dataset generated from LibriSpeech. We can expect that our ERSAM framework will pave the way for the ubiquitous on-device SAM solutions that are in growing demand.
https://arxiv.org/abs/2303.10727
Simultaneous odometry and mapping using LiDAR data is an important task for mobile systems to achieve full autonomy in large-scale environments. However, most existing LiDAR-based methods prioritize tracking quality over reconstruction quality. Although the recently developed neural radiance fields (NeRF) have shown promising advances in implicit reconstruction for indoor environments, the problem of simultaneous odometry and mapping for large-scale scenarios using incremental LiDAR data remains unexplored. To bridge this gap, in this paper, we propose a novel NeRF-based LiDAR odometry and mapping approach, NeRF-LOAM, consisting of three modules: neural odometry, neural mapping, and mesh reconstruction. All these modules utilize our proposed neural signed distance function, which separates LiDAR points into ground and non-ground points to reduce Z-axis drift, optimizes odometry and voxel embeddings concurrently, and in the end generates dense, smooth mesh maps of the environment. Moreover, this joint optimization allows our NeRF-LOAM to be pre-training-free and to exhibit strong generalization abilities when applied to different environments. Extensive evaluations on three publicly available datasets demonstrate that our approach achieves state-of-the-art odometry and mapping performance, as well as strong generalization in large-scale environments utilizing LiDAR data. Furthermore, we perform multiple ablation studies to validate the effectiveness of our network design. The implementation of our approach will be made available at this https URL.
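The ground/non-ground separation that feeds the neural signed distance function can be approximated with a simple RANSAC-style horizontal-plane fit, shown below as an illustrative geometric stand-in; the neural SDF and the joint odometry/map optimization are not reproduced.

```python
import numpy as np

def split_ground(points, n_iters=100, threshold=0.1, rng=None):
    """Separate LiDAR points (N, 3) into ground / non-ground via a RANSAC plane fit.

    A simple geometric stand-in for the ground-separation step; the neural SDF that
    consumes the two point sets is not shown.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue
        normal /= norm
        if abs(normal[2]) < 0.8:                 # ground planes are roughly horizontal
            continue
        dist = np.abs((points - sample[0]) @ normal)
        mask = dist < threshold
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return points[best_mask], points[~best_mask]

# Toy scan: a flat ground patch plus scattered non-ground structure
ground = np.c_[np.random.rand(500, 2) * 20, np.random.randn(500) * 0.02]
objects = np.random.rand(200, 3) * [20, 20, 3] + [0, 0, 0.5]
g, ng = split_ground(np.vstack([ground, objects]))
print(len(g), len(ng))
```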
https://arxiv.org/abs/2303.10709
Markerless motion capture using computer vision and human pose estimation (HPE) has the potential to expand access to precise movement analysis. This could greatly benefit rehabilitation by enabling more accurate tracking of outcomes and providing more sensitive tools for research. There are numerous steps between obtaining videos and extracting accurate biomechanical results, and there is limited research to guide many of the critical design decisions in these pipelines. In this work, we analyze several of these steps, including the algorithm used to detect keypoints and the keypoint set, the approach to reconstructing trajectories for biomechanical inverse kinematics, and the optimization of the IK process. Several features we find important are: 1) using a recent algorithm trained on many datasets that produces a dense set of biomechanically-motivated keypoints, 2) using an implicit representation to reconstruct smooth, anatomically constrained marker trajectories for IK, 3) iteratively optimizing the biomechanical model to match the dense markers, and 4) appropriate regularization of the IK process. Our pipeline makes it easy to obtain accurate biomechanical estimates of movement in a rehabilitation hospital.
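Point 2 above, reconstructing smooth marker trajectories before inverse kinematics, can be illustrated with a simple smoothing-spline stand-in for the implicit representation: fit one spline per keypoint coordinate over time and evaluate it on the original timestamps. The smoothing weight and data layout are illustrative assumptions, not the paper's pipeline.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def smooth_trajectory(times, keypoints, smoothing=1e-3):
    """keypoints: (T, K, 3) noisy per-frame 3D keypoints from an HPE model.

    Fit one smoothing spline per keypoint coordinate; a simple stand-in for the
    implicit trajectory representation used before inverse kinematics."""
    T, K, _ = keypoints.shape
    out = np.empty_like(keypoints)
    for k in range(K):
        for d in range(3):
            spline = UnivariateSpline(times, keypoints[:, k, d], s=smoothing * T)
            out[:, k, d] = spline(times)
    return out

# Toy usage: one keypoint moving sinusoidally with detection noise
t = np.linspace(0, 2, 120)
clean = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t), 0.1 * t], axis=1)
noisy = clean[:, None, :] + np.random.normal(0, 0.02, (120, 1, 3))
smoothed = smooth_trajectory(t, noisy)
print(np.abs(smoothed - clean[:, None, :]).mean())   # residual after smoothing
```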
https://arxiv.org/abs/2303.10654