Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has delivered empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze, especially when attempting to predict 3D information from 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an \emph{Inverse Rendering (IR)} problem: we optimize, via a differentiable rendering pipeline, over the latent space of pre-trained 3D object representations to retrieve the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Beyond offering an alternate take on tracking, our method enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both datasets are entirely unseen by our method, which requires no fine-tuning. Videos and code are available at this https URL.
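The latent-space optimization described in this abstract can be sketched with a toy stand-in: a fixed linear "decoder" plays the role of the differentiable renderer (the paper's actual pipeline is far richer), and gradient descent on a pixel-wise L2 image loss recovers the latent code. All names and shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Toy stand-in for inverse rendering over a latent space: a fixed linear
# "decoder" maps a latent code to an image, and we recover the latent by
# gradient descent on a pixel-wise L2 image loss.
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 8))   # hypothetical decoder: 8-D latent -> 64 "pixels"

def render(z):
    return D @ z                   # stand-in for a differentiable renderer

z_true = rng.standard_normal(8)    # latent of the observed object
target = render(z_true)            # observed image crop

z = np.zeros(8)                    # initial latent guess
lr = 0.01
for _ in range(500):
    residual = render(z) - target  # image-space error
    grad = D.T @ residual          # analytic gradient of 0.5 * ||residual||^2
    z -= lr * grad

loss = 0.5 * np.sum((render(z) - target) ** 2)
```

With a linear decoder the loss is convex, so the iteration recovers the true latent; the point of the sketch is only the optimize-through-the-renderer structure.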
https://arxiv.org/abs/2404.12359
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) in computer vision has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we question whether the fine-tuning performance of extremely simple, small-scale ViTs can also benefit from this pre-training paradigm, which remains considerably less studied than the well-established methodology of designing lightweight architectures with sophisticated components. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe different behaviors between MIM and CL with respect to the downstream fine-tuning data scales. Furthermore, we analyze the frozen features under linear probing evaluation, as well as the layer representation similarities and attention maps across the obtained models, which clearly show the inferior learning of MIM pre-training on higher layers, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding naturally guides the choice of appropriate distillation strategies during pre-training to solve the above deterioration problem. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical designs (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and the LaSOT visual tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all current SOTA lightweight CPU-realtime trackers.
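The generic MIM objective this abstract builds on can be sketched in a few lines: mask a fraction of patch tokens and score reconstruction only at masked positions. This is a textbook-style formulation with made-up tensors, not any specific method's recipe from the paper.

```python
import numpy as np

# Generic masked-image-modeling objective sketch: mask ~75% of patch tokens
# and compute the reconstruction loss only over the masked positions.
rng = np.random.default_rng(0)
patches = rng.standard_normal((196, 768))   # 14x14 grid of patch embeddings
mask = rng.random(196) < 0.75               # 75% masking ratio

# Stand-in for a decoder's reconstruction of the original patches
predicted = patches + 0.1 * rng.standard_normal(patches.shape)

# Loss is averaged over masked patches only, as in typical MIM setups
loss = np.mean((predicted[mask] - patches[mask]) ** 2)
```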
https://arxiv.org/abs/2404.12210
Event-based eye tracking has shown great promise with the high temporal resolution and low redundancy provided by the event camera. However, the diversity and abruptness of eye movement patterns, including blinking, fixating, saccades, and smooth pursuit, pose significant challenges for eye localization. To achieve a stable event-based eye-tracking system, this paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism to fully utilize contextual temporal information in response to the variability of eye movements. Specifically, the MambaPupil network is proposed, which consists of a multi-layer convolutional encoder that extracts features from the event representations, a bidirectional Gated Recurrent Unit (GRU), and a Linear Time-Varying State Space Module (LTV-SSM) that selectively captures contextual correlation from the forward and backward temporal relationships. Furthermore, Bina-rep is utilized as a compact event representation, and a tailor-made data augmentation, called Event-Cutout, is proposed to enhance the model's robustness by applying spatial random masking to the event image. Evaluation on the ThreeET-plus benchmark shows the superior performance of MambaPupil, which secured 1st place in the CVPR 2024 AIS Event-based Eye Tracking challenge.
https://arxiv.org/abs/2404.12083
A new trend in the multi-object tracking task is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders its progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, named Refer-UE-City, which primarily includes scenes from intersection surveillance videos, detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions, and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object framework called MLS-Track, in which the interaction between the model and text is enhanced layer by layer through the introduction of a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.
https://arxiv.org/abs/2404.12031
Dynamic and continuous jumping remains an open yet challenging problem in bipedal robot control. The choice of dynamic models in trajectory optimization (TO) problems plays a huge role in trajectory accuracy and computation efficiency, which normally cannot be ensured simultaneously. In this letter, we propose a novel adaptive-model optimization approach, a unified framework of Adaptive-model TO and Adaptive-frequency Model Predictive Control (MPC), to effectively realize continuous and robust jumping on the HECTOR bipedal robot. The proposed Adaptive-model TO fuses adaptive-fidelity dynamics models of bipedal jumping motion, matching the model-fidelity needs of different jumping phases, to ensure trajectory accuracy and computation efficiency. In addition, conventional approaches use unsynchronized sampling frequencies in TO and real-time control, leaving the framework with mismatched modeling resolutions. We adapt the MPC sampling frequency to the TO trajectory resolution in different phases for effective trajectory tracking. In hardware experiments, we demonstrated robust and dynamic jumps covering a distance of up to 40 cm (57% of robot height). To verify the repeatability of this experiment, we ran 53 jumping experiments and achieved a 90% success rate. In continuous jumps, we demonstrate continuous bipedal jumping with terrain height perturbations (up to 5 cm) and discontinuities (up to 20 cm gaps).
https://arxiv.org/abs/2404.11807
This paper performs the crucial work of establishing a baseline for gaze-driven authentication performance, beginning to answer fundamental research questions using a very large dataset of gaze recordings from 9,202 people, with eye tracking (ET) signal quality equivalent to modern consumer-facing virtual reality (VR) platforms. The employed dataset is at least an order of magnitude larger than any dataset from previous related work. Our model requires binocular estimates of the optical and visual axes of the eyes and a minimum duration for enrollment and verification to achieve a false rejection rate (FRR) below 3% at a false acceptance rate (FAR) of 1 in 50,000. Identification accuracy decreases with gallery size; we estimate that our model would fall below chance-level accuracy for gallery sizes of 148,000 or more. Our major findings indicate that gaze authentication can be as accurate as required by the FIDO standard when driven by a state-of-the-art machine learning architecture and a sufficiently large training dataset.
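Operating points like the FRR/FAR figures quoted above are read off the genuine and impostor match-score distributions at a chosen threshold. A minimal sketch with made-up Gaussian scores (the paper's model and score scales are of course different):

```python
import numpy as np

# FAR/FRR from match scores at a threshold. Score distributions here are
# illustrative stand-ins, not the paper's authentication model.
rng = np.random.default_rng(1)
genuine = rng.normal(2.0, 1.0, 10000)     # same-person comparison scores
impostor = rng.normal(-2.0, 1.0, 10000)   # different-person comparison scores

def far_frr(threshold):
    far = np.mean(impostor >= threshold)  # impostors wrongly accepted
    frr = np.mean(genuine < threshold)    # genuine users wrongly rejected
    return far, frr

far, frr = far_frr(0.0)
```

Sweeping the threshold trades FAR against FRR; the abstract's operating point (FRR < 3% at FAR = 1/50,000) corresponds to one such threshold on far better-separated score distributions.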
https://arxiv.org/abs/2404.11798
This survey reviews the AIS 2024 Event-Based Eye Tracking (EET) Challenge. The challenge task focuses on processing eye movement recorded with event cameras and predicting the pupil center of the eye. The challenge emphasizes efficient eye tracking with event cameras to achieve a good trade-off between task accuracy and efficiency. During the challenge period, 38 participants registered for the Kaggle competition, and 8 teams submitted a challenge factsheet. The novel and diverse methods from the submitted factsheets are reviewed and analyzed in this survey to advance future event-based eye tracking research.
https://arxiv.org/abs/2404.11770
This work introduces a motion retargeting approach for legged robots, which aims to create motion controllers that imitate the fine behavior of animals. Our approach, namely spatio-temporal motion retargeting (STMR), guides imitation learning procedures by transferring motion from source to target, effectively bridging the morphological disparities by ensuring the feasibility of imitation on the target system. Our STMR method comprises two components: spatial motion retargeting (SMR) and temporal motion retargeting (TMR). On the one hand, SMR tackles motion retargeting at the kinematic level by generating kinematically feasible whole-body motions from keypoint trajectories. On the other hand, TMR aims to retarget motion at the dynamic level by optimizing motion in the temporal domain. We showcase the effectiveness of our method in facilitating Imitation Learning (IL) for complex animal movements through a series of simulation and hardware experiments. In these experiments, our STMR method successfully tailored complex animal motions from various media, including video captured by a hand-held camera, to fit the morphology and physical properties of the target robots. This enabled reinforcement learning (RL) policy training for precise motion tracking, while baseline methods struggled with highly dynamic motion involving flying phases. Moreover, we validated that the control policy can successfully imitate six different motions on two quadruped robots with different dimensions and physical properties in real-world settings.
https://arxiv.org/abs/2404.11557
Increasing the efficiency of trajectory annotation from videos has the potential to enable the next generation of data-hungry tracking algorithms to thrive on large-scale datasets. Despite the importance of this task, there are currently very few works exploring how to efficiently label tracking datasets comprehensively. In this work, we introduce SPAM, a tracking data engine that provides high-quality labels with minimal human intervention. SPAM is built around two key insights: i) most tracking scenarios can be easily resolved; to take advantage of this, we utilize a pre-trained model to generate high-quality pseudo-labels, reserving human involvement for a smaller subset of more difficult instances; ii) the spatiotemporal dependencies of track annotations across time can be elegantly and efficiently formulated through graphs. Therefore, we use a unified graph formulation to address the annotation of both detections and identity association for tracks across time. Based on these insights, SPAM produces high-quality annotations at a fraction of the ground-truth labeling cost. We demonstrate that trackers trained on SPAM labels achieve comparable performance to those trained on human annotations while requiring only 3-20% of the human labeling effort. Hence, SPAM paves the way towards highly efficient labeling of large-scale tracking datasets. Our code and models will be available upon acceptance.
https://arxiv.org/abs/2404.11426
We present SLAIM - Simultaneous Localization and Implicit Mapping. We propose a novel coarse-to-fine tracking model tailored for Neural Radiance Field SLAM (NeRF-SLAM) to achieve state-of-the-art tracking performance. Notably, existing NeRF-SLAM systems consistently exhibit inferior tracking performance compared to traditional SLAM algorithms. NeRF-SLAM methods solve camera tracking via image alignment and photometric bundle adjustment. These objectives are difficult to optimize due to the narrow basin of attraction of the loss in image space (local minima) and the lack of initial correspondences. We mitigate these limitations by implementing a Gaussian pyramid filter on top of NeRF, facilitating a coarse-to-fine tracking optimization strategy. Furthermore, NeRF systems encounter challenges in converging to the right geometry with limited input views. While prior approaches use a Signed-Distance Function (SDF)-based NeRF and directly supervise SDF values by approximating the ground truth SDF through depth measurements, this often results in suboptimal geometry. In contrast, our method employs a volume density representation and introduces a novel KL regularizer on the ray termination distribution, constraining scene geometry to consist of empty space and opaque surfaces. Our solution implements both local and global bundle adjustment to produce a robust (coarse-to-fine) and accurate (KL regularizer) SLAM solution. We conduct experiments on multiple datasets (ScanNet, TUM, Replica) showing state-of-the-art results in tracking and in reconstruction accuracy.
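The ray termination distribution mentioned above comes from the standard volume-rendering quadrature: per-sample opacities and accumulated transmittance give the probability that a ray terminates at each sample. The sketch below computes those weights for one toy ray; the entropy penalty at the end is a generic concentration measure standing in for the paper's KL regularizer, whose exact form is not reproduced here.

```python
import numpy as np

# Ray-termination distribution from volume densities (standard NeRF quadrature).
sigma = np.array([0.0, 0.0, 5.0, 50.0, 5.0, 0.0])  # densities along one toy ray
delta = 0.1                                        # sample spacing

alpha = 1.0 - np.exp(-sigma * delta)               # per-sample opacity
T = np.cumprod(np.concatenate(([1.0], 1.0 - alpha)))[:-1]  # transmittance to each sample
w = T * alpha                                      # termination probabilities

# Generic stand-in for a regularizer on w: low entropy means the ray mass is
# concentrated on a single opaque surface rather than smeared through space.
wn = w[w > 0]
entropy = -np.sum(wn * np.log(wn))
```

Here nearly all termination mass lands on the high-density sample, i.e. the ray "sees" empty space followed by one opaque surface, which is exactly the geometry the regularizer is meant to encourage.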
https://arxiv.org/abs/2404.11419
Tracking and identifying athletes on the pitch holds a central role in collecting essential insights from the game, such as estimating the total distance covered by players or understanding team tactics. This tracking and identification process is crucial for reconstructing the game state, defined by the athletes' positions and identities on a 2D top-view of the pitch (i.e., a minimap). However, reconstructing the game state from videos captured by a single camera is challenging. It requires understanding the position of the athletes and the viewpoint of the camera to localize and identify players within the field. In this work, we formalize the task of Game State Reconstruction and introduce SoccerNet-GSR, a novel Game State Reconstruction dataset focusing on football videos. SoccerNet-GSR is composed of 200 video sequences of 30 seconds, annotated with 9.37 million line points for pitch localization and camera calibration, as well as over 2.36 million athlete positions on the pitch, each with its respective role, team, and jersey number. Furthermore, we introduce GS-HOTA, a novel metric to evaluate game state reconstruction methods. Finally, we propose and release an end-to-end baseline for game state reconstruction, bootstrapping the research on this task. Our experiments show that GSR is a challenging novel task, which opens the field for future research. Our dataset and codebase are publicly available at this https URL.
https://arxiv.org/abs/2404.11335
The paper presents the methodology used for accuracy and repeatability measurements of the experimental model of a parallel robot developed for surgical applications. The experimental setup uses a motion tracking system (for accuracy) and a high-precision position-measuring arm (for repeatability). Accuracy was obtained by comparing the trajectory data from the experimental measurement with a baseline trajectory defined by the kinematic models of the parallel robotic system. Repeatability was experimentally determined by repeatedly moving the robot platform to predefined points.
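The two metrics described above can be sketched numerically: accuracy as the deviation of a measured trajectory from the kinematic-model baseline, and repeatability as the spread of repeated visits to the same commanded point. The measurements below are made up for illustration (they are not the paper's data), and the dispersion measure is a simple mean deviation, not the full standardized repeatability formula.

```python
import numpy as np

# Accuracy: mean Euclidean deviation of measured points from the
# model-defined baseline trajectory (coordinates in mm, invented values).
baseline = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 5.0], [20.0, 5.0, 5.0]])
measured = baseline + np.array([[0.1, -0.05, 0.0],
                                [0.08, 0.1, -0.1],
                                [-0.1, 0.05, 0.1]])
accuracy = np.mean(np.linalg.norm(measured - baseline, axis=1))

# Repeatability: dispersion of repeated measurements at one predefined point.
repeats = np.array([[10.02, 0.01, 4.99], [9.98, -0.02, 5.01], [10.0, 0.01, 5.0]])
centroid = repeats.mean(axis=0)
repeatability = np.mean(np.linalg.norm(repeats - centroid, axis=1))
```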
https://arxiv.org/abs/2404.11140
Creating large LiDAR datasets with pixel-level labeling poses significant challenges. While numerous data augmentation methods have been developed to reduce the reliance on manual labeling, they predominantly focus on static scenes and overlook the importance of data augmentation for dynamic scenes, which is critical for autonomous driving. To address this issue, we propose D-Aug, a LiDAR data augmentation method tailored for augmenting dynamic scenes. D-Aug extracts objects and inserts them into dynamic scenes, considering the continuity of these objects across consecutive frames. For seamless insertion into dynamic scenes, we propose a reference-guided method that involves dynamic collision detection and rotation alignment. Additionally, we present a pixel-level road identification strategy to efficiently determine suitable insertion positions. We validated our method on the nuScenes dataset with various 3D detection and tracking methods. Comparative experiments demonstrate the superiority of D-Aug.
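Before an insertion-based augmentation like the one described above can place an extracted object into a scene, it must check that the object does not collide with existing ones. A minimal, generic stand-in for such a check is an axis-aligned bounding-box overlap test (the paper's dynamic collision detection across consecutive frames is more involved; box layout and coordinates here are invented):

```python
import numpy as np

# Axis-aligned bounding-box overlap test; boxes are [x_min, y_min, x_max, y_max].
def boxes_collide(a, b):
    return (a[0] < b[2] and b[0] < a[2] and
            a[1] < b[3] and b[1] < a[3])

# Reject a candidate insertion position if it overlaps any existing object.
scene_boxes = np.array([[0.0, 0.0, 4.0, 2.0], [10.0, 10.0, 12.0, 12.0]])
candidate = np.array([5.0, 0.0, 7.0, 2.0])
ok = not any(boxes_collide(candidate, b) for b in scene_boxes)
```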
https://arxiv.org/abs/2404.11127
Vision sensors are versatile and can capture a wide range of visual cues, such as color, texture, shape, and depth. This versatility, along with the relatively inexpensive availability of machine vision cameras, has played an important role in the adoption of vision-based environment perception systems in autonomous vehicles (AVs). However, vision-based perception systems can be easily affected by glare in the presence of a bright light source, such as the sun, the headlights of an oncoming vehicle at night, or simply light reflecting off snow- or ice-covered surfaces, scenarios encountered frequently during driving. In this paper, we investigate various glare reduction techniques, including the proposed saturated pixel-aware glare reduction technique, for improved performance of the computer vision (CV) tasks employed by the perception layer of AVs. We evaluate these glare reduction methods based on various performance metrics of the CV algorithms used by the perception layer. Specifically, we considered object detection, object recognition, object tracking, depth estimation, and lane detection, which are crucial for autonomous driving. The experimental findings validate the efficacy of the proposed glare reduction approach, showcasing enhanced performance across diverse perception tasks and remarkable resilience against varying levels of glare.
https://arxiv.org/abs/2404.10992
Soft-robot designs are manifold, but only a few are publicly available. Often, they are only briefly described in their publications. This complicates reproduction and hinders the reproducibility and comparability of research results. If the designs were uniform and open source, validating researched methods on real benchmark systems would be possible. To address this, we present two variants of a soft pneumatic robot with antagonistic bellows as open source. Starting from a semi-modular design with multiple cables and tubes routed through the robot body, the transition to a fully modular robot with integrated microvalves and serial communication is highlighted. Modularity in terms of stackability, actuation, and communication is achieved, which is the crucial requirement for building soft robots with many degrees of freedom and high dexterity for real-world tasks. Both systems are compared regarding their respective advantages and disadvantages. The robots' functionality is demonstrated in experiments on airtightness, gravitational influence, position control with mean tracking errors below 3 deg, and long-term operation of cast and printed bellows. All software and hardware files required for reproduction are provided.
https://arxiv.org/abs/2404.10734
We present an algorithm for detecting and tracking underwater mobile objects using active acoustic transmission of broadband chirp signals whose reflections are received by a hydrophone array. The method overcomes the problem of a high false alarm rate by applying a track-before-detect approach to the sequence of received reflections. A 2D time-space matrix is created for the reverberations received from each transmitted probe signal by performing delay-and-sum beamforming and pulse compression. The result is filtered by a 2D constant false alarm rate (CFAR) detector to identify reflection patterns corresponding to potential targets. Closely spaced signals from multiple probe transmissions are combined into blobs to avoid multiple detections of a single object. A track-before-detect method using a Nearly Constant Velocity (NCV) model is employed to track multiple objects. The position and velocity are estimated by the debiased converted-measurement Kalman filter. Results are analyzed for simulated scenarios and for experiments at sea, where GPS-tagged gilt-head seabream fish were tracked. Compared to two benchmark schemes, the results show favorable track continuity and accuracy that are robust to the choice of detection threshold.
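The 2D CFAR stage described above can be sketched with a standard cell-averaging CFAR: each cell is compared against a scaled estimate of the local noise level taken from surrounding training cells, with guard cells excluded. Window sizes, the scale factor, and the synthetic power map below are illustrative, not the paper's tuning.

```python
import numpy as np

# Minimal 2D cell-averaging CFAR over a time-space power map.
def ca_cfar_2d(power, guard=1, train=3, scale=5.0):
    """Flag cells whose power exceeds scale * local noise estimate."""
    det = np.zeros_like(power, dtype=bool)
    h, w = power.shape
    r = guard + train
    for i in range(r, h - r):
        for j in range(r, w - r):
            window = power[i - r:i + r + 1, j - r:j + r + 1].copy()
            # Blank out the cell under test and its guard ring
            window[train:train + 2 * guard + 1,
                   train:train + 2 * guard + 1] = np.nan
            noise = np.nanmean(window)          # local noise level estimate
            det[i, j] = power[i, j] > scale * noise
    return det

# Flat exponential noise floor with one strong reflection at (10, 12)
rng = np.random.default_rng(2)
power = rng.exponential(1.0, (20, 24))
power[10, 12] += 50.0
hits = ca_cfar_2d(power)
```

Because the threshold adapts to the local noise estimate, the false alarm rate stays roughly constant even if the reverberation level varies across the map, which is the point of CFAR over a fixed global threshold.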
https://arxiv.org/abs/2404.10316
Iterative learning control (ILC) is a method for reducing system tracking or estimation errors over multiple iterations by using information from past iterations. The disturbance observer (DOB) estimates and mitigates disturbances within the system while they are acting on it. ILC enhances system performance by introducing a feedforward signal in each iteration. However, its effectiveness may diminish if conditions change between iterations. On the other hand, although the DOB effectively mitigates the effects of new disturbances, it cannot entirely eliminate them, as it operates reactively. Therefore, neither ILC nor the DOB alone can ensure sufficient robustness in challenging scenarios. This study focuses on the simultaneous utilization of ILC and the DOB to enhance system robustness. The proposed methodology specifically targets dynamically different linearized systems performing repetitive tasks. The systems share similar forms but differ in dynamics (e.g., sizes, masses, and controllers). Consequently, the design of learning filters must account for these differences in dynamics. To validate the approach, the study establishes a theoretical framework for designing learning filters in conjunction with the DOB. The validity of the framework is then confirmed through numerical studies and experimental tests conducted on unmanned aerial vehicles (UAVs). Although UAVs are nonlinear systems, the study employs a linearized controller, as they operate in proximity to the hover condition. A video introduction of this paper is available via this link: this https URL.
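The core ILC idea described above, correcting the feedforward input each trial with the previous trial's tracking error, can be shown with a textbook P-type update on a simple first-order discrete plant. The plant, gains, and reference are illustrative stand-ins, not the paper's UAV model or learning-filter design.

```python
import numpy as np

# P-type ILC sketch: u_{j+1}[k] = u_j[k] + L * e_j[k] on a repetitive task.
N = 50
ref = np.sin(np.linspace(0, 2 * np.pi, N))   # repetitive reference trajectory

def run_trial(u):
    """Simulate the toy plant x[k+1] = 0.3*x[k] + u[k], output y[k] = 0.5*x[k+1]."""
    y = np.zeros(N)
    x = 0.0
    for k in range(N):
        x = 0.3 * x + u[k]
        y[k] = 0.5 * x
    return y

u = np.zeros(N)
L_gain = 1.0                                  # learning gain
err_history = []
for _ in range(30):                           # 30 trials of the same task
    y = run_trial(u)
    e = ref - y
    err_history.append(np.max(np.abs(e)))
    u = u + L_gain * e                        # correct feedforward with last error
```

Because the error map between trials is a contraction here, the peak tracking error shrinks every trial, which is exactly the repetitive-task benefit ILC provides before a DOB is needed for non-repeating disturbances.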
https://arxiv.org/abs/2404.10231
This work proposes a novel learning framework for visual hand dynamics analysis that takes into account the physiological aspects of hand motion. Existing models, which are simplified joint-actuated systems, often produce unnatural motions. To address this, we integrate a musculoskeletal system with a learnable parametric hand model, MANO, to create a new model, MS-MANO. This model emulates the dynamics of muscles and tendons to drive the skeletal system, imposing physiologically realistic constraints on the resulting torque trajectories. We further propose a simulation-in-the-loop pose refinement framework, BioPR, that refines the initial estimated pose through a multi-layer perceptron (MLP) network. Our evaluation of the accuracy of MS-MANO and the efficacy of BioPR is conducted in two separate parts. The accuracy of MS-MANO is compared with MyoSuite, while the efficacy of BioPR is benchmarked against two large-scale public datasets and two recent state-of-the-art methods. The results demonstrate that our approach consistently improves over the baseline methods both quantitatively and qualitatively.
https://arxiv.org/abs/2404.10227
Embodied visual tracking is the task of following a target object in dynamic 3D environments using an agent's egocentric vision. This is a vital and challenging skill for embodied agents. However, existing methods suffer from inefficient training and poor generalization. In this paper, we propose a novel framework that combines visual foundation models (VFM) and offline reinforcement learning (offline RL) to empower embodied visual tracking. We use a pre-trained VFM, such as "Tracking Anything", to extract semantic segmentation masks with text prompts. We then train a recurrent policy network with offline RL, e.g., Conservative Q-Learning, to learn from the collected demonstrations without online agent-environment interactions. To further improve the robustness and generalization of the policy network, we also introduce a mask re-targeting mechanism and a multi-level data collection strategy. In this way, we can train a robust tracker within an hour on a consumer-level GPU, e.g., an Nvidia RTX 3090. Such efficiency is unprecedented for RL-based visual tracking methods. We evaluate our tracker in several high-fidelity environments with challenging situations, such as distraction and occlusion. The results show that our agent outperforms state-of-the-art methods in terms of sample efficiency, robustness to distractors, and generalization to unseen scenarios and targets. We also demonstrate the transferability of the learned tracker from the virtual world to real-world scenarios.
https://arxiv.org/abs/2404.09857
When working with 3D facial data, improving fidelity and avoiding the uncanny valley effect depends critically on accurate 3D facial performance capture. Because such capture methods are expensive, and because of the widespread availability of 2D videos, recent methods have focused on monocular 3D face tracking. However, these methods often fall short in capturing precise facial movements due to limitations in their network architecture, training, and evaluation processes. Addressing these challenges, we propose a novel face tracker, FlowFace, that introduces an innovative 2D alignment network for dense per-vertex alignment. Unlike prior work, FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data. Our 3D model fitting module jointly fits a 3D face model from one or many observations, integrating existing neutral shape priors for enhanced identity and expression disentanglement and per-vertex deformations for detailed facial feature reconstruction. Additionally, we propose a novel metric and benchmark for assessing tracking accuracy. Our method exhibits superior performance on both custom and publicly available benchmarks. We further validate the effectiveness of our tracker by generating high-quality 3D data from 2D videos, which leads to performance gains on downstream tasks.
https://arxiv.org/abs/2404.09819