There is a large population of wheelchair users, and most of them need help with daily tasks. However, according to recent reports, their needs are not properly met due to a shortage of caregivers. In this project, we therefore develop WeHelp, a shared autonomy system aimed at wheelchair users. A robot running WeHelp has three modes: following mode, remote control mode, and teleoperation mode. In following mode, the robot automatically follows the wheelchair user via visual tracking; the user can ask the robot to follow from behind, on the left, or on the right. When the user asks for help, the robot recognizes the command via speech recognition and switches to teleoperation mode or remote control mode. In teleoperation mode, the wheelchair user takes over the robot with a joystick and controls it to complete complex tasks, such as opening doors, moving obstacles out of the way, or reaching objects on a high shelf or on the ground. In remote control mode, a remote assistant takes over the robot and helps the wheelchair user complete such tasks. Our evaluation shows that the pipeline is useful and practical for wheelchair users. Source code and a demo of the paper are available at \url{this https URL}.
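To make the mode-switching behavior concrete, here is a minimal Python sketch of the three-mode logic described above, assuming speech has already been transcribed to text; the class and keyword matching are hypothetical stand-ins for the actual WeHelp implementation.

```python
from enum import Enum, auto

class Mode(Enum):
    FOLLOWING = auto()       # visual tracking of the wheelchair user
    TELEOPERATION = auto()   # user drives the robot with a joystick
    REMOTE_CONTROL = auto()  # a remote assistant takes over

class WeHelpController:
    """Hypothetical sketch of the mode-switching logic described in the abstract."""

    def __init__(self):
        self.mode = Mode.FOLLOWING
        self.follow_side = "behind"  # "behind", "left", or "right"

    def on_speech_command(self, command: str) -> None:
        # The real system runs a speech-recognition model; here we simply
        # match keywords in the transcribed command string.
        command = command.lower()
        if "follow" in command:
            self.mode = Mode.FOLLOWING
            for side in ("left", "right", "behind"):
                if side in command:
                    self.follow_side = side
        elif "help" in command and "remote" in command:
            self.mode = Mode.REMOTE_CONTROL
        elif "help" in command:
            self.mode = Mode.TELEOPERATION

ctrl = WeHelpController()
ctrl.on_speech_command("please follow me on the left")
ctrl.on_speech_command("I need help opening the door")
print(ctrl.mode)  # Mode.TELEOPERATION
```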
https://arxiv.org/abs/2409.12159
Representing the 3D environment with instance-aware semantic and geometric information is crucial for interaction-aware robots in dynamic environments. Nonetheless, creating such a representation poses challenges due to sensor noise, instance segmentation and tracking errors, and the objects' dynamic motion. This paper introduces a novel particle-based instance-aware semantic occupancy map to tackle these challenges. Particles with an augmented instance state are used to estimate the Probability Hypothesis Density (PHD) of the objects and implicitly model the environment. Utilizing a State-augmented Sequential Monte Carlo PHD (S$^2$MC-PHD) filter, these particles are updated to jointly estimate occupancy status, semantic, and instance IDs, mitigating noise. Additionally, a memory module is adopted to enhance the map's responsiveness to previously observed objects. Experimental results on the Virtual KITTI 2 dataset demonstrate that the proposed approach surpasses state-of-the-art methods across multiple metrics under different noise conditions. Subsequent tests using real-world data further validate the effectiveness of the proposed approach.
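As a rough illustration of the particle representation, the sketch below maintains particles with an augmented instance state (position, velocity, PHD weight, semantic class, instance ID) and reweights them against a single detection. It is a simplified 2D toy under assumed likelihood terms, not the paper's full S$^2$MC-PHD filter, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each particle carries an augmented instance state: 2D position + velocity,
# a PHD weight, a semantic class, and an instance ID.
N = 1000
pos = rng.uniform(0, 10, (N, 2))
vel = rng.normal(0, 0.5, (N, 2))
weight = np.full(N, 0.1)     # PHD weights: their sum approximates the expected object count
sem = rng.integers(0, 3, N)  # semantic class per particle
inst = rng.integers(0, 5, N) # instance ID per particle

def predict(dt=0.1, survival=0.99):
    global pos, weight
    pos += vel * dt + rng.normal(0, 0.05, pos.shape)  # dynamic objects move
    weight *= survival

def update(z, z_sem, z_inst, sigma=0.5, detect_p=0.9):
    """Reweight particles by one detection (position + semantics + instance)."""
    global weight
    d2 = np.sum((pos - z) ** 2, axis=1)
    lik = np.exp(-0.5 * d2 / sigma**2)
    lik *= np.where(sem == z_sem, 1.0, 0.1)    # semantic agreement term (assumed)
    lik *= np.where(inst == z_inst, 1.0, 0.2)  # instance agreement term (assumed)
    weight = (1 - detect_p) * weight + detect_p * weight * lik / (weight @ lik + 1e-9)

predict()
update(z=np.array([5.0, 5.0]), z_sem=1, z_inst=2)
print("expected object mass:", weight.sum())
```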
https://arxiv.org/abs/2409.11975
Tracking any point based on image frames is constrained by frame rate, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information of image frames with the high temporal resolution of events, achieving high-frame-rate and robust point tracking under various challenging conditions. Specifically, we design an Evolution Fusion module (EvoFusion) to model the image generation process guided by events; this module effectively integrates valuable information from the two modalities, which operate at different frequencies. To achieve smoother point trajectories, we employ a transformer-based refinement strategy that iteratively updates the points' trajectories and features. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, in particular improving expected feature age by 24$\%$ on the EDS dataset. Finally, we qualitatively validate the robustness of our algorithm in real driving scenarios using our custom-designed high-resolution image-event synchronization device. Our source code will be released at this https URL.
https://arxiv.org/abs/2409.11953
The problem of safety for robotic systems has been extensively studied. However, little attention has been given to security issues for three-dimensional systems such as quadrotors. Malicious adversaries can compromise robot sensors and communication networks, causing incidents, achieving illegal objectives, or even injuring people. This study first designs an intelligent control system for autonomous quadrotors. It then investigates the problems of optimal false data injection attack scheduling and countermeasure design for unmanned aerial vehicles. Using a state-of-the-art deep learning-based approach, an optimal false data injection attack scheme is proposed to degrade a quadrotor's tracking performance with limited attack energy. Subsequently, an optimal tracking control strategy is learned to mitigate attacks and recover the quadrotor's tracking performance. We base our work on Agilicious, a state-of-the-art quadrotor recently deployed for autonomous settings; this work is the first in the United Kingdom to deploy this quadrotor and implement reinforcement learning on its platform. Therefore, to promote easy reproducibility with minimal engineering overhead, we further provide (1) a comprehensive breakdown of this quadrotor, including software stacks and hardware alternatives; (2) a detailed reinforcement-learning framework to train autonomous controllers on Agilicious agents; and (3) a new open-source environment that builds upon PyFlyt for future reinforcement learning research on Agilicious platforms. Both simulated and real-world experiments are conducted in Section 5.2 to show the effectiveness of the proposed frameworks.
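For intuition, the following toy sketch shows how a norm-bounded false data injection attack biases a tracking loop: a proportional altitude controller acts on a corrupted sensor reading, so the quadrotor settles away from its reference. The dynamics, gains, and energy budget are illustrative assumptions, not the paper's learned attack scheme.

```python
import numpy as np

def inject_false_data(measurement, attack, energy_budget):
    """Hypothetical FDI attack: add a perturbation to the sensor reading,
    scaled so its norm stays within the attacker's energy budget."""
    norm = np.linalg.norm(attack)
    if norm > energy_budget:
        attack = attack * (energy_budget / norm)
    return measurement + attack

# Toy quadrotor altitude loop: a P-controller tracks a 1 m reference while
# an attacker biases the altitude sensor (numbers are illustrative only).
z, z_ref, kp, dt = 0.0, 1.0, 2.0, 0.05
for _ in range(100):
    z_meas = inject_false_data(z, attack=0.3, energy_budget=0.25)
    u = kp * (z_ref - z_meas)  # controller acts on the corrupted reading
    z += u * dt
print(f"steady-state altitude under attack: {z:.2f} m (reference 1.00 m)")
```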
https://arxiv.org/abs/2409.11897
Deep trackers have proven successful in visual tracking. Typically, these trackers employ optimally pre-trained deep networks to represent all diverse objects with multi-channel features from some fixed layers. The deep networks employed are usually trained to extract rich knowledge from the massive data used in object classification, so they are capable of representing generic objects very well. However, these networks are too complex to represent a specific moving object, leading to poor generalization as well as high computational and memory costs. This paper presents a novel and general framework termed channel distillation to facilitate deep trackers. To validate the effectiveness of channel distillation, we take the discriminative correlation filter (DCF) and ECO as examples. We demonstrate that an integrated formulation can turn feature compression, response map generation, and model update into a unified energy minimization problem, adaptively selecting informative feature channels that improve the efficacy of tracking moving objects on the fly. Channel distillation accurately extracts good channels, alleviating the influence of noisy channels and generally reducing the number of channels, while adaptively generalizing to different channels and networks. The resulting deep tracker is accurate, fast, and has low memory requirements. Extensive experimental evaluations on popular benchmarks clearly demonstrate the effectiveness and generalizability of our framework.
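A hedged sketch of the channel-selection idea follows: each feature channel is scored by how strongly it aligns with a desired response map, and only the top fraction is kept. The paper instead solves a unified energy minimization jointly with response generation and model update; this greedy scoring is only meant to convey the intuition.

```python
import numpy as np

def distill_channels(feature_map, response, keep_ratio=0.25):
    """Toy channel selection in the spirit of channel distillation: score each
    channel by its alignment with the desired response map and keep the top
    fraction (an illustrative simplification, not the paper's formulation)."""
    C = feature_map.shape[0]
    scores = np.array([np.abs(np.vdot(feature_map[c], response)) for c in range(C)])
    keep = np.argsort(scores)[::-1][: max(1, int(keep_ratio * C))]
    return np.sort(keep)

C, H, W = 64, 32, 32
features = np.random.randn(C, H, W)      # multi-channel deep features
target_response = np.random.randn(H, W)  # desired correlation response
selected = distill_channels(features, target_response)
print(f"kept {len(selected)}/{C} channels:", selected[:8], "...")
```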
https://arxiv.org/abs/2409.11785
3D Multi-Object Tracking (MOT) obtains significant performance improvements with the rapid advancements in 3D object detection, particularly in cost-effective multi-camera setups. However, the prevalent end-to-end training approach for multi-camera trackers results in detector-specific models, limiting their versatility. Moreover, current generic trackers overlook the unique features of multi-camera detectors, i.e., the unreliability of motion observations and the feasibility of visual information. To address these challenges, we propose RockTrack, a 3D MOT method for multi-camera detectors. Following the Tracking-By-Detection framework, RockTrack is compatible with various off-the-shelf detectors. RockTrack incorporates a confidence-guided preprocessing module to extract reliable motion and image observations from distinct representation spaces from a single detector. These observations are then fused in an association module that leverages geometric and appearance cues to minimize mismatches. The resulting matches are propagated through a staged estimation process, forming the basis for heuristic noise modeling. Additionally, we introduce a novel appearance similarity metric for explicitly characterizing object affinities in multi-camera settings. RockTrack achieves state-of-the-art performance on the nuScenes vision-only tracking leaderboard with 59.1% AMOTA while demonstrating impressive computational efficiency.
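The cue-fusion step can be pictured as a single assignment problem over a blended cost matrix, as in the hypothetical sketch below (RockTrack's actual association module and appearance metric are more elaborate):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_pos, det_pos, track_app, det_app, w_geo=0.5, gate=4.0):
    """Toy fusion of geometric and appearance cues into one assignment problem;
    weights and gating are illustrative assumptions."""
    geo = np.linalg.norm(track_pos[:, None] - det_pos[None], axis=-1)
    app = 1.0 - (track_app @ det_app.T)  # cosine distance, unit embeddings assumed
    cost = w_geo * geo + (1 - w_geo) * app
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if geo[r, c] < gate]

tracks = np.array([[0.0, 0.0], [5.0, 5.0]])
dets = np.array([[0.3, -0.2], [5.2, 4.9]])
t_app = np.eye(2)  # stand-in appearance embeddings
d_app = np.eye(2)
print(associate(tracks, dets, t_app, d_app))  # [(0, 0), (1, 1)]
```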
https://arxiv.org/abs/2409.11749
Autism Spectrum Disorder (ASD) significantly affects the social and communication abilities of children, and eye-tracking is commonly used as a diagnostic tool by identifying associated atypical gaze patterns. Traditional methods demand manual identification of Areas of Interest in gaze patterns, lowering the performance of gaze behavior analysis in ASD subjects. To tackle this limitation, we propose a novel method to automatically analyze gaze behaviors in ASD children with superior accuracy. To be specific, we first apply and optimize seven clustering algorithms to automatically group gaze points to compare ASD subjects with typically developing peers. Subsequently, we extract 63 significant features to fully describe the patterns. These features can describe correlations between ASD diagnosis and gaze patterns. Lastly, using these features as prior knowledge, we train multiple predictive machine learning models to predict and diagnose ASD based on their gaze behaviors. To evaluate our method, we apply our method to three ASD datasets. The experimental and visualization results demonstrate the improvements of clustering algorithms in the analysis of unique gaze patterns in ASD children. Additionally, these predictive machine learning models achieved state-of-the-art prediction performance ($81\%$ AUC) in the field of automatically constructed gaze point features for ASD diagnosis. Our code is available at \url{this https URL}.
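A minimal sketch of the overall pipeline follows, assuming synthetic gaze data and hypothetical feature choices (the paper optimizes seven clustering algorithms and extracts 63 features):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

def gaze_features(points, k=5, seed=0):
    """Toy pipeline stage: cluster one recording's gaze points, then summarize
    the clusters into a fixed-length feature vector."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points)
    d = np.linalg.norm(points - km.cluster_centers_[km.labels_], axis=1)
    counts = np.bincount(km.labels_, minlength=k) / len(points)
    return np.concatenate([km.cluster_centers_.ravel(), counts, [d.mean(), d.std()]])

rng = np.random.default_rng(0)
# Synthetic stand-in data: 40 recordings of 200 gaze points each, half labeled ASD.
X = np.array([gaze_features(rng.normal(scale=1 + y, size=(200, 2)))
              for y in (0, 1) for _ in range(20)])
y = np.repeat([0, 1], 20)
clf = RandomForestClassifier(random_state=0).fit(X, y)
print("training accuracy on synthetic data:", clf.score(X, y))
```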
https://arxiv.org/abs/2409.11744
A major limitation of minimally invasive surgery is the difficulty in accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the monocular SLAM. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking purposes, and the geometric prior of the 3D shape is incorporated as an additional constraint in the pose graph. Experiments from in-vivo and ex-vivo tests demonstrate that the proposed 3D tracking system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and "organ-background" relative motion.
https://arxiv.org/abs/2409.11688
Underwater object-level mapping requires incorporating visual foundation models to handle the uncommon and often previously unseen object classes encountered in marine scenarios. In this work, a metric of semantic uncertainty for open-set object detections produced by visual foundation models is calculated and then incorporated into an object-level uncertainty tracking framework. Object-level uncertainties and geometric relationships between objects are used to enable robust object-level loop closure detection for unknown object classes. This loop closure detection problem is formulated as a graph-matching problem. While graph matching in general is NP-complete, a solver for an equivalent formulation of the proposed graph matching problem as a graph editing problem is tested on multiple challenging underwater scenes. Results for this solver, as well as three other solvers, demonstrate that the proposed methods are feasible for real-time use in marine environments for robust, open-set, multi-object, semantic-uncertainty-aware loop closure detection. Further experimental results on the KITTI dataset demonstrate that the method generalizes to large-scale terrestrial scenes.
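On tiny toy submaps, the graph-editing formulation can be illustrated with an off-the-shelf edit-distance solver; the node and edge costs below (class mismatch plus semantic-uncertainty and distance differences) are assumptions for illustration, not the paper's exact objective:

```python
import networkx as nx

def submap(objs, edges):
    """Object-level submap: nodes carry an open-set class and a semantic
    uncertainty; edges carry inter-object distances."""
    g = nx.Graph()
    for i, (cls, unc) in enumerate(objs):
        g.add_node(i, cls=cls, unc=unc)
    for i, j, d in edges:
        g.add_edge(i, j, dist=d)
    return g

g1 = submap([("buoy", 0.2), ("rock", 0.5), ("anchor", 0.3)],
            [(0, 1, 4.0), (1, 2, 2.5)])
g2 = submap([("buoy", 0.25), ("rock", 0.6), ("anchor", 0.35)],
            [(0, 1, 4.1), (1, 2, 2.4)])

# networkx's exponential-time edit distance only works here because the
# graphs are tiny; the paper uses real-time solvers for the same objective.
cost = nx.graph_edit_distance(
    g1, g2,
    node_subst_cost=lambda a, b: (a["cls"] != b["cls"]) + abs(a["unc"] - b["unc"]),
    edge_subst_cost=lambda a, b: abs(a["dist"] - b["dist"]))
print("edit cost (low => likely loop closure):", round(cost, 2))
```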
https://arxiv.org/abs/2409.11555
This paper aims to increase the safety and reliability of executing trajectories planned for robots with non-trivial dynamics, given a lightweight, approximate dynamics model. Scenarios include mobile robots navigating through workspaces with imperfectly modeled surfaces and unknown friction. The proposed approach, Kinodynamic Replanning over Approximate Models with Feedback Tracking (KRAFT), integrates: (i) replanning via an asymptotically optimal sampling-based kinodynamic tree planner, with (ii) trajectory following via feedback control, and (iii) a safety mechanism to reduce collisions due to second-order dynamics. The planning and control components use a rough dynamics model expressed analytically via differential equations, which is tuned via system identification (SysID) in a training environment but not the deployed one. This allows the process to be fast and achieve long-horizon reasoning during each replanning cycle. At the same time, the model still includes gaps with reality, even after SysID, in new environments. Experiments demonstrate the limitations of kinematic path planning and path tracking approaches, highlighting the importance of: (a) closing the feedback loop also at the planning level; and (b) long-horizon reasoning, for safe and efficient trajectory execution given inaccurate models.
https://arxiv.org/abs/2409.11522
An inherent fragility of quadrotor systems stems from model inaccuracies and external disturbances. These factors hinder performance and compromise the stability of the system, making precise control challenging. Existing model-based approaches either make deterministic assumptions, utilize Gaussian-based representations of uncertainty, or rely on nominal models, all of which often fall short in capturing the complex, multimodal nature of real-world dynamics. This work introduces DroneDiffusion, a novel framework that leverages conditional diffusion models to learn quadrotor dynamics, formulated as a sequence generation task. DroneDiffusion achieves superior generalization to unseen, complex scenarios by capturing the temporal nature of uncertainties and mitigating error propagation. We integrate the learned dynamics with an adaptive controller for trajectory tracking with stability guarantees. Extensive experiments in both simulation and real-world flights demonstrate the robustness of the framework across a range of scenarios, including unfamiliar flight paths and varying payloads, velocities, and wind disturbances.
https://arxiv.org/abs/2409.11292
Tracking controllers enable robotic systems to accurately follow planned reference trajectories. In particular, reinforcement learning (RL) has shown promise in the synthesis of controllers for systems with complex dynamics and modest online compute budgets. However, the poor sample efficiency of RL and the challenges of reward design make training slow and sometimes unstable, especially for high-dimensional systems. In this work, we leverage the inherent Lie group symmetries of robotic systems with a floating base to mitigate these challenges when learning tracking controllers. We model a general tracking problem as a Markov decision process (MDP) that captures the evolution of both the physical and reference states. Next, we prove that symmetry in the underlying dynamics and running costs leads to an MDP homomorphism, a mapping that allows a policy trained on a lower-dimensional "quotient" MDP to be lifted to an optimal tracking controller for the original system. We compare this symmetry-informed approach to an unstructured baseline, using Proximal Policy Optimization (PPO) to learn tracking controllers for three systems: the Particle (a forced point mass), the Astrobee (a fully actuated space robot), and the Quadrotor (an underactuated system). Results show that a symmetry-aware approach both accelerates training and reduces tracking error after the same number of training steps.
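The quotient idea can be made concrete with a small worked example: expressing position and velocity errors in the body frame removes the translation and rotation symmetries, so two physically equivalent tracking situations map to the same reduced state. The maps below are a hypothetical planar illustration, not the paper's exact homomorphism.

```python
import numpy as np

def quotient_state(p, v, p_ref, v_ref, theta):
    """Reduce a planar tracking state by its symmetry group: translations are
    removed by taking errors relative to the reference, and rotations by
    mapping those errors into the body frame with heading theta."""
    R = np.array([[np.cos(theta), np.sin(theta)],
                  [-np.sin(theta), np.cos(theta)]])  # world -> body rotation
    return np.concatenate([R @ (p - p_ref), R @ (v - v_ref)])

# Two physically equivalent situations (shifted by 10 m and rotated by 90 deg)
s1 = quotient_state(np.array([1.0, 0.0]), np.array([0.5, 0.0]),
                    np.array([0.0, 0.0]), np.array([0.0, 0.0]), theta=0.0)
s2 = quotient_state(np.array([10.0, 11.0]), np.array([0.0, 0.5]),
                    np.array([10.0, 10.0]), np.array([0.0, 0.0]), theta=np.pi / 2)
print(np.allclose(s1, s2))  # True: one policy input covers both cases
```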
https://arxiv.org/abs/2409.11238
Open-vocabulary Multiple Object Tracking (MOT) aims to generalize trackers to novel categories not in the training set. Currently, the best-performing methods are mainly based on pure appearance matching. Due to the complexity of motion patterns in the large-vocabulary scenarios and unstable classification of the novel objects, the motion and semantics cues are either ignored or applied based on heuristics in the final matching steps by existing methods. In this paper, we present a unified framework SLAck that jointly considers semantics, location, and appearance priors in the early steps of association and learns how to integrate all valuable information through a lightweight spatial and temporal object graph. Our method eliminates complex post-processing heuristics for fusing different cues and boosts the association performance significantly for large-scale open-vocabulary tracking. Without bells and whistles, we outperform previous state-of-the-art methods for novel classes tracking on the open-vocabulary MOT and TAO TETA benchmarks. Our code is available at \href{this https URL}{this http URL}.
https://arxiv.org/abs/2409.11235
Multiple object tracking (MOT) in Unmanned Aerial Vehicle (UAV) videos is important for diverse applications in computer vision. Current MOT trackers rely on accurate object detection results and precise matching of target re-identification (ReID). These methods focus on optimizing target spatial attributes while overlooking temporal cues in modelling object relationships, especially under challenging tracking conditions such as object deformation and blurring. To address the above-mentioned issues, we propose a novel Spatio-Temporal Cohesion Multiple Object Tracking framework (STCMOT), which utilizes historical embedding features to model the representation of ReID and detection features in sequential order. Concretely, a temporal embedding boosting module is introduced to enhance the discriminability of individual embeddings based on adjacent-frame cooperation. The trajectory embedding is then propagated by a temporal detection refinement module to mine salient target locations in the temporal field. Extensive experiments on the VisDrone2019 and UAVDT datasets demonstrate that our STCMOT sets a new state-of-the-art performance in MOTA and IDF1 metrics. The source codes are released at this https URL.
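As a rough illustration of the temporal cooperation idea, the sketch below smooths a track's ReID embedding across adjacent frames with an exponential moving average; STCMOT's actual boosting module is learned, so this is only a stand-in.

```python
import numpy as np

def boost_embedding(history, current, alpha=0.8):
    """Toy temporal embedding boosting: blend the current frame's ReID
    embedding with the track's history so adjacent frames cooperate
    (an assumed EMA, not the learned module)."""
    e = alpha * history + (1 - alpha) * current
    return e / (np.linalg.norm(e) + 1e-12)  # keep unit norm for cosine matching

track_emb = np.random.randn(128)
track_emb /= np.linalg.norm(track_emb)
for _ in range(5):  # five frames; detections are noisy
    det_emb = track_emb + 0.3 * np.random.randn(128)
    det_emb /= np.linalg.norm(det_emb)
    track_emb = boost_embedding(track_emb, det_emb)
print("embedding norm:", np.linalg.norm(track_emb))
```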
https://arxiv.org/abs/2409.11234
Multi-robot collaboration for target tracking presents significant challenges in hazardous environments, including robot failures, dynamic priority changes, and other unpredictable factors. Moreover, these challenges are exacerbated in adversarial settings when the environment is unknown. In this paper, we propose a resilient and adaptive framework for multi-robot, multi-target tracking in environments with unknown sensing and communication danger zones. The damage posed by these zones is temporary, allowing robots to track targets while accepting the risk of entering dangerous areas. We formulate the problem as an optimization with soft chance constraints, enabling real-time adjustments to robot behavior based on varying types of dangers and failures. An adaptive replanning strategy is introduced, featuring different triggers to improve group performance. This approach allows for dynamic prioritization of target tracking and risk aversion or resilience, depending on evolving resources and real-time conditions. To validate the effectiveness of the proposed method, we benchmark and evaluate it across multiple scenarios in simulation and conduct several real-world experiments.
https://arxiv.org/abs/2409.11230
3D Gaussian Splatting (3DGS) has gained significant attention for its application in dense Simultaneous Localization and Mapping (SLAM), enabling real-time rendering and high-fidelity mapping. However, existing 3DGS-based SLAM methods often suffer from accumulated tracking errors and map drift, particularly in large-scale environments. To address these issues, we introduce GLC-SLAM, a Gaussian Splatting SLAM system that integrates global optimization of camera poses and scene models. Our approach employs frame-to-model tracking and triggers hierarchical loop closure using a global-to-local strategy to minimize drift accumulation. By dividing the scene into 3D Gaussian submaps, we facilitate efficient map updates following loop corrections in large scenes. Additionally, our uncertainty-minimized keyframe selection strategy prioritizes keyframes observing more valuable 3D Gaussians to enhance submap optimization. Experimental results on various datasets demonstrate that GLC-SLAM achieves superior or competitive tracking and mapping performance compared to state-of-the-art dense RGB-D SLAM systems.
https://arxiv.org/abs/2409.10982
We focus on agile, continuous, and terrain-adaptive jumping of quadrupedal robots in discontinuous terrains such as stairs and stepping stones. Unlike single-step jumping, continuous jumping requires accurately executing highly dynamic motions over long horizons, which is challenging for existing approaches. To accomplish this task, we design a hierarchical learning and control framework, which consists of a learned heightmap predictor for robust terrain perception, a reinforcement-learning-based centroidal-level motion policy for versatile and terrain-adaptive planning, and a low-level model-based leg controller for accurate motion tracking. In addition, we minimize the sim-to-real gap by accurately modeling the hardware characteristics. Our framework enables a Unitree Go1 robot to perform agile and continuous jumps on human-sized stairs and sparse stepping stones, for the first time to the best of our knowledge. In particular, the robot can cross two stair steps in each jump and completes a 3.5m long, 2.8m high, 14-step staircase in 4.5 seconds. Moreover, the same policy outperforms baselines in various other parkour tasks, such as jumping over single horizontal or vertical discontinuities. Experiment videos can be found at \url{this https URL\_cod/}.
https://arxiv.org/abs/2409.10923
Frequency-modulated continuous-wave (FMCW) scanning radar has emerged as an alternative to spinning LiDAR for state estimation on mobile robots. Radar's longer wavelength is less affected by small particulates, providing operational advantages in challenging environments such as dust, smoke, and fog. This paper presents Radar Teach and Repeat (RT&R): a full-stack radar system for long-term off-road robot autonomy. RT&R can drive routes reliably in off-road cluttered areas without any GPS. We benchmark the radar system's closed-loop path-tracking performance and compare it to its 3D LiDAR counterpart. 11.8 km of autonomous driving was completed without interventions using only radar and gyro for navigation. RT&R was evaluated on different routes with progressively less structured scene geometry. RT&R achieved lateral path-tracking root mean squared errors (RMSE) of 5.6 cm, 7.5 cm, and 12.1 cm as the routes became more challenging. On the robot we used for testing, these RMSE values are less than half of the width of one tire (24 cm). These same routes have worst-case errors of 21.7 cm, 24.0 cm, and 43.8 cm. We conclude that radar is a viable alternative to LiDAR for long-term autonomy in challenging off-road scenarios. The implementation of RT&R is open-source and available at: this https URL.
https://arxiv.org/abs/2409.10491
In the rapidly evolving field of vision-language navigation (VLN), ensuring robust safety mechanisms remains an open challenge. Control barrier functions (CBFs) are efficient tools that guarantee safety by solving an optimal control problem. In this work, we consider the case of a teleoperated drone in a VLN setting and add safety features by formulating a novel scene-aware CBF using ego-centric observations obtained through an RGB-D sensor. As a baseline, we implement a vision-language understanding module which uses the contrastive language-image pretraining (CLIP) model to query about a user-specified (in natural language) landmark. Using the YOLO (You Only Look Once) object detector, the CLIP model is queried to verify the cropped landmark, triggering downstream navigation. To improve the navigation safety of the baseline, we propose ASMA, an Adaptive Safety Margin Algorithm, which crops the drone's depth map to track moving objects and performs scene-aware CBF evaluation on the fly. By identifying potentially risky observations in the scene, ASMA enables real-time adaptation to unpredictable environmental conditions, ensuring optimal safety bounds on the actions of a VLN-powered drone. Using the Robot Operating System (ROS) middleware on a Parrot Bebop 2 quadrotor in the Gazebo environment, ASMA offers a 59.4%-61.8% increase in success rate, with insignificant 5.4%-8.2% increases in trajectory length compared to the baseline CBF-less VLN, while recovering from unsafe situations.
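For a single constraint, the CBF safety filter reduces to a closed-form projection of the nominal command, as in this minimal sketch (the toy barrier, dynamics, and gains are assumptions; ASMA builds its scene-aware barrier from RGB-D observations on the fly):

```python
import numpy as np

def cbf_filter(u_nom, h, dh, f, g, alpha=1.0):
    """Minimal CBF safety filter for control-affine dynamics x' = f + g u:
    solve  min ||u - u_nom||^2  s.t.  dh.(f + g u) >= -alpha * h.
    With one constraint the QP has the closed form below."""
    a = dh @ g                   # constraint row:  a . u >= b
    b = -alpha * h - dh @ f
    slack = a @ u_nom - b
    if slack >= 0:
        return u_nom             # nominal command is already safe
    return u_nom + a * (-slack) / (a @ a + 1e-12)

# Toy single-integrator drone: keep distance to the obstacle at p_obs above d_min.
p, p_obs, d_min = np.array([0.0, 0.0]), np.array([2.0, 0.0]), 1.0
h = np.sum((p - p_obs) ** 2) - d_min**2  # barrier: h(x) >= 0 means safe
dh = 2 * (p - p_obs)                     # gradient of h
f, g = np.zeros(2), np.eye(2)            # x' = u
u_nom = np.array([1.0, 0.0])             # VLN policy flies straight at the obstacle
print("safe command:", cbf_filter(u_nom, h, dh, f, g))  # slowed approach
```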
https://arxiv.org/abs/2409.10283
Sprinting is a determinant ability, especially in team sports. Sprint kinematics have been studied in the past using various methods developed specifically around human biomechanics; among those methods, markerless systems stand out as very cost-effective. On the other hand, we now have multiple general methods for pixel and body tracking based on recent machine learning breakthroughs, with excellent performance in body tracking, but these excellent trackers do not generally consider realistic human biomechanics. This investigation first adapts two of these general trackers (MoveNet and CoTracker) for realistic biomechanical analysis and then evaluates them against manual tracking (with key points manually marked using the software Kinovea). Our best resulting markerless body tracker, particularly adapted for sprint biomechanics, is termed VideoRun2D. The experimental development and assessment of VideoRun2D is reported on forty sprints recorded with a video camera from 5 different subjects, focusing our analysis on 3 key angles in sprint biomechanics: inclination of the trunk and flexion-extension of the hip and the knee. The CoTracker method showed huge differences compared to the manual labeling approach, whereas the angle curves were correctly estimated by the MoveNet method, with errors between 3.2° and 5.5°. In conclusion, our proposed VideoRun2D, based on a MoveNet core, seems to be a helpful tool for evaluating sprint kinematics in some scenarios. On the other hand, the observed precision of this first version of VideoRun2D as a markerless sprint analysis system may not yet be sufficient for highly demanding applications. Future research lines toward that purpose are also discussed at the end: better tracking post-processing and user- and time-dependent adaptation.
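The three angles can be computed directly from tracked keypoints; the sketch below shows one plausible way to do it, with made-up coordinates standing in for MoveNet-style outputs.

```python
import numpy as np

def angle_deg(a, b, c):
    """Angle at joint b (in degrees) formed by keypoints a-b-c."""
    u, v = a - b, c - b
    cosang = (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def trunk_inclination_deg(hip, shoulder):
    """Trunk lean from vertical, from hip and shoulder keypoints."""
    t = shoulder - hip
    return np.degrees(np.arctan2(abs(t[0]), t[1]))

# Illustrative keypoints (y up here for simplicity); a MoveNet-style
# tracker would supply these per frame.
shoulder, hip, knee, ankle = (np.array([0.35, 1.50]), np.array([0.30, 1.00]),
                              np.array([0.45, 0.55]), np.array([0.40, 0.10]))
print(f"trunk inclination: {trunk_inclination_deg(hip, shoulder):.1f} deg")
print(f"hip angle:  {angle_deg(shoulder, hip, knee):.1f} deg")
print(f"knee angle: {angle_deg(hip, knee, ankle):.1f} deg")
```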
https://arxiv.org/abs/2409.10175