Navigating complex environments requires Unmanned Aerial Vehicles (UAVs) and autonomous systems to perform trajectory tracking and obstacle avoidance in real time. While many control strategies have effectively utilized linear approximations, addressing the non-linear dynamics of UAVs, especially in obstacle-dense environments, remains a key challenge that requires further research. This paper introduces a Non-linear Model Predictive Control (NMPC) framework for the DJI Matrice 100 that addresses these challenges by using a dynamic model and B-spline interpolation for smooth reference trajectories, ensuring minimal deviation while respecting safety constraints. The framework supports various trajectory types and employs a penalty-based cost function for control accuracy in tight maneuvers. It leverages CasADi for efficient real-time optimization, enabling the UAV to maintain robust operation even under tight computational constraints. Simulation and real-world indoor and outdoor experiments demonstrate the NMPC framework's ability to adapt to disturbances, resulting in smooth, collision-free navigation.
https://arxiv.org/abs/2410.02732
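A minimal sketch of the kind of NMPC setup described above, written with CasADi's Opti interface. The 2D point-mass dynamics, horizon length, weights, and obstacle are illustrative stand-ins, not the paper's Matrice 100 model or its B-spline pipeline; the sampled reference here would come from such a spline.

```python
# Hedged NMPC sketch in CasADi: quadratic tracking cost plus a soft obstacle
# penalty over a short horizon, solved with Ipopt. Swap in the full quadrotor
# dynamics and B-spline reference to approach the paper's setup.
import casadi as ca
import numpy as np

N, dt = 20, 0.1                      # horizon length and step size (illustrative)
opti = ca.Opti()

X = opti.variable(4, N + 1)          # state: [px, py, vx, vy]
U = opti.variable(2, N)              # control: [ax, ay]
x0 = opti.parameter(4)               # current state estimate
ref = opti.parameter(2, N + 1)       # sampled reference (e.g. from a B-spline)
obs, r_safe = np.array([1.0, 1.0]), 0.5   # illustrative obstacle and radius

cost = 0
for k in range(N):
    # point-mass dynamics; a full dynamic model would replace this line
    opti.subject_to(X[:, k + 1] == X[:, k] + dt * ca.vertcat(X[2:, k], U[:, k]))
    # tracking error plus control effort
    cost += ca.sumsqr(X[:2, k] - ref[:, k]) + 0.1 * ca.sumsqr(U[:, k])
    # penalty-based safety term: active only when closer than r_safe
    cost += 100 * ca.fmax(0, r_safe**2 - ca.sumsqr(X[:2, k] - obs))
    opti.subject_to(opti.bounded(-3, U[:, k], 3))   # actuator limits

opti.subject_to(X[:, 0] == x0)
opti.minimize(cost)
opti.solver("ipopt", {"ipopt.print_level": 0, "print_time": 0})

opti.set_value(x0, [0, 0, 0, 0])
opti.set_value(ref, np.linspace([0, 0], [2, 2], N + 1).T)
sol = opti.solve()                   # apply first control: sol.value(U[:, 0])
```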
Accurate online multiple-camera vehicle tracking is essential for intelligent transportation systems, autonomous driving, and smart city applications. Like single-camera multiple-object tracking, it is commonly formulated as a graph problem of tracking-by-detection. Within this framework, existing online methods usually consist of two-stage procedures that cluster temporally first, then spatially, or vice versa. This is computationally expensive and prone to error accumulation. We introduce a graph representation that allows spatial-temporal clustering in a single, combined step: new detections are spatially and temporally connected with existing clusters. By keeping sparse appearance and positional cues of all detections in a cluster, our method can compare clusters based on the strongest available evidence. The final tracks are obtained online using a simple multicut assignment procedure. Our method does not require any training on the target scene, pre-extraction of single-camera tracks, or additional annotations. Notably, we outperform the online state-of-the-art in IDF1 by more than 14% on the CityFlow dataset and by more than 25% on the Synthehicle dataset. The code is publicly available.
https://arxiv.org/abs/2410.02638
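A hedged sketch of the single-step association idea above: each new detection is linked to the existing cluster with the strongest combined appearance/position evidence, and clusters keep a sparse set of appearance cues so comparison uses the best available match. The greedy assignment below is a simplified stand-in for the paper's multicut procedure; thresholds and weights are illustrative.

```python
import numpy as np

class Cluster:
    def __init__(self, feat, pos, max_cues=10):
        self.feats = [feat]          # sparse appearance cues (unit vectors)
        self.pos = pos               # last known position
        self.max_cues = max_cues

    def affinity(self, feat, pos, sigma_pos=5.0):
        # strongest available appearance evidence across the stored cues
        app = max(float(f @ feat) for f in self.feats)
        spatial = np.exp(-np.linalg.norm(self.pos - pos) / sigma_pos)
        return 0.7 * app + 0.3 * spatial

    def add(self, feat, pos):
        self.feats = (self.feats + [feat])[-self.max_cues:]  # keep cues sparse
        self.pos = pos

def associate(clusters, detections, threshold=0.6):
    """Greedily attach each detection (feat, pos) to its best cluster."""
    for feat, pos in detections:
        feat = feat / np.linalg.norm(feat)
        scores = [c.affinity(feat, pos) for c in clusters]
        if scores and max(scores) > threshold:
            clusters[int(np.argmax(scores))].add(feat, pos)
        else:
            clusters.append(Cluster(feat, pos))   # start a new track
    return clusters
```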
Visual language tracking (VLT) has emerged as a cutting-edge research area, harnessing linguistic data to enhance algorithms with multi-modal inputs and broadening the scope of traditional single object tracking (SOT) to encompass video understanding applications. Despite this, most VLT benchmarks still depend on succinct, human-annotated text descriptions for each video. These descriptions often fall short in capturing the nuances of video content dynamics and lack stylistic variety in language, constrained by their uniform level of detail and a fixed annotation frequency. As a result, algorithms tend to default to a "memorize the answer" strategy, diverging from the core objective of achieving a deeper understanding of video content. Fortunately, the emergence of large language models (LLMs) has enabled the generation of diverse text. This work utilizes LLMs to generate varied semantic annotations (in terms of text lengths and granularities) for representative SOT benchmarks, thereby establishing a novel multi-modal benchmark. Specifically, we (1) propose a new visual language tracking benchmark with diverse texts, named DTVLT, based on five prominent VLT and SOT benchmarks, including three sub-tasks: short-term tracking, long-term tracking, and global instance tracking. (2) We offer four granularity texts in our benchmark, considering the extent and density of semantic information. We expect this multi-granular generation strategy to foster a favorable environment for VLT and video understanding research. (3) We conduct comprehensive experimental analyses on DTVLT, evaluating the impact of diverse text on tracking performance and hope the identified performance bottlenecks of existing algorithms can support further research in VLT and video understanding. The proposed benchmark, experimental results and toolkit will be released gradually on this http URL.
https://arxiv.org/abs/2410.02492
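A hedged sketch of the multi-granularity annotation idea: four prompt levels of increasing semantic density are sent to an LLM for each video. The `query_llm` callable and the exact prompt wording are stand-ins; DTVLT's actual generation pipeline is not reproduced here.

```python
# Four granularity levels mirroring "extent and density of semantic
# information"; the prompts below are illustrative assumptions.
GRANULARITIES = {
    "concise":  "In under 10 words, name the target object in this video.",
    "brief":    "In one sentence, describe the target object and its motion.",
    "detailed": "In 2-3 sentences, describe the target's appearance, "
                "position, and how it moves relative to the scene.",
    "dense":    "Write a paragraph covering the target's appearance, "
                "trajectory, interactions, and any occlusions over time.",
}

def annotate(video_caption: str, query_llm) -> dict:
    """Produce one text per granularity for a video described by a caption;
    `query_llm` is a hypothetical LLM endpoint (prompt -> text)."""
    return {
        level: query_llm(f"{prompt}\nVideo content: {video_caption}")
        for level, prompt in GRANULARITIES.items()
    }
```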
Integrating artificial intelligence into modern society is profoundly transformative, significantly enhancing productivity by streamlining various daily tasks. AI-driven recognition systems provide notable advantages in the food sector, including improved nutrient tracking, tackling food waste, and boosting food production and consumption efficiency. Accurate food classification is a crucial initial step in utilizing advanced AI models, as the effectiveness of this process directly influences the success of subsequent operations; therefore, achieving high accuracy at a reasonable speed is essential. Despite existing research efforts, a gap persists in improving performance while ensuring rapid processing times, prompting researchers to pursue cost-effective and precise models. This study addresses this gap by employing the state-of-the-art EfficientNetB7 architecture, enhanced through transfer learning, data augmentation, and the CBAM attention module. This methodology results in a robust model that surpasses previous studies in accuracy while maintaining rapid processing suitable for real-world applications. The Food11 dataset from Kaggle was utilized, comprising 16643 imbalanced images across 11 diverse classes with significant intra-category diversities and inter-category similarities. Furthermore, the proposed methodology, bolstered by various deep learning techniques, consistently achieves an impressive average accuracy of 96.40%. Notably, it can classify over 60 images within one second during inference on unseen data, demonstrating its ability to deliver high accuracy promptly. This underscores its potential for practical applications in accurate food classification and enhancing efficiency in subsequent processes.
https://arxiv.org/abs/2410.02304
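A hedged Keras sketch of the described pipeline: a frozen EfficientNetB7 backbone (transfer learning) followed by a CBAM-style attention block and a small classification head for the 11 Food11 classes. Input size, layer widths, and training settings are illustrative assumptions, not the paper's exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def cbam(x, reduction=8):
    ch = x.shape[-1]
    # channel attention: shared MLP over global average- and max-pooled features
    mlp = tf.keras.Sequential([layers.Dense(ch // reduction, activation="relu"),
                               layers.Dense(ch)])
    avg = mlp(layers.GlobalAveragePooling2D()(x))
    mx = mlp(layers.GlobalMaxPooling2D()(x))
    x = x * layers.Reshape((1, 1, ch))(layers.Activation("sigmoid")(avg + mx))
    # spatial attention: 7x7 conv over channel-wise mean and max maps
    pooled = layers.Lambda(
        lambda t: tf.concat([tf.reduce_mean(t, -1, keepdims=True),
                             tf.reduce_max(t, -1, keepdims=True)], axis=-1))(x)
    sa_w = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(pooled)
    return x * sa_w

base = tf.keras.applications.EfficientNetB7(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
base.trainable = False                       # transfer learning: freeze backbone

inputs = layers.Input((224, 224, 3))
x = base(inputs, training=False)
x = cbam(x)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dropout(0.3)(x)
outputs = layers.Dense(11, activation="softmax")(x)   # Food11: 11 classes

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Data augmentation (flips, rotations, color jitter) would be applied to the training pipeline before fine-tuning the head, per the paper's description.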
Event-based cameras are attracting significant interest as they provide rich edge information, high dynamic range, and high temporal resolution. Many state-of-the-art event-based algorithms rely on splitting the events into fixed groups, resulting in the omission of crucial temporal information, particularly when dealing with diverse motion scenarios (e.g., high/low speed). In this work, we propose SpikeSlicer, a newly designed plug-and-play event processing method capable of splitting the event stream adaptively. SpikeSlicer utilizes a lightweight (0.41M) and low-energy spiking neural network (SNN) to trigger event slicing. To guide the SNN to fire spikes at optimal time steps, we propose the Spiking Position-aware Loss (SPA-Loss) to modulate the neuron's state. Additionally, we develop a Feedback-Update training strategy that refines the slicing decisions using feedback from the downstream artificial neural network (ANN). Extensive experiments demonstrate that our method yields significant performance improvements in event-based object tracking and recognition. Notably, SpikeSlicer provides a brand-new SNN-ANN cooperation paradigm, where the SNN acts as an efficient, low-energy data processor to assist the ANN in improving downstream performance, injecting new perspectives and potential avenues of exploration.
https://arxiv.org/abs/2410.02249
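A hedged, minimal stand-in for the SNN-triggered slicing idea: a single leaky integrate-and-fire (LIF) neuron accumulates event activity and emits a slice boundary whenever it spikes, so busy periods are cut more often than quiet ones. The real SpikeSlicer network, the SPA-Loss, and the feedback training are not reproduced here.

```python
import numpy as np

def lif_slice(event_counts, tau=0.8, threshold=5.0):
    """event_counts: events per time bin; returns indices of slice boundaries."""
    v, boundaries = 0.0, []
    for t, n in enumerate(event_counts):
        v = tau * v + n           # leaky integration of incoming events
        if v >= threshold:        # spike -> close the current slice here
            boundaries.append(t)
            v = 0.0               # reset membrane potential
    return boundaries

# fast motion (dense events) yields more, shorter slices than slow motion
print(lif_slice(np.array([1, 1, 6, 7, 1, 0, 1, 8, 2, 0])))   # -> [2, 3, 7]
```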
Accurate real-time tracking of dexterous hand movements and interactions has numerous applications in human-computer interaction, the metaverse, robotics, and tele-health. Capturing realistic hand movements is challenging because of the large number of articulations and degrees of freedom. Here, we report accurate and dynamic tracking of articulated hand and finger movements using stretchable, washable smart gloves with embedded helical sensor yarns and inertial measurement units. The sensor yarns have a high dynamic range, responding to strains as low as 0.005% and as high as 155%, and show stability during extensive use and washing cycles. Using multi-stage machine learning, we achieve average joint-angle estimation root mean square errors of 1.21 and 1.45 degrees for intra- and inter-subject cross-validation, respectively, matching the accuracy of costly motion capture cameras without their occlusion or field-of-view limitations. We report a data augmentation technique that enhances robustness to sensor noise and variation. We demonstrate accurate tracking of dexterous hand movements during object interactions, opening new avenues of applications including accurate typing on a mock paper keyboard, recognition of complex dynamic and static gestures adapted from American Sign Language, and object identification.
https://arxiv.org/abs/2410.02221
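A hedged sketch of the strain-to-angle regression step on synthetic data, including noise-jittered training copies in the spirit of the paper's augmentation. The channel counts, network size, and linear synthetic mapping are assumptions; the paper's multi-stage pipeline with IMU fusion is richer than this.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.55, size=(2000, 16))      # 16 strain channels (fraction)
W = rng.normal(size=(16, 20))
y = X @ W + 0.01 * rng.normal(size=(2000, 20))   # synthetic 20 joint angles

# noise augmentation: train on jittered copies to improve sensor robustness
X_tr = np.vstack([X[:1600], X[:1600] + 0.02 * rng.normal(size=(1600, 16))])
y_tr = np.vstack([y[:1600], y[:1600]])

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(128, 64),
                                   max_iter=500, random_state=0))
model.fit(X_tr, y_tr)
rmse = np.sqrt(np.mean((model.predict(X[1600:]) - y[1600:]) ** 2))
print(f"joint-angle RMSE: {rmse:.2f} deg (synthetic data)")
```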
Objects we encounter often change appearance as we interact with them. Changes in illumination (shadows), object pose, or movement of nonrigid objects can drastically alter available image features. How do biological visual systems track objects as they change? It may involve specific attentional mechanisms for reasoning about the locations of objects independently of their appearances -- a capability that prominent neuroscientific theories have associated with computing through neural synchrony. We computationally test the hypothesis that the implementation of visual attention through neural synchrony underlies the ability of biological visual systems to track objects that change in appearance over time. We first introduce a novel deep learning circuit that can learn to precisely control attention to features separately from their location in the world through neural synchrony: the complex-valued recurrent neural network (CV-RNN). Next, we compare object tracking in humans, the CV-RNN, and other deep neural networks (DNNs), using FeatureTracker: a large-scale challenge that asks observers to track objects as their locations and appearances change in precisely controlled ways. While humans effortlessly solved FeatureTracker, state-of-the-art DNNs did not. In contrast, our CV-RNN behaved similarly to humans on the challenge, providing a computational proof-of-concept for the role of phase synchronization as a neural substrate for tracking appearance-morphing objects as they move about.
https://arxiv.org/abs/2410.02094
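A hedged toy of the complex-valued recurrence idea behind the CV-RNN: unit magnitudes carry feature evidence while phases act as object "tags," and a modReLU-style nonlinearity thresholds the magnitude without disturbing the phase. The diagonal recurrence below keeps the demonstration honest and simple; the actual CV-RNN learns full complex weights and attention control.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# diagonal unitary recurrence: each unit's phase rotates at its own small rate
W = np.diag(np.exp(1j * rng.uniform(-0.1, 0.1, d)))

def cv_rnn_step(h, x):
    """One complex-valued step with a modReLU-style nonlinearity:
    the threshold acts on the magnitude, leaving the phase tag intact."""
    z = W @ h + x
    mag = np.maximum(np.abs(z) - 0.1, 0.0)
    return mag * np.exp(1j * np.angle(z))

# two objects occupy disjoint feature channels, tagged by distinct phases
x = np.zeros(d, dtype=complex)
x[:4] = np.exp(1j * 0.0)          # object A: phase tag 0
x[4:] = np.exp(1j * np.pi / 2)    # object B: phase tag pi/2
h = np.zeros(d, dtype=complex)
for _ in range(5):
    h = cv_rnn_step(h, x)
print(np.round(np.angle(h), 2))   # channels remain grouped by phase tag
```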
We reframe scene flow as the problem of estimating a continuous space and time PDE that describes motion for an entire observation sequence, represented with a neural prior. Our resulting unsupervised method, EulerFlow, produces high quality scene flow on real-world data across multiple domains, including large-scale autonomous driving scenes and dynamic tabletop settings. Notably, EulerFlow produces high quality flow on small, fast moving objects like birds and tennis balls, and exhibits emergent 3D point tracking behavior by solving its estimated PDE over long time horizons. On the Argoverse 2 2024 Scene Flow Challenge, EulerFlow outperforms all prior art, beating the next best unsupervised method by over 2.5x and the next best supervised method by over 10%.
https://arxiv.org/abs/2410.02031
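A hedged sketch of the core idea: a small MLP serves as a neural prior for a continuous velocity field v(x, t), and scene flow or long-horizon point tracking falls out of integrating it with Euler steps. Training (e.g., fitting the field against neighboring point clouds) is omitted; the architecture below is an assumption.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))

    def forward(self, x, t):
        # x: (N, 3) positions, t: scalar time -> (N, 3) velocities
        tcol = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, tcol], dim=-1))

def track(field, x0, t0, t1, steps=20):
    """Follow points from t0 to t1 by forward-Euler integration of the PDE."""
    x, dt = x0, (t1 - t0) / steps
    for i in range(steps):
        x = x + dt * field(x, t0 + i * dt)
    return x

field = VelocityField()
pts = torch.randn(100, 3)
print(track(field, pts, 0.0, 1.0).shape)   # emergent point tracks: (100, 3)
```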
Multiple object tracking in complex scenarios - such as coordinated dance performances, team sports, or dynamic animal groups - presents unique challenges. In these settings, objects frequently move in coordinated patterns, occlude each other, and exhibit long-term dependencies in their trajectories. However, it remains a key open research question on how to model long-range dependencies within tracklets, interdependencies among tracklets, and the associated temporal occlusions. To this end, we introduce Samba, a novel linear-time set-of-sequences model designed to jointly process multiple tracklets by synchronizing the multiple selective state-spaces used to model each tracklet. Samba autoregressively predicts the future track query for each sequence while maintaining synchronized long-term memory representations across tracklets. By integrating Samba into a tracking-by-propagation framework, we propose SambaMOTR, the first tracker effectively addressing the aforementioned issues, including long-range dependencies, tracklet interdependencies, and temporal occlusions. Additionally, we introduce an effective technique for dealing with uncertain observations (MaskObs) and an efficient training recipe to scale SambaMOTR to longer sequences. By modeling long-range dependencies and interactions among tracked objects, SambaMOTR implicitly learns to track objects accurately through occlusions without any hand-crafted heuristics. Our approach significantly surpasses prior state-of-the-art on the DanceTrack, BFT, and SportsMOT datasets.
https://arxiv.org/abs/2410.01806
C-ADMM is a well-known distributed optimization framework due to its guaranteed convergence in convex optimization problems. Recently, C-ADMM has been studied in robotics applications such as multi-vehicle target tracking and collaborative manipulation tasks. However, few works have investigated the performance of C-ADMM applied to non-convex problems in robotics applications due to a lack of theoretical guarantees. For this project, we aim to quantitatively explore and examine the convergence behavior of non-convex C-ADMM through the scope of distributed multi-robot trajectory planning. We formulate a convex trajectory planning problem by leveraging C-ADMM and Buffered Voronoi Cells (BVCs) to sidestep the non-convex collision avoidance constraint, and compare this convex C-ADMM algorithm to a non-convex C-ADMM baseline with non-convex collision avoidance constraints. We show that the convex C-ADMM algorithm requires 1000 fewer iterations to achieve convergence in a multi-robot waypoint navigation scenario. We also confirm that the non-convex C-ADMM baseline leads to sub-optimal solutions and violations of safety constraints in trajectory generation.
https://arxiv.org/abs/2410.01728
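A hedged, scalar-per-robot illustration of the consensus ADMM (C-ADMM) iteration: each robot minimizes a local quadratic cost plus an augmented-Lagrangian term pulling all copies toward agreement. The trajectory-planning version replaces the scalars with trajectories and adds per-robot BVC half-space constraints for collision avoidance; the costs below are illustrative.

```python
import numpy as np

# local costs f_i(x) = 0.5 * a_i * (x - b_i)^2; all robots must agree on x
a = np.array([1.0, 2.0, 4.0])
b = np.array([0.0, 3.0, 6.0])
rho = 1.0

x = np.zeros(3)          # local copies of the decision variable
lam = np.zeros(3)        # dual variables
for k in range(50):
    z = np.mean(x + lam / rho)                 # consensus (averaging) step
    x = (a * b + rho * z - lam) / (a + rho)    # closed-form local minimization
    lam = lam + rho * (x - z)                  # dual ascent
print(x, "->", np.sum(a * b) / np.sum(a))      # converges to the weighted mean
```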
3D multi-object tracking plays a critical role in autonomous driving by enabling the real-time monitoring and prediction of multiple objects' movements. Traditional 3D tracking systems are typically constrained by predefined object categories, limiting their adaptability to novel, unseen objects in dynamic environments. To address this limitation, we introduce open-vocabulary 3D tracking, which extends the scope of 3D tracking to include objects beyond predefined categories. We formulate the problem of open-vocabulary 3D tracking and introduce dataset splits designed to represent various open-vocabulary scenarios. We propose a novel approach that integrates open-vocabulary capabilities into a 3D tracking framework, allowing for generalization to unseen object classes. Our method effectively reduces the performance gap between tracking known and novel objects through strategic adaptation. Experimental results demonstrate the robustness and adaptability of our method in diverse outdoor driving scenarios. To the best of our knowledge, this work is the first to address open-vocabulary 3D tracking, presenting a significant advancement for autonomous systems in real-world settings. Code, trained models, and dataset splits are available publicly.
https://arxiv.org/abs/2410.01678
Advancements in Natural Language Processing (NLP) have led to the emergence of Large Language Models (LLMs) such as GPT, Llama, Claude, and Gemini, which excel across a range of tasks but require extensive fine-tuning to align their outputs with human expectations. A widely used method for achieving this alignment is Reinforcement Learning from Human Feedback (RLHF), which, despite its success, faces challenges in accurately modelling human preferences. In this paper, we introduce GazeReward, a novel framework that integrates implicit feedback -- specifically eye-tracking (ET) data -- into the Reward Model (RM). In addition, we explore how ET-based features can provide insights into user preferences. Through ablation studies, we test our framework with different integration methods, LLMs, and ET generator models, demonstrating that our approach significantly improves the accuracy of the RM on established human preference datasets. This work advances the ongoing discussion on optimizing AI alignment with human values, exploring the potential of cognitive data for shaping future NLP research.
https://arxiv.org/abs/2410.01532
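A hedged sketch of a reward model that fuses text features with eye-tracking features and trains on pairwise preferences with the standard Bradley-Terry RLHF objective. The feature dimensions and the simple concatenation fusion are assumptions, not GazeReward's actual integration method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeRewardModel(nn.Module):
    def __init__(self, d_text=768, d_gaze=32, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_text + d_gaze, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, text_emb, gaze_feat):
        # gaze_feat: e.g. fixation counts/durations pooled per response
        return self.head(torch.cat([text_emb, gaze_feat], dim=-1)).squeeze(-1)

rm = GazeRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)

# one pairwise step: reward(chosen) should exceed reward(rejected)
t_c, g_c = torch.randn(8, 768), torch.randn(8, 32)   # chosen responses
t_r, g_r = torch.randn(8, 768), torch.randn(8, 32)   # rejected responses
loss = -F.logsigmoid(rm(t_c, g_c) - rm(t_r, g_r)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```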
Diffusion models have recently shown the ability to generate high-quality images. However, controlling their generation process still poses challenges. Image style transfer, which transfers the visual attributes of a style image to a content image, is one such challenge. A typical obstacle in this task is the need for additional training of a pre-trained model. We propose a training-free style transfer algorithm, Style Tracking Reverse Diffusion Process (STRDP), for a pretrained Latent Diffusion Model (LDM). Our algorithm applies the Adaptive Instance Normalization (AdaIN) function in a distinct manner during the reverse diffusion process of an LDM while tracking the encoding history of the style image. This enables style transfer in the latent space of the LDM for reduced computational cost and provides compatibility with various LDM models. Through a series of experiments and a user study, we show that our method can quickly transfer the style of an image without additional training. The speed, compatibility, and training-free nature of our algorithm facilitate agile experimentation with combinations of styles and LDMs for extensive application.
https://arxiv.org/abs/2410.01366
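A hedged sketch of AdaIN and one plausible place for it in a reverse-diffusion loop. `denoise_step` and `style_latents` are stand-ins for the LDM sampler and the tracked encoding history of the style image; STRDP's exact scheduling of AdaIN across timesteps is not reproduced here.

```python
import torch

def adain(content, style, eps=1e-5):
    """Match per-channel mean/std of `content` to `style`; shape (N, C, H, W)."""
    c_mu = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mu = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content - c_mu) / c_std + s_mu

def stylized_reverse_process(z_T, style_latents, denoise_step):
    """Re-align latent statistics to the style's encoding path at each step."""
    z = z_T
    steps = range(len(style_latents) - 1, -1, -1)
    for t, z_style in zip(steps, style_latents[::-1]):
        z = denoise_step(z, t)        # one reverse-diffusion (denoising) step
        z = adain(z, z_style)         # inject style statistics in latent space
    return z
```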
Surgery monitoring in Mixed Reality (MR) environments has recently received substantial focus due to its importance in image-based decisions, skill assessment, and robot-assisted surgery. Tracking hands and articulated surgical instruments is crucial for the success of these applications. Due to the lack of annotated datasets and the complexity of the task, only a few works have addressed this problem. In this work, we present SurgeoNet, a real-time neural network pipeline to accurately detect and track surgical instruments from a stereo VR view. Our multi-stage approach is inspired by state-of-the-art neural-network architectural design, like YOLO and Transformers. We demonstrate the generalization capabilities of SurgeoNet in challenging real-world scenarios, achieved solely through training on synthetic data. The approach can be easily extended to any new set of articulated surgical instruments. SurgeoNet's code and data are publicly available.
https://arxiv.org/abs/2410.01293
In percutaneous pelvic trauma surgery, accurate placement of Kirschner wires (K-wires) is crucial to ensure effective fracture fixation and avoid complications due to breaching the cortical bone along an unsuitable trajectory. Surgical navigation via mixed reality (MR) can help achieve precise wire placement in a low-profile form factor. Current approaches in this domain are as yet unsuitable for real-world deployment because they fall short of guaranteeing accurate visual feedback due to uncontrolled bending of the wire. To ensure accurate feedback, we introduce StraightTrack, an MR navigation system designed for percutaneous wire placement in complex anatomy. StraightTrack features a marker body equipped with a rigid access cannula that mitigates wire bending due to interactions with soft tissue and a covered bony surface. Integrated with an Optical See-Through Head-Mounted Display (OST HMD) capable of tracking the cannula body, StraightTrack offers real-time 3D visualization and guidance without external trackers, which are prone to losing line-of-sight. In phantom experiments with two experienced orthopedic surgeons, StraightTrack improves wire placement accuracy, achieving the ideal trajectory within $5.26 \pm 2.29$ mm and $2.88 \pm 1.49$ degrees, compared to over 12.08 mm and 4.07 degrees for comparable methods. As MR navigation systems continue to mature, StraightTrack realizes their potential for internal fracture fixation and other percutaneous orthopedic procedures.
https://arxiv.org/abs/2410.01143
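A hedged sketch of how placement errors of the kind reported above can be computed: the offset of the executed entry point from the planned trajectory line, and the angle between the planned and executed wire axes, from 3D line representations. This is a generic geometric calculation, not the paper's evaluation code.

```python
import numpy as np

def placement_errors(p_plan, d_plan, p_exec, d_exec):
    """Lines given as (point, direction); returns (mm offset, degrees)."""
    d_plan = d_plan / np.linalg.norm(d_plan)
    d_exec = d_exec / np.linalg.norm(d_exec)
    # distance from the executed entry point to the planned trajectory line
    v = p_exec - p_plan
    dist = np.linalg.norm(v - (v @ d_plan) * d_plan)
    # unsigned angle between the two wire axes
    angle = np.degrees(np.arccos(np.clip(abs(d_plan @ d_exec), -1.0, 1.0)))
    return dist, angle

print(placement_errors(np.zeros(3), np.array([0, 0, 1.0]),
                       np.array([3.0, 4.0, 0]), np.array([0.05, 0, 1.0])))
```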
Golog is an expressive high-level agent language that includes nondeterministic operators, which allow some decisions to be deferred until execution time. This so-called program realization is typically implemented by means of search, or in an incremental online fashion. In this paper, we consider the more realistic case where parts of the non-determinism are under the control of the environment. Program realization then becomes a synthesis problem, where a successful realization executes the program and satisfies the temporal goal for all possible environment actions. We consider Golog programs in combination with an expressive class of first-order action theories that allow for an unbounded number of objects and non-local effects, together with a temporal goal specified in a first-order extension of LTLf. We solve the synthesis problem by constructing a game arena that captures all possible executions of the program while tracking the satisfaction of the temporal goal, and then solving the resulting two-player game. We evaluate the approach in two domains, demonstrating its general feasibility.
https://arxiv.org/abs/2410.00726
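A hedged illustration of the final step above, solving the two-player game on the constructed arena. Here the temporal goal is reduced to a plain reachability condition and solved by the standard attractor computation; the full approach tracks first-order LTLf goals over program executions, which this toy omits.

```python
def attractor(nodes, edges, owner, goal):
    """nodes: iterable; edges: dict node -> successor list;
    owner[n] in {"system", "env"}; returns the set of nodes from which
    the system player can force reaching `goal`."""
    win = set(goal)
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n in win:
                continue
            succ = edges[n]
            if owner[n] == "system" and any(s in win for s in succ):
                win.add(n); changed = True   # system picks a winning move
            elif owner[n] == "env" and succ and all(s in win for s in succ):
                win.add(n); changed = True   # env cannot avoid the winning set
    return win

# toy arena: a realization exists from s0 iff s0 lies in the attractor of g
edges = {"s0": ["e0"], "e0": ["s1", "g"], "s1": ["g"], "g": []}
owner = {"s0": "system", "e0": "env", "s1": "system", "g": "system"}
print(attractor(edges.keys(), edges, owner, {"g"}))   # includes "s0"
```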
Cross-camera data association is one of the cornerstones of the multi-camera computer vision field. Although often integrated into detection and tracking tasks through architecture design and loss definition, it is also recognized as an independent challenge. The ultimate goal is to connect appearances of one item from all cameras, wherever it is visible. Therefore, one possible perspective on this task involves supervised clustering of the affinity graph, where nodes are instances captured by all cameras. They are represented by appropriate visual features and positional attributes. We leverage the advantages of GNN (Graph Neural Network) architecture to examine nodes' relations and generate representative edge embeddings. These embeddings are then classified to determine the existence or non-existence of connections in node pairs. Therefore, the core of this approach is graph connectivity prediction. Experimental validation was conducted on multicamera pedestrian datasets across diverse environments such as the laboratory, basketball court, and terrace. Our proposed method, named SGC-CCA, outperformed the state-of-the-art method named GNN-CCA across all clustering metrics, offering an end-to-end clustering solution without the need for graph post-processing. The code is available at this https URL.
https://arxiv.org/abs/2410.00643
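A hedged sketch of the connectivity-prediction core: an edge embedding built from a node pair's features is classified as connect / don't connect. The symmetric pair encoding and MLP below are common choices assumed for illustration; the GNN message-passing layers that refine node features beforehand are omitted.

```python
import torch
import torch.nn as nn

class EdgeClassifier(nn.Module):
    def __init__(self, d_node=128):
        super().__init__()
        # edge embedding: [|f_i - f_j|, f_i * f_j] is symmetric in (i, j)
        self.mlp = nn.Sequential(nn.Linear(2 * d_node, 64), nn.ReLU(),
                                 nn.Linear(64, 1))

    def forward(self, f_i, f_j):
        e = torch.cat([(f_i - f_j).abs(), f_i * f_j], dim=-1)
        return torch.sigmoid(self.mlp(e)).squeeze(-1)   # P(same identity)

clf = EdgeClassifier()
f = torch.randn(5, 128)                  # node features from all cameras
i, j = torch.triu_indices(5, 5, offset=1)
probs = clf(f[i], f[j])                  # one score per candidate edge
print(probs.shape)                       # torch.Size([10])
```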
Unmanned Aerial Vehicles (UAVs) are becoming more popular in various sectors, offering many benefits, yet introducing significant challenges to privacy and safety. This paper investigates state-of-the-art solutions for detecting and tracking quadrotor UAVs to address these concerns. Cutting-edge deep learning models, specifically the YOLOv5 and YOLOv8 series, are evaluated for their performance in identifying UAVs accurately and quickly. Additionally, robust tracking systems, BoT-SORT and Byte Track, are integrated to ensure reliable monitoring even under challenging conditions. Our tests on the DUT dataset reveal that while YOLOv5 models generally outperform YOLOv8 in detection accuracy, the YOLOv8 models excel in recognizing less distinct objects, demonstrating their adaptability and advanced capabilities. Furthermore, BoT-SORT demonstrated superior performance over Byte Track, achieving higher IoU and lower center error in most cases, indicating more accurate and stable tracking. Code: this https URL Tracking demo: this https URL
https://arxiv.org/abs/2410.00285
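A hedged sketch of the two evaluation quantities cited above: bounding-box IoU and center error between a predicted and a ground-truth box, each in (x1, y1, x2, y2) format. This is the standard definition, not code from the compared trackers.

```python
import numpy as np

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def center_error(a, b):
    ca = np.array([(a[0] + a[2]) / 2, (a[1] + a[3]) / 2])
    cb = np.array([(b[0] + b[2]) / 2, (b[1] + b[3]) / 2])
    return float(np.linalg.norm(ca - cb))

pred, gt = (10, 10, 50, 50), (12, 8, 52, 48)
print(iou(pred, gt), center_error(pred, gt))   # ~0.82 and ~2.83 px
```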
Model-based control faces fundamental challenges in partially observable environments due to unmodeled obstacles. We propose an online learning and optimization method to identify and avoid unobserved obstacles. Our method, Constraint Obeying Gaussian Implicit Surfaces (COGIS), infers contact data using a combination of visual input and state tracking, informed by predictions from a nominal dynamics model. We then fit a Gaussian process implicit surface (GPIS) to these data and refine the dataset through a novel method of enforcing constraints on the estimated surface. This allows us to design a Model Predictive Control (MPC) method that leverages the obstacle estimate to complete multiple manipulation tasks. By modeling the environment instead of attempting to directly adapt the dynamics, our method succeeds at both low-dimensional peg-in-hole tasks and high-dimensional deformable object manipulation tasks. On a real-world cable manipulation task under partial observability of the environment, our method succeeds in 10/10 trials versus 1/10 for a baseline.
https://arxiv.org/abs/2410.00157
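A hedged 2D sketch of a Gaussian process implicit surface (GPIS): contact points are labeled 0 (on the surface), a known interior point +1, and known free-space points -1; the estimated surface is the zero level set of the GP posterior mean. The constraint-enforcement refinement and the MPC coupling from COGIS are not shown, and the geometry below is synthetic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

theta = np.linspace(0, 2 * np.pi, 12, endpoint=False)
contacts = np.c_[np.cos(theta), np.sin(theta)]        # contacts on a unit circle
X = np.vstack([contacts, [[0.0, 0.0]], 2.0 * contacts[::3]])
y = np.hstack([np.zeros(len(contacts)),               # d = 0 on the surface
               [1.0],                                 # interior point
               -np.ones(len(contacts[::3]))])         # observed free space

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-4)
gp.fit(X, y)

# query the implicit function: ~0 on the surface, positive inside,
# negative toward known free space
query = np.array([[1.0, 0.0], [0.0, 0.0], [1.6, 0.0]])
print(np.round(gp.predict(query), 2))
```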
Humanoid robots are designed to perform diverse loco-manipulation tasks. However, they face challenges due to their high-dimensional and unstable dynamics, as well as the complex contact-rich nature of the tasks. Model-based optimal control methods offer precise and systematic control but are limited by high computational complexity and the need for accurate contact sensing. On the other hand, reinforcement learning (RL) provides robustness and handles high-dimensional spaces but suffers from inefficient learning, unnatural motion, and sim-to-real gaps. To address these challenges, we introduce Opt2Skill, an end-to-end pipeline that combines model-based trajectory optimization with RL to achieve robust whole-body loco-manipulation. We generate reference motions for the Digit humanoid robot using differential dynamic programming (DDP) and train RL policies to track these trajectories. Our results demonstrate that Opt2Skill outperforms pure RL methods in both training efficiency and task performance, with optimal trajectories that account for torque limits further enhancing trajectory tracking. We successfully transfer our approach to real-world applications.
https://arxiv.org/abs/2409.20514
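A hedged sketch of the second stage's tracking objective: the RL policy is rewarded for staying close to the DDP reference trajectory. The exponential shaping, weights, and joint count are illustrative assumptions; Opt2Skill's full reward also covers other terms such as torques and contacts.

```python
import numpy as np

def tracking_reward(q, q_ref, qd, qd_ref, w_pos=5.0, w_vel=0.5):
    """q, q_ref: joint positions; qd, qd_ref: joint velocities (one timestep)."""
    r_pos = np.exp(-w_pos * np.sum((q - q_ref) ** 2))   # position tracking
    r_vel = np.exp(-w_vel * np.sum((qd - qd_ref) ** 2)) # velocity tracking
    return 0.8 * r_pos + 0.2 * r_vel

q_ref, qd_ref = np.zeros(12), np.zeros(12)   # one reference sample (12 joints)
print(tracking_reward(q_ref + 0.01, q_ref, qd_ref + 0.1, qd_ref))
```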