Facial feature tracking is essential in imaging ballistocardiography for accurate heart rate estimation and enables motor degradation quantification in Parkinson's disease through skin feature tracking. While deep convolutional neural networks have shown remarkable accuracy in tracking tasks, they typically require extensive labeled data for supervised training. Our proposed pipeline employs a convolutional stacked autoencoder to match image crops with a reference crop containing the target feature, learning deep feature encodings specific to the object category in an unsupervised manner, thus reducing data requirements. To overcome edge effects that make performance dependent on crop size, we introduce a Gaussian weight on the pixel residual errors when calculating the loss function. Training the autoencoder on facial images and validating its performance on manually labeled face and hand videos, our Deep Feature Encodings (DFE) method demonstrated superior tracking accuracy, with a mean error ranging from 0.6 to 3.3 pixels, outperforming traditional methods such as SIFT, SURF, and Lucas-Kanade, as well as the latest transformer-based trackers such as PIPs++ and CoTracker. Overall, our unsupervised learning approach excels at tracking various skin features under significant motion, providing superior feature descriptors for tracking, matching, and image registration compared to both traditional and state-of-the-art supervised learning methods.
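The Gaussian weighting idea in the abstract above is simple to sketch. The paper's exact loss is not reproduced here; the following is a minimal illustration (function name, crop size, and sigma are our own choices) of down-weighting pixel residuals toward the crop border so edge effects stop dominating the loss:

```python
import numpy as np

def gaussian_weighted_mse(residual, sigma):
    """MSE with a centered Gaussian weight over pixels.

    Residuals near the crop border are down-weighted, so the loss is
    less sensitive to edge effects and crop size. `residual` is an
    (H, W) array of per-pixel reconstruction errors.
    """
    h, w = residual.shape
    ys = np.arange(h) - (h - 1) / 2.0
    xs = np.arange(w) - (w - 1) / 2.0
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    weight = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    weight /= weight.sum()                      # normalize to a weighted mean
    return float(np.sum(weight * residual**2))

# The same unit error contributes far less at a corner than at the center.
err_center = np.zeros((9, 9)); err_center[4, 4] = 1.0
err_corner = np.zeros((9, 9)); err_corner[0, 0] = 1.0
assert gaussian_weighted_mse(err_center, 2.0) > gaussian_weighted_mse(err_corner, 2.0)
```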
https://arxiv.org/abs/2405.04943
The collaborative robot market is flourishing as there is a trend towards simplification, modularity, and increased flexibility on the production line. But when humans and robots are collaborating in a shared environment, the safety of humans should be a priority. We introduce a novel wearable robotic system to enhance safety during Human Robot Interaction (HRI). The proposed wearable robot is designed to hold a fiducial marker and maintain its visibility to the tracking system, which, in turn, localizes the user's hand with good accuracy and low latency and provides haptic feedback on the user's wrist. The haptic feedback guides the user's hand movement during collaborative tasks in order to increase safety and enhance collaboration efficiency. A user study was conducted to assess the recognition and discriminability of ten designed haptic patterns applied to the volar and dorsal parts of the user's wrist. As a result, four patterns with a high recognition rate were chosen to be incorporated into our system. A second experiment was carried out to evaluate the system integration into real-world collaborative tasks.
https://arxiv.org/abs/2405.04899
In this study, we propose a safety-critical compliant control strategy designed to strictly enforce interaction force constraints during the physical interaction of robots with unknown environments. The interaction force constraint is interpreted as a new force-constrained control barrier function (FC-CBF) by exploiting the generalized contact model and the prior information of the environment, i.e., the prior stiffness and rest position, for robot kinematics. The difference between the real environment and the generalized contact model is approximated by constructing a tracking differentiator, and its estimation error is quantified based on Lyapunov theory. By interpreting strict interaction safety specification as a dynamic constraint, restricting the desired joint angular rates in kinematics, the proposed approach modifies nominal compliant controllers using quadratic programming, ensuring adherence to interaction force constraints in unknown environments. The strict force constraint and the stability of the closed-loop system are rigorously analyzed. Experimental tests using a UR3e industrial robot with different environments verify the effectiveness of the proposed method in achieving the force constraints in unknown environments.
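The QP that minimally modifies a nominal compliant controller admits a closed form when there is a single linear constraint. This is only a generic CBF-style safety-filter sketch, not the paper's FC-CBF formulation: the constraint vector `a` and bound `b` stand in for the linearized force-barrier condition on the desired joint rates.

```python
import numpy as np

def cbf_qp_filter(u_nom, a, b):
    """Closed-form solution of the single-constraint safety QP

        min_u ||u - u_nom||^2   s.t.   a @ u <= b

    If the nominal command already satisfies the constraint it is kept;
    otherwise it is projected onto the constraint boundary.
    """
    u_nom = np.asarray(u_nom, dtype=float)
    a = np.asarray(a, dtype=float)
    slack = a @ u_nom - b
    if slack <= 0.0:                      # nominal command is already safe
        return u_nom
    return u_nom - a * slack / (a @ a)    # minimal-norm correction

u_nom = np.array([1.0, 0.5])              # nominal desired joint rates
a = np.array([1.0, 0.0])                  # stand-in barrier gradient
u_safe = cbf_qp_filter(u_nom, a, b=0.2)
assert a @ u_safe <= 0.2 + 1e-9
```

With multiple constraints a real implementation would call a QP solver instead of this closed form.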
https://arxiv.org/abs/2405.04859
Aligning machine learning systems with human expectations is mostly attempted by training with manually vetted human behavioral samples, typically explicit feedback. This is done at the population level, since the context capturing the subjective Point-Of-View (POV) of a concrete person in a specific situation is not retained in the data. However, we argue that alignment on an individual level can considerably boost the subjective predictive performance for the individual user interacting with the system. Since perception differs for each person, the same situation is observed differently. Consequently, the basis for decision making and the subsequent reasoning processes and observable reactions differ. We hypothesize that individual perception patterns can be used for improving the alignment on an individual level. We test this by integrating perception information into machine learning systems and measuring their predictive performance with respect to individual subjective assessments. For our empirical study, we collect a novel data set of multimodal stimuli and corresponding eye tracking sequences for the novel task of Perception-Guided Crossmodal Entailment and tackle it with our Perception-Guided Multimodal Transformer. Our findings suggest that exploiting individual perception signals for the machine learning of subjective human assessments provides a valuable cue for individual alignment. It not only improves the overall predictive performance from the point of view of the individual user but might also contribute to steering AI systems towards every person's individual expectations and values.
https://arxiv.org/abs/2405.04443
Vision-centric autonomous driving has recently raised wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However, current vision-centric pre-training typically relies on either 2D or 3D pre-text tasks, overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper, we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework, dubbed \emph{DriveWorld}, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion. Specifically, we propose a Memory State-Space Model for spatio-temporal modelling, which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset, DriveWorld achieves a 7.5% increase in mAP for 3D object detection, a 3.0% increase in IoU for online mapping, a 5.0% increase in AMOTA for multi-object tracking, a 0.1m decrease in minADE for motion forecasting, a 3.0% increase in IoU for occupancy prediction, and a 0.34m reduction in average L2 error for planning.
https://arxiv.org/abs/2405.04390
High-definition maps with accurate lane-level information are crucial for autonomous driving, but the creation of these maps is a resource-intensive process. To this end, we present a cost-effective solution to create lane-level roadmaps using only the global navigation satellite system (GNSS) and a camera on customer vehicles. Our proposed solution utilizes a prior standard-definition (SD) map, GNSS measurements, visual odometry, and lane marking edge detection points to simultaneously estimate the vehicle's 6D pose, its position within the SD map, and the 3D geometry of traffic lines. This is achieved using a Bayesian simultaneous localization and multi-object tracking filter, where the estimation of traffic lines is formulated as a multiple extended object tracking problem, solved using a trajectory Poisson multi-Bernoulli mixture (TPMBM) filter. In TPMBM filtering, traffic lines are modeled using B-spline trajectories, and each trajectory is parameterized by a sequence of control points. The proposed solution has been evaluated using experimental data collected by a test vehicle driving on a highway. Preliminary results show that the traffic line estimates, overlaid on the satellite image, generally align with the lane markings up to some lateral offsets.
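Parameterizing a traffic line by a sequence of B-spline control points, as in the abstract above, means each point on the line is a fixed blend of four consecutive control points. A minimal uniform cubic B-spline evaluator (our own helper, not the TPMBM implementation):

```python
import numpy as np

def cubic_bspline_point(ctrl, t):
    """Evaluate a uniform cubic B-spline trajectory at global parameter t.

    `ctrl` is an (N, D) array of control points; each unit interval of t
    blends four consecutive control points with the standard cubic basis.
    """
    ctrl = np.asarray(ctrl, dtype=float)
    seg = min(int(t), ctrl.shape[0] - 4)   # which 4-point window
    u = t - seg                            # local parameter in [0, 1]
    basis = np.array([
        (1 - u) ** 3,
        3 * u**3 - 6 * u**2 + 4,
        -3 * u**3 + 3 * u**2 + 3 * u + 1,
        u**3,
    ]) / 6.0
    return basis @ ctrl[seg:seg + 4]

# Linear precision: collinear control points yield points on that line.
line = np.stack([np.arange(6.0), np.zeros(6)], axis=1)
p = cubic_bspline_point(line, 1.5)
assert abs(p[1]) < 1e-12
```

Estimating the line then amounts to estimating the control points, which is what keeps the extended-object state finite-dimensional.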
https://arxiv.org/abs/2405.04290
Most studies in swarm robotics treat the swarm as an isolated system of interest. We argue that the prevailing view of swarms as self-sufficient, independent systems limits the scope of potential applications for swarm robotics. A robot swarm could act as a support in a heterogeneous system comprising other robots and/or human operators, in particular by quickly providing access to a large amount of data acquired in large unknown environments. Tasks such as target identification & tracking, scouting, or monitoring/surveillance could benefit from this approach.
https://arxiv.org/abs/2405.04079
Direct methods for event-based visual odometry solve the mapping and camera pose tracking sub-problems by establishing implicit data association in a way that exploits the generative model of events. The main bottlenecks faced by state-of-the-art work in this field are the high computational complexity of mapping and the limited accuracy of tracking. In this paper, we improve our previous direct pipeline, \textit{Event-based Stereo Visual Odometry}, in terms of accuracy and efficiency. To speed up the mapping operation, we propose an efficient strategy of edge-pixel sampling according to the local dynamics of events. The mapping performance in terms of completeness and local smoothness is also improved by combining the temporal stereo results and the static stereo results. To circumvent the degeneracy issue of camera pose tracking in recovering the yaw component of general 6-DoF motion, we introduce gyroscope measurements as a prior via pre-integration. Experiments on publicly available datasets justify our improvement. We release our pipeline as open-source software for future research in this field.
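Gyroscope pre-integration, used above as a rotation prior, accumulates angular-rate samples into a single relative rotation. A bare-bones sketch using the Rodrigues formula (bias and noise terms, which a real pre-integration scheme carries along, are omitted):

```python
import numpy as np

def so3_exp(w):
    """Rodrigues formula: rotation matrix for a rotation vector w."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def preintegrate_gyro(omegas, dt):
    """Compose per-sample incremental rotations into a relative-rotation prior."""
    R = np.eye(3)
    for w in omegas:
        R = R @ so3_exp(np.asarray(w, dtype=float) * dt)
    return R

# Constant yaw rate of 0.1 rad/s for 1 s -> ~0.1 rad rotation about z.
R = preintegrate_gyro([[0.0, 0.0, 0.1]] * 100, dt=0.01)
yaw = np.arctan2(R[1, 0], R[0, 0])
assert abs(yaw - 0.1) < 1e-6
```

This prior is exactly the quantity that constrains the otherwise weakly observable yaw component of the motion.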
https://arxiv.org/abs/2405.04071
We propose a hybrid framework for consistently producing high-quality object tracks by combining an automated object tracker with little human input. The key idea is to tailor a module for each dataset to intelligently decide when an object tracker is failing and so humans should be brought in to re-localize an object for continued tracking. Our approach leverages self-supervised learning on unlabeled videos to learn a tailored representation for a target object that is then used to actively monitor its tracked region and decide when the tracker fails. Since labeled data is not needed, our approach can be applied to novel object categories. Experiments on three datasets demonstrate our method outperforms existing approaches, especially for small, fast moving, or occluded objects.
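The failure-monitoring step above ultimately reduces to comparing the learned embedding of the currently tracked region against the target's reference embedding. A schematic decision rule (the embedding dimensionality and threshold are hypothetical; the paper's monitor is learned per dataset):

```python
import numpy as np

def tracker_failed(ref_embedding, crop_embedding, threshold=0.5):
    """Flag failure when the tracked crop's embedding drifts too far
    (in cosine similarity) from the target's reference embedding,
    signaling that a human should re-localize the object."""
    a = ref_embedding / np.linalg.norm(ref_embedding)
    b = crop_embedding / np.linalg.norm(crop_embedding)
    return float(a @ b) < threshold

ref = np.array([1.0, 0.0, 0.0])
assert not tracker_failed(ref, np.array([0.9, 0.1, 0.0]))   # still on target
assert tracker_failed(ref, np.array([0.0, 1.0, 0.0]))       # drifted: ask a human
```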
https://arxiv.org/abs/2405.03643
Implementing virtual fixtures in guiding tasks constrains the movement of the robot's end effector to specific curves within its workspace. However, guiding frameworks may encounter discontinuities when optimizing the reference target position to the nearest point relative to the current robot position. This article gives a geometric interpretation of such discontinuities, with specific reference to the commonly adopted Gauss-Newton algorithm. The effect of such discontinuities, defined as Euclidean Distance Singularities, is experimentally demonstrated. We then propose a solution based on a Linear Quadratic Tracking problem with a minimum-jerk command, and compare and validate the performance of the proposed framework in two different human-robot interaction scenarios.
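The nearest-point computation at the heart of the discussion can be sketched directly. Gauss-Newton on the squared distance to a parametric curve converges to a *local* minimum, which is why the optimal parameter can jump discontinuously as the robot position moves (the curve, point, and starting guess below are illustrative):

```python
import numpy as np

def nearest_param(curve, dcurve, p, t0, iters=50):
    """Gauss-Newton iteration for the curve parameter closest to point p.

    Minimizes ||curve(t) - p||^2. Only a local minimum is found, the
    geometric source of the Euclidean Distance Singularities above.
    """
    t = t0
    for _ in range(iters):
        r = curve(t) - p           # residual vector
        J = dcurve(t)              # Jacobian d curve / d t
        t = t - (J @ r) / (J @ J)  # Gauss-Newton step
    return t

curve = lambda t: np.array([t, t**2])        # a parabola as the fixture curve
dcurve = lambda t: np.array([1.0, 2.0 * t])
t_star = nearest_param(curve, dcurve, p=np.array([2.0, 1.0]), t0=1.0)
```

For a point on the curve's axis of symmetry, two local minima are equally near, and an infinitesimal motion of `p` flips which one the iteration returns.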
https://arxiv.org/abs/2405.03473
This paper explores how deep learning techniques can improve visual-based SLAM performance in challenging environments. By combining deep feature extraction and deep matching methods, we introduce a versatile hybrid visual SLAM system designed to enhance adaptability in challenging scenarios, such as low-light conditions, dynamic lighting, weak-texture areas, and severe jitter. Our system supports multiple modes, including monocular, stereo, monocular-inertial, and stereo-inertial configurations. We also analyze how to combine visual SLAM with deep learning methods to inform other research. Through extensive experiments on both public datasets and self-sampled data, we demonstrate the superiority of the SL-SLAM system over traditional approaches. The experimental results show that SL-SLAM outperforms state-of-the-art SLAM algorithms in terms of localization accuracy and tracking robustness. For the benefit of the community, we make the source code public at this https URL.
https://arxiv.org/abs/2405.03413
Complementary RGB and TIR modalities enable RGB-T tracking to achieve competitive performance in challenging scenarios. Therefore, how to better fuse cross-modal features is the core issue of RGB-T tracking. Some previous methods either insufficiently fuse RGB and TIR features, or depend on intermediaries containing information from both modalities to achieve cross-modal information interaction. The former does not fully exploit the potential of using only RGB and TIR information of the template or search region for channel and spatial feature fusion, and the latter lacks direct interaction between the template and search area, which limits the model's ability to fully exploit the original semantic information of both modalities. To alleviate these limitations, we explore how to improve the performance of a visual Transformer by using direct fusion of cross-modal channels and spatial features, and propose CSTNet. CSTNet uses ViT as a backbone and inserts cross-modal channel feature fusion modules (CFM) and cross-modal spatial feature fusion modules (SFM) for direct interaction between RGB and TIR features. The CFM performs parallel joint channel enhancement and joint multilevel spatial feature modeling of RGB and TIR features and sums the features, and then globally integrates the sum feature with the original features. The SFM uses cross-attention to model the spatial relationship of cross-modal features and then introduces a convolutional feedforward network for joint spatial and channel integration of multimodal features. Comprehensive experiments show that CSTNet achieves state-of-the-art performance on three public RGB-T tracking benchmarks. Code is available at this https URL.
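The SFM described above builds on cross-attention between modalities. The module's actual design is in the paper; the sketch below shows only the scaled dot-product cross-attention primitive, with the learned W_q, W_k, W_v projections omitted for brevity:

```python
import numpy as np

def cross_attention(q_feats, kv_feats):
    """Queries from one modality (e.g. RGB tokens) attend to keys/values
    from the other (e.g. TIR tokens) via scaled dot-product attention."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)       # softmax over TIR tokens
    return attn @ kv_feats

rgb = np.random.default_rng(0).normal(size=(4, 8))   # 4 RGB tokens
tir = np.random.default_rng(1).normal(size=(6, 8))   # 6 TIR tokens
out = cross_attention(rgb, tir)
assert out.shape == (4, 8)   # one fused vector per RGB query token
```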
https://arxiv.org/abs/2405.03177
This work considers the problem of optimal lane changing in a structured multi-agent road environment. A novel motion planning algorithm that can capture long-horizon dependencies as well as short-horizon dynamics is presented. Pivotal to our approach is a geometric approximation of the long-horizon combinatorial transition problem which we formulate in the continuous time-space domain. Moreover, a discrete-time formulation of a short-horizon optimal motion planning problem is formulated and combined with the long-horizon planner. Both individual problems, as well as their combination, are formulated as MIQP and solved in real-time by using state-of-the-art solvers. We show how the presented algorithm outperforms two other state-of-the-art motion planning algorithms in closed-loop performance and computation time in lane changing problems. Evaluations are performed using the traffic simulator SUMO, a custom low-level tracking model predictive controller, and high-fidelity vehicle models and scenarios, provided by the CommonRoad environment.
https://arxiv.org/abs/2405.02979
Multi-modal feature fusion, as a core investigative component of RGBT tracking, has given rise to numerous fusion studies in recent years. However, existing RGBT tracking methods widely adopt fixed fusion structures to integrate multi-modal features, which are hard to adapt to the various challenges of dynamic scenarios. To address this problem, this work presents a novel \emph{A}ttention-based \emph{F}usion rou\emph{ter} called AFter, which optimizes the fusion structure to adapt to dynamic challenging scenarios, for robust RGBT tracking. In particular, we design a fusion structure space based on a hierarchical attention network, where each attention-based fusion unit corresponds to a fusion operation and a combination of these attention units corresponds to a fusion structure. By optimizing the combination of attention-based fusion units, we can dynamically select the fusion structure to adapt to various challenging scenarios. Unlike the complex search over different structures in neural architecture search algorithms, we develop a dynamic routing algorithm, which equips each attention-based fusion unit with a router, to predict the combination weights for efficient optimization of the fusion structure. Extensive experiments on five mainstream RGBT tracking datasets demonstrate the superior performance of the proposed AFter against state-of-the-art RGBT trackers. We release the code at this https URL.
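The routing idea — soft combination weights over candidate fusion operations, predicted from the features themselves — can be sketched as follows. The stand-in ops and the linear router are illustrative only, not AFter's attention-based units:

```python
import numpy as np

def route_fusion(feat, fusion_ops, router_w):
    """Dynamic routing sketch: a tiny router maps the input feature to one
    logit per candidate fusion op; the output is the softmax-weighted sum
    of the ops' results, so the effective fusion structure adapts per input."""
    logits = feat @ router_w                  # feature -> one logit per op
    logits -= logits.max()                    # numerical stability
    w = np.exp(logits); w /= w.sum()          # softmax combination weights
    outs = np.stack([op(feat) for op in fusion_ops])
    return np.tensordot(w, outs, axes=1), w

ops = [lambda x: x, lambda x: 2 * x]          # two stand-in "fusion units"
rng = np.random.default_rng(0)
feat = rng.normal(size=5)
out, w = route_fusion(feat, ops, rng.normal(size=(5, 2)))
assert abs(w.sum() - 1.0) < 1e-12 and out.shape == feat.shape
```

Because the weights are differentiable, the structure selection can be trained end-to-end instead of searched combinatorially, which is the contrast with NAS drawn in the abstract.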
https://arxiv.org/abs/2405.02717
Latent Diffusion Models (LDMs) enable a wide range of applications but raise ethical concerns regarding illegal utilization. Adding watermarks to generative model outputs is a vital technique for copyright tracking and for mitigating potential risks associated with AI-generated content. However, post-hoc watermarking techniques are susceptible to evasion. Existing watermarking methods for LDMs can only embed fixed messages: altering the watermark message requires model retraining, and the stability of the watermark is influenced by model updates and iterations. Furthermore, current reconstruction-based watermark removal techniques utilizing variational autoencoders (VAE) and diffusion models are capable of removing a significant portion of watermarks. Therefore, we propose a novel technique called DiffuseTrace. The goal is to embed invisible watermarks in all generated images so that they can later be detected semantically. The method establishes a unified representation of the initial latent variables and the watermark information by training an encoder-decoder model. The watermark information is embedded into the initial latent variables through the encoder and integrated into the sampling process; it is extracted by reversing the diffusion process and applying the decoder. DiffuseTrace does not rely on fine-tuning of the diffusion model components. The watermark is embedded into the image space semantically without compromising image quality, and the encoder-decoder can be used as a plug-in in arbitrary diffusion models. We validate the effectiveness and flexibility of DiffuseTrace through experiments. DiffuseTrace holds an unprecedented advantage in combating the latest attacks based on variational autoencoders and diffusion models.
https://arxiv.org/abs/2405.02696
This paper aims to create a deep learning framework that can estimate the deformation vector field (DVF) for directly registering abdominal MRI-CT images. The proposed method assumed a diffeomorphic deformation. By using topology-preserved deformation features extracted from the probabilistic diffeomorphic registration model, abdominal motion can be accurately obtained and utilized for DVF estimation. The model integrated Swin transformers, which have demonstrated superior performance in motion tracking, into the convolutional neural network (CNN) for deformation feature extraction. The model was optimized using a cross-modality image similarity loss and a surface matching loss. To compute the image loss, a modality-independent neighborhood descriptor (MIND) was used between the deformed MRI and CT images. The surface matching loss was determined by measuring the distance between the warped coordinates of the surfaces of contoured structures on the MRI and CT images. The deformed MRI image was assessed against the CT image using the target registration error (TRE), Dice similarity coefficient (DSC), and mean surface distance (MSD) between the deformed contours of the MRI image and manual contours of the CT image. When compared to only rigid registration, DIR with the proposed method resulted in an increase of the mean DSC values of the liver and portal vein from 0.850 and 0.628 to 0.903 and 0.763, a decrease of the mean MSD of the liver from 7.216 mm to 3.232 mm, and a decrease of the TRE from 26.238 mm to 8.492 mm. The proposed deformable image registration method based on a diffeomorphic transformer provides an effective and efficient way to generate an accurate DVF from an MRI-CT image pair of the abdomen. It could be utilized in the current treatment planning workflow for liver radiotherapy.
https://arxiv.org/abs/2405.02692
Autonomous robots for gathering information on objects of interest have numerous real-world applications because they improve efficiency, performance, and safety. Realizing autonomy demands online planning algorithms to solve sequential decision-making problems under uncertainty: objects of interest are often dynamic, and object state, such as location, is not directly observable and is obtained from noisy measurements. Such planning problems are notoriously difficult due to the combinatorial nature of predicting the future to make optimal decisions. For information-theoretic planning algorithms, we develop a computationally efficient and effective approximation for the difficult problem of predicting the likely sensor measurements from uncertain belief states. The approach more accurately predicts the information gain from information-gathering actions. Our theoretical analysis proves the proposed formulation achieves a lower prediction error than the current efficient method. We demonstrate improved performance gains in radio-source tracking and localization problems using extensive simulated and field experiments with a multirotor aerial robot.
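Why predicting the likely measurement from a belief is hard: the expectation E[h(x)] over the belief generally differs from h evaluated at the belief mean. The tiny example below illustrates that gap (it is not the paper's approximation, just the failure mode that cheap predictors suffer from):

```python
import numpy as np

def predicted_measurement(particles, weights, sensor_pos, h):
    """Belief-weighted expected measurement E[h(x)] over a particle belief,
    rather than h(E[x]) at the mean state, which ignores belief spread."""
    return sum(w * h(x, sensor_pos) for x, w in zip(particles, weights))

range_meas = lambda x, s: np.linalg.norm(x - s)   # range-only sensor model

# Symmetric belief around the sensor: the mean state sits AT the sensor
# (predicted range 0), but the expected range over the belief is 1.
particles = [np.array([-1.0, 0.0]), np.array([1.0, 0.0])]
weights = [0.5, 0.5]
sensor = np.array([0.0, 0.0])
z_exp = predicted_measurement(particles, weights, sensor, range_meas)
mean_state = 0.5 * particles[0] + 0.5 * particles[1]
assert abs(z_exp - 1.0) < 1e-12
assert range_meas(mean_state, sensor) == 0.0
```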
https://arxiv.org/abs/2405.02605
We apply multi-agent deep reinforcement learning (RL) to train end-to-end robot soccer policies with fully onboard computation and sensing via egocentric RGB vision. This setting reflects many challenges of real-world robotics, including active perception, agile full-body control, and long-horizon planning in a dynamic, partially-observable, multi-agent domain. We rely on large-scale, simulation-based data generation to obtain complex behaviors from egocentric vision which can be successfully transferred to physical robots using low-cost sensors. To achieve adequate visual realism, our simulation combines rigid-body physics with learned, realistic rendering via multiple Neural Radiance Fields (NeRFs). We combine teacher-based multi-agent RL and cross-experiment data reuse to enable the discovery of sophisticated soccer strategies. We analyze active-perception behaviors including object tracking and ball seeking that emerge when simply optimizing perception-agnostic soccer play. The agents display equivalent levels of performance and agility as policies with access to privileged, ground-truth state. To our knowledge, this paper constitutes a first demonstration of end-to-end training for multi-agent robot soccer, mapping raw pixel observations to joint-level actions, that can be deployed in the real world. Videos of the game-play and analyses can be seen on our website this https URL .
https://arxiv.org/abs/2405.02425
Existing VLMs can track in-the-wild 2D video objects while current generative models provide powerful visual priors for synthesizing novel views for the highly under-constrained 2D-to-3D object lifting. Building upon this exciting progress, we present DreamScene4D, the first approach that can generate three-dimensional dynamic scenes of multiple objects from monocular in-the-wild videos with large object motion across occlusions and novel viewpoints. Our key insight is to design a "decompose-then-recompose" scheme to factorize both the whole video scene and each object's 3D motion. We first decompose the video scene by using open-vocabulary mask trackers and an adapted image diffusion model to segment, track, and amodally complete the objects and background in the video. Each object track is mapped to a set of 3D Gaussians that deform and move in space and time. We also factorize the observed motion into multiple components to handle fast motion. The camera motion can be inferred by re-rendering the background to match the video frames. For the object motion, we first model the object-centric deformation of the objects by leveraging rendering losses and multi-view generative priors in an object-centric frame, then optimize object-centric to world-frame transformations by comparing the rendered outputs against the perceived pixel and optical flow. Finally, we recompose the background and objects and optimize for relative object scales using monocular depth prediction guidance. We show extensive results on the challenging DAVIS, Kubric, and self-captured videos, detail some limitations, and provide future directions. Besides 4D scene generation, our results show that DreamScene4D enables accurate 2D point motion tracking by projecting the inferred 3D trajectories to 2D, while never explicitly trained to do so.
https://arxiv.org/abs/2405.02280
Autonomous locomotion for mobile ground robots in unstructured environments such as waypoint navigation or flipper control requires a sufficiently accurate prediction of the robot-terrain interaction. Heuristics like occupancy grids or traversability maps are widely used but limit actions available to robots with active flippers as joint positions are not taken into account. We present a novel iterative geometric method to predict the 3D pose of mobile ground robots with active flippers on uneven ground with high accuracy and online planning capabilities. This is achieved by utilizing the ability of signed distance fields to represent surfaces with sub-voxel accuracy. The effectiveness of the presented approach is demonstrated on two different tracked robots in simulation and on a real platform. Compared to a tracking system as ground truth, our method predicts the robot position and orientation with an average accuracy of 3.11 cm and 3.91°, outperforming a recent heightmap-based approach. The implementation is made available as an open-source ROS package.
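The sub-voxel accuracy mentioned above comes from interpolating the signed distance field between voxel centers, so the zero level set (the surface) can be located inside a voxel. A minimal trilinear sampling sketch:

```python
import numpy as np

def sdf_trilinear(grid, p):
    """Sample a voxelized signed distance field at a continuous point by
    trilinear interpolation. `grid[i, j, k]` holds the signed distance at
    integer voxel (i, j, k); the blend of the 8 surrounding corners gives
    sub-voxel surface accuracy."""
    i0 = np.floor(p).astype(int)
    f = p - i0                                   # fractional position in the cell
    val = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = ((f[0] if dx else 1 - f[0]) *
                     (f[1] if dy else 1 - f[1]) *
                     (f[2] if dz else 1 - f[2]))
                val += w * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return val

# SDF of the plane z = 1.25 on a 4^3 grid: signed distance = z - 1.25.
zs = np.arange(4.0)
grid = np.broadcast_to(zs - 1.25, (4, 4, 4)).copy()
# The interpolated zero level sits between the voxel layers z=1 and z=2.
assert abs(sdf_trilinear(grid, np.array([1.5, 1.5, 1.25]))) < 1e-12
```

Track-terrain contact can then be resolved against this interpolated field rather than against the raw voxel grid.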
https://arxiv.org/abs/2405.02121