Existing VLMs can track in-the-wild 2D video objects while current generative models provide powerful visual priors for synthesizing novel views for the highly under-constrained 2D-to-3D object lifting. Building upon this exciting progress, we present DreamScene4D, the first approach that can generate three-dimensional dynamic scenes of multiple objects from monocular in-the-wild videos with large object motion across occlusions and novel viewpoints. Our key insight is to design a "decompose-then-recompose" scheme to factorize both the whole video scene and each object's 3D motion. We first decompose the video scene by using open-vocabulary mask trackers and an adapted image diffusion model to segment, track, and amodally complete the objects and background in the video. Each object track is mapped to a set of 3D Gaussians that deform and move in space and time. We also factorize the observed motion into multiple components to handle fast motion. The camera motion can be inferred by re-rendering the background to match the video frames. For the object motion, we first model the object-centric deformation of the objects by leveraging rendering losses and multi-view generative priors in an object-centric frame, then optimize object-centric to world-frame transformations by comparing the rendered outputs against the perceived pixel and optical flow. Finally, we recompose the background and objects and optimize for relative object scales using monocular depth prediction guidance. We show extensive results on the challenging DAVIS, Kubric, and self-captured videos, detail some limitations, and provide future directions. Besides 4D scene generation, our results show that DreamScene4D enables accurate 2D point motion tracking by projecting the inferred 3D trajectories to 2D, while never explicitly trained to do so.
https://arxiv.org/abs/2405.02280
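To make the "decompose-then-recompose" motion factorization concrete, here is a minimal numpy sketch of how one frame can be recomposed from the three factored components named above: an object-centric deformation, an object-to-world rigid transform, and the camera transform. All names and values are illustrative, not from the paper's code; in the paper each factor is fitted against its own signal (rendering losses and generative priors for the deformation, pixel and flow comparisons for the world-frame transform, background re-rendering for the camera).

```python
import numpy as np

def se3(R, t):
    """Pack a 3x3 rotation and a translation into a 4x4 rigid transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def recompose_frame(canonical_pts, deformation, obj_to_world, world_to_cam):
    """Recompose one frame from the factored motion components: non-rigid
    deformation in the object-centric frame, then the object-to-world rigid
    motion, then the camera transform."""
    deformed = canonical_pts + deformation              # object-centric part
    homo = np.c_[deformed, np.ones(len(deformed))]      # homogeneous coords
    world = (obj_to_world @ homo.T).T                   # object -> world
    cam = (world_to_cam @ world.T).T                    # world -> camera
    return cam[:, :3]

# Toy example: five Gaussian centers, a tiny deformation, a rigid shift.
pts = np.random.randn(5, 3)
deform = 0.01 * np.random.randn(5, 3)
T_obj = se3(np.eye(3), np.array([1.0, 0.0, 0.0]))
T_cam = se3(np.eye(3), np.zeros(3))
print(recompose_frame(pts, deform, T_obj, T_cam))
```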
Autonomous locomotion tasks for mobile ground robots in unstructured environments, such as waypoint navigation or flipper control, require a sufficiently accurate prediction of the robot-terrain interaction. Heuristics like occupancy grids or traversability maps are widely used but limit the actions available to robots with active flippers, as joint positions are not taken into account. We present a novel iterative geometric method to predict the 3D pose of mobile ground robots with active flippers on uneven ground with high accuracy and online planning capabilities. This is achieved by utilizing the ability of signed distance fields to represent surfaces with sub-voxel accuracy. The effectiveness of the presented approach is demonstrated on two different tracked robots in simulation and on a real platform. Compared to a tracking system as ground truth, our method predicts the robot position and orientation with an average accuracy of 3.11 cm and 3.91°, outperforming a recent heightmap-based approach. The implementation is made available as an open-source ROS package.
https://arxiv.org/abs/2405.02121
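As a rough one-degree-of-freedom illustration of the iterative geometric idea (not the released ROS package), the sketch below settles a robot vertically until its lowest contact point rests on the surface encoded by a signed distance field. The analytic SDF and the contact points are invented for the example; the paper queries a voxelized SDF with sub-voxel interpolation and also resolves orientation and flipper joints.

```python
import numpy as np

def sdf(p):
    """Toy analytic SDF standing in for the paper's voxel SDF: signed
    distance to a gently sloped ground plane z = 0.1 * x."""
    return p[:, 2] - 0.1 * p[:, 0]

def settle_height(contact_pts, z0=1.0, iters=20, tol=1e-6):
    """Iteratively lower (or raise) the robot until its smallest contact
    clearance is zero; for curved terrain the loop refines the estimate."""
    z = z0
    for _ in range(iters):
        d = sdf(contact_pts + np.array([0.0, 0.0, z]))
        if abs(d.min()) < tol:
            break
        z -= d.min()                  # step by the smallest clearance
    return z

# Four track contact points in the body frame.
pts = np.array([[0.3, 0.2, 0.0], [0.3, -0.2, 0.0],
                [-0.3, 0.2, 0.0], [-0.3, -0.2, 0.0]])
print("settled height:", settle_height(pts))
```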
Recent advancements have showcased the potential of handheld millimeter-wave (mmWave) imaging, which applies synthetic aperture radar (SAR) principles in portable settings. However, existing studies addressing handheld motion errors either rely on costly tracking devices or employ simplified imaging models, leading to impractical deployment or limited performance. In this paper, we present IFNet, a novel deep unfolding network that combines the strengths of signal processing models and deep neural networks to achieve robust imaging and focusing for handheld mmWave systems. We first formulate the handheld imaging model by integrating multiple priors about mmWave images and handheld phase errors. Furthermore, we transform the optimization processes into an iterative network structure for improved and efficient imaging performance. Extensive experiments demonstrate that IFNet effectively compensates for handheld phase errors and recovers high-fidelity images from severely distorted signals. In comparison with existing methods, IFNet can achieve at least 11.89 dB improvement in average peak signal-to-noise ratio (PSNR) and 64.91% improvement in average structural similarity index measure (SSIM) on a real-world dataset.
https://arxiv.org/abs/2405.02023
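Deep unfolding in general turns the iterations of a model-based solver into network layers whose step sizes and thresholds are learned. The toy below unrolls ISTA for a plain sparsity prior; IFNet itself additionally models handheld phase errors and learns its per-layer parameters, so this is only a sketch of the unfolding mechanic with invented sizes and constants.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the L1 norm (the sparsity prior)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def unfolded_reconstruction(A, y, steps, lams):
    """Unrolled ISTA: each iteration becomes one 'layer'; in a deep
    unfolding network `steps` and `lams` would be learned per layer."""
    x = np.zeros(A.shape[1])
    for step, lam in zip(steps, lams):
        grad = A.T @ (A @ x - y)          # data-consistency gradient
        x = soft_threshold(x - step * grad, lam)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 128)) / 8.0  # toy measurement operator
x_true = np.zeros(128)
x_true[[3, 40, 77]] = 1.0                 # a sparse 'image'
y = A @ x_true
x_hat = unfolded_reconstruction(A, y, steps=[0.1] * 30, lams=[0.01] * 30)
print("recovered support:", sorted(np.argsort(-np.abs(x_hat))[:3]))
```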
Augmented reality (AR) has the potential to improve the immersion and efficiency of computer-assisted orthopaedic surgery (CAOS) by allowing surgeons to maintain focus on the operating site rather than external displays in the operating theatre. Successful deployment of AR to CAOS requires a calibration that can accurately calculate the spatial relationship between real and holographic objects. Several studies attempt this calibration through manual alignment or with additional fiducial markers in the surgical scene. We propose a calibration system that offers a direct method for the calibration of AR head-mounted displays (HMDs) with CAOS systems, by using infrared-reflective marker-arrays widely used in CAOS. In our fast, user-agnostic setup, a HoloLens 2 detected the pose of marker arrays using infrared response and time-of-flight depth obtained through sensors onboard the HMD. Registration with a commercially available CAOS system was achieved when an IR marker-array was visible to both devices. Study tests found relative-tracking mean errors of 2.03 mm and 1.12° when calculating the relative pose between two static marker-arrays at short ranges. When using the calibration result to provide in-situ holographic guidance for a simulated wire-insertion task, a pre-clinical test reported mean errors of 2.07 mm and 1.54° when compared to a pre-planned trajectory.
https://arxiv.org/abs/2405.01999
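The reported numbers are relative-pose errors between tracked marker arrays. As a generic sketch (not the study's code), computing a relative pose from two 4x4 poses and scoring it against ground truth can look like this; the example transforms are hypothetical:

```python
import numpy as np

def relative_pose(T_a, T_b):
    """Pose of marker array B expressed in marker array A's frame."""
    return np.linalg.inv(T_a) @ T_b

def pose_error(T_est, T_gt):
    """Translation error (units of the transforms) and rotation error in
    degrees between an estimated and a ground-truth pose."""
    dT = np.linalg.inv(T_gt) @ T_est
    t_err = np.linalg.norm(dT[:3, 3])
    cos_ang = (np.trace(dT[:3, :3]) - 1.0) / 2.0
    r_err = np.degrees(np.arccos(np.clip(cos_ang, -1.0, 1.0)))
    return t_err, r_err

# Hypothetical case: HMD and CAOS system both observe two static arrays.
T_a_hmd, T_b_hmd = np.eye(4), np.eye(4)
T_b_hmd[:3, 3] = [0.201, 0.0, 0.0]        # HMD sees B 201 mm from A
T_a_caos, T_b_caos = np.eye(4), np.eye(4)
T_b_caos[:3, 3] = [0.200, 0.0, 0.0]       # CAOS sees B 200 mm from A
print(pose_error(relative_pose(T_a_hmd, T_b_hmd),
                 relative_pose(T_a_caos, T_b_caos)))   # ~(0.001 m, 0.0 deg)
```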
An accurate detection and tracking of devices such as guiding catheters in live X-ray image acquisitions is an essential prerequisite for endovascular cardiac interventions. This information is leveraged for procedural guidance, e.g., directing stent placements. To ensure procedural safety and efficacy, there is a need for high robustness, i.e., no failures during tracking. To achieve that, one needs to efficiently tackle challenges such as: device obscuration by contrast agent or other external devices or wires, changes in field-of-view or acquisition angle, as well as the continuous movement due to cardiac and respiratory motion. To overcome the aforementioned challenges, we propose a novel approach to learn spatio-temporal features from a very large data cohort of over 16 million interventional X-ray frames using self-supervision for image sequence data. Our approach is based on a masked image modeling technique that leverages frame-interpolation-based reconstruction to learn fine inter-frame temporal correspondences. The features encoded in the resulting model are fine-tuned downstream. Our approach achieves state-of-the-art performance and, in particular, robustness compared to ultra-optimized reference solutions (that use multi-stage feature fusion, multi-task learning, and flow regularization). The experiments show that our method achieves a 66.31% reduction in maximum tracking error against reference solutions (23.20% when flow regularization is used), achieving a success score of 97.95% at a 3x faster inference speed of 42 frames per second (on GPU). The results encourage the use of our approach in various other tasks within interventional image analytics that require effective understanding of spatio-temporal semantics.
https://arxiv.org/abs/2405.01156
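The pretext task described above pairs masked image modeling with frame interpolation: intermediate frames are hidden and reconstructed from their neighbors, forcing the encoder to learn inter-frame correspondences. A minimal stand-in, with a linear blend in place of the learned encoder-decoder, is:

```python
import numpy as np

def interpolation_pretext_loss(frames, mask_idx):
    """Hide frame `mask_idx` and score a reconstruction from its two
    neighbours; a learned interpolation network replaces the linear
    blend in the actual self-supervised setup."""
    target = frames[mask_idx]
    recon = 0.5 * (frames[mask_idx - 1] + frames[mask_idx + 1])
    return float(np.mean((recon - target) ** 2))

clip = np.random.rand(3, 64, 64)   # toy 3-frame grayscale X-ray clip
print("reconstruction loss:", interpolation_pretext_loss(clip, 1))
```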
Sports analysis and viewing play a pivotal role in the current sports domain, offering significant value not only to coaches and athletes but also to fans and the media. In recent years, the rapid development of virtual reality (VR) and augmented reality (AR) technologies has introduced a new platform for watching games. Visualization of sports competitions in VR/AR represents a revolutionary technology, providing audiences with a novel immersive viewing experience. However, there is still a lack of related research in this area. In this work, we present for the first time a comprehensive system for sports competition analysis and real-time visualization on VR/AR platforms. First, we utilize multiview LiDARs and cameras to collect multimodal game data. Subsequently, we propose a framework for multi-player tracking and pose estimation based on a limited amount of supervised data, which extracts precise player positions and movements from point clouds and images. Moreover, we perform avatar modeling of players to obtain their 3D models. Ultimately, using these 3D player data, we conduct competition analysis and real-time visualization on VR/AR. Extensive quantitative experiments demonstrate the accuracy and robustness of our multi-player tracking and pose estimation framework. The visualization results showcase the immense potential of our sports visualization system in the domain of watching games on VR/AR devices. The multimodal competition dataset we collected and all related code will be released soon.
https://arxiv.org/abs/2405.01112
The development of Audio Description (AD) has been a pivotal step forward in making video content more accessible and inclusive. Traditionally, AD production has demanded a considerable amount of skilled labor, while existing automated approaches still necessitate extensive training to integrate multimodal inputs and tailor the output from a captioning style to an AD style. In this paper, we introduce an automated AD generation pipeline that harnesses the potent multimodal and instruction-following capacities of GPT-4V(ision). Notably, our methodology employs readily available components, eliminating the need for additional training. It produces ADs that not only comply with established natural language AD production standards but also maintain contextually consistent character information across frames, courtesy of a tracking-based character recognition module. A thorough analysis on the MAD dataset reveals that our approach achieves a performance on par with learning-based methods in automated AD production, as substantiated by a CIDEr score of 20.5.
https://arxiv.org/abs/2405.00983
This paper investigates the differentiable dynamic modeling of mobile manipulators to facilitate efficient motion planning and physical design of actuators, where the actuator design is parameterized by physically meaningful motor geometry parameters. These parameters impact the manipulator's link mass, inertia, center-of-mass, torque constraints, and angular velocity constraints, influencing control authority in motion planning and trajectory tracking control. A motor's maximum torque/speed and how the design parameters affect the dynamics are modeled analytically, facilitating differentiable and analytical dynamic modeling. Additionally, an integrated locomotion and manipulation planning problem is formulated with direct collocation discretization, using the proposed differentiable dynamics and motor parameterization. Such dynamics are required to capture the dynamic coupling between the base and the manipulator. Numerical experiments demonstrate the effectiveness of differentiable dynamics in speeding up optimization and its advantages in task completion time and energy consumption over an established sequential motion planning approach. Finally, this paper introduces a simultaneous actuator design and motion planning framework, providing numerical results to validate the proposed differentiable modeling approach for co-design problems.
https://arxiv.org/abs/2405.00882
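As a toy example of the kind of parameterization described above, where motor geometry analytically determines mass and the torque-speed envelope so that planning stays differentiable in the design variables, consider the sketch below; the scaling constants are invented, not the paper's model.

```python
import numpy as np

def motor_model(radius, length, k_mass=3e3, k_tau=7e4, p_max=250.0):
    """Toy analytic motor model: mass and peak torque scale with rotor
    volume, and the torque-speed envelope is constant-torque up to the
    power limit. Everything is smooth in (radius, length)."""
    volume = np.pi * radius**2 * length
    mass = k_mass * volume                    # contributes to link mass
    tau_peak = k_tau * volume                 # torque constraint
    def tau_max(omega):
        # constant-torque region, then the power-limited region
        return np.minimum(tau_peak, p_max / np.maximum(np.abs(omega), 1e-6))
    return mass, tau_peak, tau_max

mass, tau_peak, tau_max = motor_model(radius=0.04, length=0.03)
print(f"mass={mass:.2f} kg, peak torque={tau_peak:.2f} Nm, "
      f"tau_max(50 rad/s)={tau_max(50.0):.2f} Nm")
```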
Incorporating human-perceptual intelligence into model training has been shown to increase the generalization capability of models in several difficult biometric tasks, such as presentation attack detection (PAD) and detection of synthetic samples. After the initial collection phase, human visual saliency (e.g., eye-tracking data or handwritten annotations) can be integrated into model training through attention mechanisms, augmented training samples, or human perception-related components of loss functions. Despite their successes, a vital but seemingly neglected aspect of any saliency-based training is the level of salience granularity (e.g., bounding boxes, single saliency maps, or saliency aggregated from multiple subjects) necessary to strike a balance between reaping the full benefits of human saliency and the cost of its collection. In this paper, we explore several different levels of salience granularity and demonstrate that increased generalization capabilities of PAD and synthetic face detection can be achieved by using simple yet effective saliency post-processing techniques across several different CNNs.
https://arxiv.org/abs/2405.00650
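A minimal sketch of three of the salience-granularity levels contrasted above, going from multi-subject aggregation down to a single map and a bounding box. Inputs are assumed to be per-subject heatmaps; the threshold and sizes are invented.

```python
import numpy as np

def aggregate(maps):
    """Multi-subject granularity: average saliency over all subjects."""
    return maps.mean(axis=0)

def single_map(maps, subject=0):
    """Single-subject granularity: one saliency map as-is."""
    return maps[subject]

def bbox_mask(sal, thresh=0.5):
    """Coarsest granularity: bounding box around the salient region."""
    ys, xs = np.where(sal >= thresh * sal.max())
    mask = np.zeros_like(sal)
    if len(ys):
        mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1] = 1.0
    return mask

subjects = np.random.rand(5, 32, 32)    # toy eye-tracking heatmaps
print(int(bbox_mask(aggregate(subjects)).sum()), "pixels inside the box")
```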
Underwater robots play a crucial role in exploring aquatic environments. The ability to flexibly adjust their attitudes is essential for underwater robots to effectively accomplish tasks in confined space. However, the highly coupled six degrees of freedom dynamics resulting from attitude changes and the complex turbulence within limited spatial areas present significant challenges. To address the problem of attitude control of underwater robots, this letter investigates large-range pitch angle tracking during station holding as well as simultaneous roll and yaw angle control to enable versatile attitude adjustments. Based on dynamic modeling, this letter proposes an adaptive integral sliding mode controller (AISMC) that integrates an integral module into traditional sliding mode control (SMC) and adaptively adjusts the switching gain for improved tracking accuracy, reduced chattering, and enhanced robustness. The stability of the closed-loop control system is established through Lyapunov analysis. Extensive experiments and comparison studies are conducted using a commercial remotely operated vehicle (ROV), the results of which demonstrate that AISMC achieves satisfactory performance in attitude tracking control in confined space with unknown disturbances, significantly outperforming both PID and SMC.
https://arxiv.org/abs/2405.00269
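For intuition, here is a one-degree-of-freedom sketch of an adaptive integral sliding-mode law in the spirit of AISMC: an integral term in the sliding surface, a switching gain that adapts upward while the surface is far from zero, and a boundary layer to limit chattering. All gains and the toy plant are invented; the paper derives the controller for the full coupled ROV dynamics and proves stability via Lyapunov analysis.

```python
import numpy as np

def aismc_step(e, e_dot, e_int, K, dt, lam=2.0, ki=0.5, gamma=5.0, phi=0.05):
    """One step of a 1-DoF adaptive integral sliding-mode law: the surface
    s mixes the error, its derivative, and its integral; the switching
    gain K adapts while |s| is large; phi is the boundary-layer width."""
    s = e_dot + lam * e + ki * e_int
    K = K + gamma * abs(s) * dt                        # adaptive gain
    u = -lam * e_dot - ki * e - K * np.clip(s / phi, -1.0, 1.0)
    return u, K, s

# Toy plant: double integrator with an unknown constant disturbance of 0.3.
x, v, e_int, K, dt = 1.0, 0.0, 0.0, 0.1, 0.01
for _ in range(3000):
    u, K, s = aismc_step(x, v, e_int, K, dt)
    v += (u + 0.3) * dt
    x += v * dt
    e_int += x * dt
print(f"final error {x:+.4f}, adapted switching gain {K:.2f}")
```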
Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safety while driving, the tracker must be able to reliably track objects across frames and accurately estimate their states such as velocity and acceleration in the present. Existing works frequently focus on the association task while either neglecting the model performance on state estimation or deploying complex heuristics to predict the states. In this paper, we propose STT, a Stateful Tracking model built with Transformers, that can consistently track objects in the scenes while also predicting their states accurately. STT consumes rich appearance, geometry, and motion signals through long term history of detections and is jointly optimized for both data association and state estimation tasks. Since the standard tracking metrics like MOTA and MOTP do not capture the combined performance of the two tasks in the wider spectrum of object states, we extend them with new metrics called S-MOTA and MOTPS that address this limitation. STT achieves competitive real-time performance on the Waymo Open Dataset.
https://arxiv.org/abs/2405.00236
RGBT tracking draws increasing attention due to its robustness in multi-modality warranting (MMW) scenarios, such as nighttime and bad weather, where relying on a single sensing modality fails to ensure stable tracking results. However, the existing benchmarks predominantly consist of videos collected in common scenarios where both RGB and thermal infrared (TIR) information is of sufficient quality. This makes the data unrepresentative of severe imaging conditions, leading to tracking failures in MMW scenarios. To bridge this gap, we present a new benchmark, MV-RGBT, captured specifically in MMW scenarios. In contrast with the existing datasets, MV-RGBT comprises more object categories and scenes, providing a diverse and challenging benchmark. Furthermore, for the severe imaging conditions of MMW scenarios, a new problem is posed, namely "when to fuse", to stimulate the development of fusion strategies for such data. We propose a new method based on a mixture of experts, namely MoETrack, as a baseline fusion strategy. In MoETrack, each expert generates independent tracking results along with the corresponding confidence score, which is used to control the fusion process. Extensive experimental results demonstrate the significant potential of MV-RGBT in advancing RGBT tracking and elicit the conclusion that fusion is not always beneficial, especially in MMW scenarios. Significantly, the proposed MoETrack method achieves new state-of-the-art results not only on MV-RGBT, but also on standard benchmarks, such as RGBT234, LasHeR, and the short-term split of VTUAV (VTUAV-ST). More information on MV-RGBT and the source code of MoETrack will be released at this https URL.
https://arxiv.org/abs/2405.00168
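The abstract describes experts that each output an independent tracking result plus a confidence score that controls fusion. A minimal two-expert sketch of confidence-gated fusion follows; the gating rule, margin, and box format are assumptions for illustration, not MoETrack's actual strategy.

```python
from dataclasses import dataclass

@dataclass
class ExpertOutput:
    box: tuple          # (x, y, w, h) track hypothesis
    confidence: float   # the expert's own reliability score

def fuse(rgb: ExpertOutput, tir: ExpertOutput, margin: float = 0.1):
    """Trust a single modality when it is clearly more confident; blend
    only when confidences are close. This reflects the paper's point
    that fusing both modalities is not always beneficial."""
    if rgb.confidence - tir.confidence > margin:
        return rgb.box
    if tir.confidence - rgb.confidence > margin:
        return tir.box
    w = rgb.confidence / (rgb.confidence + tir.confidence)
    return tuple(w * a + (1 - w) * b for a, b in zip(rgb.box, tir.box))

print(fuse(ExpertOutput((10, 10, 40, 60), 0.9),
           ExpertOutput((12, 11, 40, 58), 0.4)))   # RGB wins outright
```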
We address the challenge of content diversity and controllability in pedestrian simulation for driving scenarios. Recent pedestrian animation frameworks have a significant limitation wherein they primarily focus on either following a given trajectory [46] or the content of a reference video [57], consequently overlooking the potential diversity of human motion within such scenarios. This limitation restricts their ability to generate pedestrian behaviors that exhibit a wider range of variations and realistic motions, and therefore restricts their usage for providing rich motion content to other components in the driving simulation system, e.g., suddenly changed motion to which the autonomous vehicle should respond. In our approach, we strive to surpass this limitation by showcasing diverse human motions obtained from various sources, such as generated human motions, in addition to following the given trajectory. The fundamental contribution of our framework lies in combining the motion tracking task with trajectory following, which enables the tracking of specific motion parts (e.g., the upper body) while simultaneously following the given trajectory with a single policy. This way, we significantly enhance both the diversity of simulated human motion within the given scenario and the controllability of the content, including language-based control. Our framework facilitates the generation of a wide range of human motions, contributing to greater realism and adaptability in pedestrian simulations for driving scenarios. More information is on our project page: this https URL.
https://arxiv.org/abs/2404.19722
We propose RTG-SLAM, a real-time 3D reconstruction system with an RGBD camera for large-scale environments using Gaussian splatting. RTG-SLAM features a compact Gaussian representation and a highly efficient on-the-fly Gaussian optimization scheme. We force each Gaussian to be either opaque or nearly transparent, with the opaque ones fitting the surface and dominant colors, and transparent ones fitting residual colors. By rendering depth in a different way from color rendering, we let a single opaque Gaussian well fit a local surface region without the need of multiple overlapping Gaussians, hence largely reducing the memory and computation cost. For on-the-fly Gaussian optimization, we explicitly add Gaussians for three types of pixels per frame: newly observed, with large color errors and with large depth errors. We also categorize all Gaussians into stable and unstable ones, where the stable Gaussians are expected to well fit previously observed RGBD images and otherwise unstable. We only optimize the unstable Gaussians and only render the pixels occupied by unstable Gaussians. In this way, both the number of Gaussians to be optimized and pixels to be rendered are largely reduced, and the optimization can be done in real time. We show real-time reconstructions of a variety of real large scenes. Compared with the state-of-the-art NeRF-based RGBD SLAM, our system achieves comparable high-quality reconstruction but with around twice the speed and half the memory cost, and shows superior performance in the realism of novel view synthesis and camera tracking accuracy.
https://arxiv.org/abs/2404.19706
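A tiny sketch of the stable/unstable bookkeeping described above: Gaussians whose rendered color or depth errors on previously observed frames exceed a threshold are flagged unstable, and only those are optimized and re-rendered. The thresholds and error values are invented.

```python
import numpy as np

def partition_gaussians(color_err, depth_err, tc=0.05, td=0.02):
    """Split Gaussians into stable ones (fit past RGBD frames well)
    and unstable ones (to be optimized this frame)."""
    unstable = (color_err > tc) | (depth_err > td)
    return ~unstable, unstable

n = 10_000
color_err = np.random.rand(n) * 0.1     # per-Gaussian rendered color error
depth_err = np.random.rand(n) * 0.04    # per-Gaussian rendered depth error
stable, unstable = partition_gaussians(color_err, depth_err)
print(f"optimizing {unstable.sum()} of {n} Gaussians this frame")
```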
While camera-based capture systems remain the gold standard for recording human motion, learning-based tracking systems based on sparse wearable sensors are gaining popularity. Most commonly, they use inertial sensors, whose propensity for drift and jitter has so far limited tracking accuracy. In this paper, we propose Ultra Inertial Poser, a novel 3D full-body pose estimation method that constrains drift and jitter in inertial tracking via inter-sensor distances. We estimate these distances across sparse sensor setups using a lightweight embedded tracker that augments inexpensive off-the-shelf 6D inertial measurement units with ultra-wideband radio-based ranging, dynamically and without the need for stationary reference anchors. Our method then fuses these inter-sensor distances with the 3D states estimated from each sensor. Our graph-based machine learning model processes the 3D states and distances to estimate a person's 3D full-body pose and translation. To train our model, we synthesize inertial measurements and distance estimates from the motion capture database AMASS. For evaluation, we contribute a novel motion dataset of 10 participants who performed 25 motion types, captured by 6 wearable IMU+UWB trackers and an optical motion capture system, totaling 200 minutes of synchronized sensor data (UIP-DB). Our extensive experiments show state-of-the-art performance for our method over PIP and TIP, reducing position error from $13.62$ to $10.65cm$ ($22\%$ better) and lowering jitter from $1.56$ to $0.055km/s^3$ (a reduction of $97\%$).
https://arxiv.org/abs/2404.19541
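The core constraint is that measured inter-sensor UWB ranges should pin down per-sensor position estimates that would otherwise drift. A gradient-descent toy of that fusion step is below; the paper instead feeds the 3D states and distances into a graph-based learned model, so this only illustrates why the ranges are informative.

```python
import numpy as np

def refine_with_ranges(pos, ranges, iters=200, lr=0.1):
    """Nudge per-sensor 3D positions so pairwise distances match the
    measured UWB ranges; pos is (N, 3), ranges is (N, N)."""
    pos = pos.copy()
    n = len(pos)
    for _ in range(iters):
        grad = np.zeros_like(pos)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = pos[i] - pos[j]
                dist = np.linalg.norm(d) + 1e-9
                grad[i] += (dist - ranges[i, j]) * d / dist
        pos -= lr * grad / n
    return pos

true = np.random.rand(6, 3)                          # 6 body-worn sensors
ranges = np.linalg.norm(true[:, None] - true[None, :], axis=-1)
drifted = true + 0.05 * np.random.randn(6, 3)        # inertial drift
refined = refine_with_ranges(drifted, ranges)
err = np.abs(np.linalg.norm(refined[:, None] - refined[None, :], axis=-1)
             - ranges).mean()
print("mean pairwise-distance error after refinement:", err)
```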
Neural language models, particularly large-scale ones, have consistently been proven to be most effective in predicting brain neural activity across a range of studies. However, previous research overlooked the comparison of these models with psychologically plausible ones. Moreover, evaluations were reliant on limited, single-modality, English-only cognitive datasets. To address these issues, we conducted an analysis comparing the encoding performance of various neural language models and psychologically plausible models. Our study utilized extensive multi-modal cognitive datasets, examining bilingual word and discourse levels. Surprisingly, our findings revealed that psychologically plausible models outperformed neural language models across diverse contexts, encompassing different modalities such as fMRI and eye-tracking, and spanning languages from English to Chinese. Among psychologically plausible models, the one incorporating embodied information emerged as particularly exceptional. This model demonstrated superior performance at both the word and discourse levels, exhibiting robust prediction of brain activation across numerous regions in both English and Chinese.
https://arxiv.org/abs/2404.19364
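Encoding performance in studies like this is typically measured by regressing brain responses on model-derived features with cross-validation. A generic ridge-regression sketch of that analysis follows, on toy data; in the study the feature columns would come from a neural LM or from a psychologically plausible model (e.g., embodied norms), and the targets would be fMRI or eye-tracking measures.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
features = rng.standard_normal((200, 50))   # 200 words x 50 model features
weights = rng.standard_normal(50)
voxel = features @ weights + 0.5 * rng.standard_normal(200)  # toy response

scores = cross_val_score(Ridge(alpha=10.0), features, voxel,
                         cv=5, scoring="r2")
print("cross-validated encoding R^2:", scores.mean())
```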
Video object segmentation (VOS) aims to distinguish and track target objects in a video. Despite the excellent performance achieved by off-the-shelf VOS models, existing VOS benchmarks mainly focus on short-term videos lasting about 5 seconds, where objects remain visible most of the time. However, these benchmarks poorly represent practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic scenarios. Thus, we propose a novel benchmark named LVOS, comprising 720 videos with 296,401 frames and 407,945 high-quality annotations. Videos in LVOS last 1.14 minutes on average, approximately 5 times longer than videos in existing datasets. Each video includes various attributes, especially challenges deriving from the wild, such as long-term reappearance and cross-temporal similar objects. Compared to previous benchmarks, our LVOS better reflects VOS models' performance in real scenarios. Based on LVOS, we evaluate 20 existing VOS models under 4 different settings and conduct a comprehensive analysis. On LVOS, these models suffer a large performance drop, highlighting the challenge of achieving precise tracking and segmentation in real-world scenarios. Attribute-based analysis indicates that the key factor in the accuracy decline is the increased video length, emphasizing LVOS's crucial role. We hope our LVOS can advance the development of VOS in real scenes. Data and code are available at this https URL.
https://arxiv.org/abs/2404.19326
This work introduces DiffuseLoco, a framework for training multi-skill diffusion-based policies for dynamic legged locomotion from offline datasets, enabling real-time control of diverse skills on robots in the real world. Offline learning at scale has led to breakthroughs in the computer vision, natural language processing, and robotic manipulation domains. However, scaling up learning for legged robot locomotion, especially with multiple skills in a single policy, presents significant challenges for prior online reinforcement learning methods. To address this challenge, we propose a novel, scalable framework that leverages diffusion models to directly learn from offline multimodal datasets with a diverse set of locomotion skills. With design choices tailored for real-time control in dynamical systems, including receding horizon control and delayed inputs, DiffuseLoco is capable of reproducing multimodality in performing various locomotion skills and of zero-shot transfer to real quadrupedal robots, and it can be deployed on edge computing devices. Furthermore, DiffuseLoco demonstrates free transitions between skills and robustness against environmental variations. Through extensive benchmarking in real-world experiments, DiffuseLoco exhibits better stability and velocity tracking performance compared to prior reinforcement learning and non-diffusion-based behavior cloning baselines. The design choices are validated via comprehensive ablation studies. This work opens new possibilities for scaling up learning-based legged locomotion controllers through the scaling of large, expressive models and diverse offline datasets.
https://arxiv.org/abs/2404.19264
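Two of the design choices named above, receding horizon control and delayed inputs, can be sketched independently of the diffusion model itself. In the toy below, `sample_chunk` is a placeholder for reverse-diffusion sampling of an action sequence, and all dimensions are invented.

```python
import collections
import numpy as np

class RecedingHorizonController:
    """Sample an action chunk from a (placeholder) diffusion policy,
    execute only its head, then replan; observations are consumed with
    a fixed delay to mirror real deployment latency."""

    def __init__(self, horizon=16, execute=4, obs_delay=2):
        self.horizon, self.execute = horizon, execute
        self.obs_buffer = collections.deque(maxlen=obs_delay + 1)

    def sample_chunk(self, obs):
        # Placeholder for reverse-diffusion sampling of a (horizon x 12)
        # sequence of joint targets.
        return np.tanh(np.outer(np.linspace(1.0, 0.0, self.horizon), obs[:12]))

    def step(self, obs):
        self.obs_buffer.append(obs)
        delayed_obs = self.obs_buffer[0]       # act on a delayed observation
        chunk = self.sample_chunk(delayed_obs)
        return chunk[: self.execute]           # receding horizon: keep head

ctrl = RecedingHorizonController()
actions = ctrl.step(np.random.randn(45))       # toy proprioceptive state
print(actions.shape)                           # (4, 12) joint targets
```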
Despite the widespread application of knowledge graphs (KGs) in various tasks such as question answering and intelligent conversational systems, existing KGs face two major challenges: information granularity and deficiency in timeliness. These considerably hinder the retrieval and analysis of in-context, fine-grained, and up-to-date knowledge from KGs, particularly in highly specialized themes (e.g., specialized scientific research) and rapidly evolving contexts (e.g., breaking news or disaster tracking). To tackle such challenges, we propose a theme-specific knowledge graph (i.e., ThemeKG), a KG constructed from a theme-specific corpus, and design an unsupervised framework for ThemeKG construction (named TKGCon). The framework takes a raw theme-specific corpus and generates a high-quality KG that includes salient entities and relations under the theme. Specifically, we start with an entity ontology of the theme from Wikipedia, based on which we then generate candidate relations with Large Language Models (LLMs) to construct a relation ontology. To parse the documents from the theme corpus, we first map the extracted entity pairs to the ontology and retrieve the candidate relations. Finally, we incorporate the context and ontology to consolidate the relations for entity pairs. We observe that directly prompting GPT-4 for a theme-specific KG leads to inaccurate entities (such as "two main types" as one entity in the query result) and unclear (such as "is", "has") or wrong relations (such as "have due to", "to start"). In contrast, by constructing the theme-specific KG step by step, our model outperforms GPT-4 and can consistently identify accurate entities and relations. Experimental results also show that our framework excels in evaluations compared with various KG construction baselines.
https://arxiv.org/abs/2404.19146
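A data-structure-level sketch of the step-by-step construction: entity categories from the ontology, candidate relations proposed per category pair (by an LLM in the paper, mocked here), then context-based consolidation. The battery-themed entries are invented examples, not from the paper.

```python
# Entity ontology (from Wikipedia categories in the paper) and a relation
# ontology of LLM-proposed candidates per category pair, both mocked.
entity_ontology = {"lithium": "Material", "anode": "Component"}
relation_ontology = {("Material", "Component"): ["is used in", "degrades"]}

def parse(head, tail, context):
    """Map an extracted entity pair to its ontology categories, retrieve
    the candidate relations, and consolidate with the sentence context."""
    key = (entity_ontology[head], entity_ontology[tail])
    candidates = relation_ontology.get(key, [])
    # Stand-in for LLM consolidation: pick the candidate echoed in context.
    chosen = next((r for r in candidates if r.split()[0] in context), None)
    return (head, chosen, tail)

print(parse("lithium", "anode",
            "lithium is widely used in anode production"))
```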
We propose a system for visual scene analysis and recognition based on encoding the sparse, latent feature-representation of an image into a high-dimensional vector that is subsequently factorized to parse scene content. The sparse feature representation is learned from image statistics via convolutional sparse coding, while scene parsing is performed by a resonator network. The integration of sparse coding with the resonator network increases the capacity of distributed representations and reduces collisions in the combinatorial search space during factorization. We find that for this problem the resonator network is capable of fast and accurate vector factorization, and we develop a confidence-based metric that assists in tracking the convergence of the resonator network.
https://arxiv.org/abs/2404.19126
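For concreteness, here is a small numpy resonator network factorizing a bipolar vector into one entry from each of two codebooks, with a simple confidence-style score for tracking convergence. The dimensions and the score are illustrative; in the paper the factored vector encodes a sparse feature representation learned by convolutional sparse coding.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 1024, 8                        # vector dimension, codebook size

# Two codebooks of random bipolar vectors; the composite vector s binds
# one entry from each by elementwise (Hadamard) multiplication.
X = rng.choice([-1, 1], size=(M, D))
Y = rng.choice([-1, 1], size=(M, D))
s = X[3] * Y[5]

y_hat = Y.sum(axis=0)                 # start from a superposition of all
for _ in range(10):
    # Unbind with the current estimate of the other factor, then clean up
    # by projecting back onto the codebook: the resonator update.
    x_hat = np.sign(X.T @ (X @ (s * y_hat)))
    y_hat = np.sign(Y.T @ (Y @ (s * x_hat)))

confidence = (X @ x_hat).max() / D    # ~1.0 once an entry is locked in
print("decoded factors:", (X @ x_hat).argmax(), (Y @ y_hat).argmax(),
      f"confidence: {confidence:.2f}")
```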