Cell tracking remains a pivotal yet challenging task in biomedical research. The full potential of deep learning for this purpose is often untapped due to the limited availability of comprehensive and varied training datasets. In this paper, we present SynCellFactory, a generative approach to cell video augmentation. At the heart of SynCellFactory lies the ControlNet architecture, fine-tuned to synthesize cell imagery with photorealistic accuracy in style and motion patterns. This technique enables the creation of synthetic yet realistic cell videos that mirror the complexity of authentic microscopy time-lapses. Our experiments demonstrate that SynCellFactory boosts the performance of well-established deep learning models for cell tracking, particularly when original training data is sparse.
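As a rough illustration of the kind of pipeline SynCellFactory builds on, the sketch below drives a ControlNet-conditioned Stable Diffusion model through the Hugging Face diffusers library. The checkpoint path, prompt, and conditioning image are placeholders, not the authors' released assets.

```python
# Hypothetical sketch: sampling a synthetic cell frame from a fine-tuned
# ControlNet via Hugging Face diffusers. The checkpoint path and prompt are
# placeholders, not the authors' released weights.
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "path/to/cell-controlnet",           # hypothetical fine-tuned checkpoint
    torch_dtype=torch.float16,
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",    # base model the ControlNet attaches to
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# One conditioning image per time step (e.g. rendered cell positions for
# frame t) steers generation so consecutive frames follow a motion pattern.
cond = Image.fromarray(np.zeros((512, 512, 3), dtype=np.uint8))
frame = pipe("phase-contrast microscopy of migrating cells",
             image=cond, num_inference_steps=30).images[0]
```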
https://arxiv.org/abs/2404.16421
Large Vision-Language Models (LVLMs) have made significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving 30 LVLMs, such as the proprietary GPT-4V and GeminiProVision and the open-source InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
https://arxiv.org/abs/2404.16006
Patellofemoral joint (PFJ) issues affect one in four people, with 20% experiencing chronic knee pain despite treatment. Poor outcomes and pain after knee replacement surgery are often linked to patellar mal-tracking. Traditional imaging methods like CT and MRI face challenges, including cost and metal artefacts, and there is currently no ideal way to observe joint motion free of issues such as soft tissue artefacts or radiation exposure. A new system to monitor joint motion could significantly improve understanding of PFJ dynamics, aiding better patient care and outcomes. Combining 2D ultrasound with motion tracking, using semantic segmentation and position registration for 3D reconstruction of the joint, is one possible solution. However, the need for expensive external infrastructure to estimate the scanner's trajectory remains the main barrier to implementing 3D bone reconstruction from handheld ultrasound scanning clinically. We propose Visual-Inertial Odometry (VIO) and a deep learning-based inertial-only odometry method as alternatives to motion capture for tracking a handheld ultrasound scanner. The 3D reconstructions generated by these methods demonstrate potential for assessing the PFJ and for further measurements from free-hand ultrasound scans. The results show that the VIO method performs as well as the motion capture method, with average reconstruction errors of 1.25 mm and 1.21 mm, respectively. The VIO method is the first infrastructure-free method for 3D reconstruction of bone from wireless handheld ultrasound scanning with an accuracy comparable to methods that require external infrastructure.
https://arxiv.org/abs/2404.15847
This paper introduces FlowMap, an end-to-end differentiable method that solves for precise camera poses, camera intrinsics, and per-frame dense depth of a video sequence. Our method performs per-video gradient-descent minimization of a simple least-squares objective that compares the optical flow induced by depth, intrinsics, and poses against correspondences obtained via off-the-shelf optical flow and point tracking. Alongside the use of point tracks to encourage long-term geometric consistency, we introduce differentiable re-parameterizations of depth, intrinsics, and pose that are amenable to first-order optimization. We empirically show that camera parameters and dense depth recovered by our method enable photo-realistic novel view synthesis on 360-degree trajectories using Gaussian Splatting. Our method not only far outperforms prior gradient-descent based bundle adjustment methods, but surprisingly performs on par with COLMAP, the state-of-the-art SfM method, on the downstream task of 360-degree novel view synthesis (even though our method is purely gradient-descent based, fully differentiable, and presents a complete departure from conventional SfM).
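The objective lends itself to a compact sketch. Below is a minimal NumPy version of a reprojection-induced flow field and the least-squares loss against observed correspondences, assuming a single pinhole camera and one image pair; the paper's differentiable re-parameterizations and point-track terms are omitted.

```python
import numpy as np

def induced_flow(depth, K, R, t):
    """Flow implied by per-pixel depth and a relative pose (R, t).

    depth: (H, W) depth of frame 1; K: 3x3 intrinsics;
    R, t: pose of frame 2 relative to frame 1.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x HW
    rays = np.linalg.inv(K) @ pix                       # back-project to rays
    pts = rays * depth.reshape(1, -1)                   # 3D points in frame 1
    proj = K @ (R @ pts + t[:, None])                   # reproject into frame 2
    proj = proj[:2] / proj[2:]                          # perspective divide
    flow = proj - pix[:2]                               # per-pixel displacement
    return flow.T.reshape(H, W, 2)

def flow_loss(depth, K, R, t, observed_flow):
    """Least-squares discrepancy against off-the-shelf optical flow."""
    return np.mean((induced_flow(depth, K, R, t) - observed_flow) ** 2)

# Toy check: identity pose and zero observed flow give zero loss.
H, W = 4, 5
depth = np.ones((H, W))
K = np.array([[50., 0, W / 2], [0, 50., H / 2], [0, 0, 1]])
print(flow_loss(depth, K, np.eye(3), np.zeros(3), np.zeros((H, W, 2))))  # 0.0
```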
https://arxiv.org/abs/2404.15259
Eye-tracking technology is integral to numerous consumer electronics applications, particularly in the realm of virtual and augmented reality (VR/AR). These applications demand solutions that excel in three crucial aspects: low latency, low power consumption, and precision. Yet achieving optimal performance across all these fronts is a formidable challenge, necessitating a balance between sophisticated algorithms and efficient backend hardware implementations. In this study, we tackle this challenge through a synergistic software/hardware co-design of the system with an event camera. Leveraging the inherent sparsity of event-based input data, we integrate a novel sparse FPGA dataflow accelerator customized for submanifold sparse convolutional neural networks (SCNNs). The SCNN implemented on the accelerator efficiently extracts an embedding feature vector from each event-slice representation by processing only the non-zero activations. Subsequently, these vectors undergo further processing by a gated recurrent unit (GRU) and a fully connected layer on the host CPU to generate the eye centers. Deployment and evaluation of our system reveal outstanding performance metrics. On the Event-based Eye-Tracking-AIS2024 dataset, our system achieves 81% p5 accuracy, 99.5% p10 accuracy, and a mean Euclidean distance of 3.71 with 0.7 ms latency, while consuming only 2.29 mJ per inference. Notably, our solution opens up opportunities for future eye-tracking systems. Code is available at this https URL.
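The key trick, computing outputs only at active event sites so sparsity never dilates, can be shown in a few lines. The toy single-channel sketch below is illustrative only; practical SCNNs use dedicated libraries and the paper's FPGA dataflow.

```python
import numpy as np

def submanifold_sparse_conv2d(coords, feats, weight):
    """Convolve only at active sites, keeping the sparsity pattern fixed.

    coords: (N, 2) int coordinates of active pixels; feats: (N,) their values;
    weight: (k, k) kernel. Outputs are computed only at the input's active
    sites (the submanifold rule), so sparsity does not dilate layer by layer.
    """
    r = weight.shape[0] // 2
    lookup = {tuple(c): f for c, f in zip(map(tuple, coords), feats)}
    out = np.zeros(len(coords))
    for i, (y, x) in enumerate(coords):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                f = lookup.get((y + dy, x + dx))
                if f is not None:              # zero activations are skipped
                    out[i] += f * weight[dy + r, dx + r]
    return out

coords = np.array([[1, 1], [1, 2], [5, 7]])    # three active event pixels
feats = np.array([1.0, 2.0, 3.0])
print(submanifold_sparse_conv2d(coords, feats, np.ones((3, 3))))  # [3. 3. 3.]
```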
https://arxiv.org/abs/2404.14279
Purpose: The recent Segment Anything Model (SAM) has demonstrated impressive performance with point, text, or bounding-box prompts in various applications. However, in safety-critical surgical tasks, prompting is not possible due to (i) the lack of per-frame prompts for supervised learning, (ii) the impracticality of frame-by-frame prompting in real-time tracking applications, and (iii) the expense of annotating prompts for offline applications. Methods: We develop Surgical-DeSAM, which generates automatic bounding-box prompts for a decoupled SAM to obtain instrument segmentation in real-time robotic surgery. We utilise a commonly used detection architecture, DETR, and fine-tune it to obtain bounding-box prompts for the instruments. We then employ decoupled SAM (DeSAM) by replacing the image encoder with the DETR encoder and fine-tuning the prompt encoder and mask decoder to obtain instance segmentation of the surgical instruments. To improve detection performance, we adopt the Swin transformer for better feature representation. Results: The proposed method has been validated on two publicly available datasets from the MICCAI surgical instrument segmentation challenge, EndoVis 2017 and 2018. Our method is also compared with SOTA instrument segmentation methods and demonstrates significant improvements, with Dice scores of 89.62 and 90.70 on EndoVis 2017 and 2018, respectively. Conclusion: Our extensive experiments and validations demonstrate that Surgical-DeSAM enables real-time instrument segmentation without any additional prompting and outperforms other SOTA segmentation methods.
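The data flow of the decoupled pipeline can be summarized in a short sketch. The callables below are dummy stand-ins chosen only to show shapes and wiring, not the authors' implementation.

```python
# Illustrative wiring of the pipeline described above; the callables are
# dummy stand-ins chosen to show shapes and data flow, not the authors' code.
import torch

def surgical_desam_step(image, encoder, detector, prompt_encoder, mask_decoder):
    """One frame through the decoupled pipeline: no manual prompts needed."""
    feats = encoder(image)            # DETR encoder stands in for SAM's image encoder
    boxes = detector(feats)           # automatic bounding-box prompts
    prompts = prompt_encoder(boxes)   # boxes embedded as prompt tokens
    masks = mask_decoder(feats, prompts)
    return boxes, masks

image = torch.randn(1, 3, 256, 256)
encoder = lambda x: torch.randn(1, 256, 64, 64)         # feature map
detector = lambda f: torch.rand(1, 4, 4)                # 4 boxes (normalized xyxy)
prompt_encoder = lambda b: torch.randn(1, 4, 256)       # one embedding per box
mask_decoder = lambda f, p: torch.rand(1, 4, 256, 256)  # one mask per instrument
boxes, masks = surgical_desam_step(image, encoder, detector, prompt_encoder, mask_decoder)
print(boxes.shape, masks.shape)
```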
https://arxiv.org/abs/2404.14040
Visual object tracking and segmentation in omnidirectional videos are challenging due to the wide field-of-view and large spherical distortion brought by 360° images. To alleviate these problems, we introduce a novel representation, extended bounding field-of-view (eBFoV), for target localization and use it as the foundation of a general 360 tracking framework which is applicable for both omnidirectional visual object tracking and segmentation tasks. Building upon our previous work on omnidirectional visual object tracking (360VOT), we propose a comprehensive dataset and benchmark that incorporates a new component called omnidirectional video object segmentation (360VOS). The 360VOS dataset includes 290 sequences accompanied by dense pixel-wise masks and covers a broader range of target categories. To support both the development and evaluation of algorithms in this domain, we divide the dataset into a training subset with 170 sequences and a testing subset with 120 sequences. Furthermore, we tailor evaluation metrics for both omnidirectional tracking and segmentation to ensure rigorous assessment. Through extensive experiments, we benchmark state-of-the-art approaches and demonstrate the effectiveness of our proposed 360 tracking framework and training dataset. Homepage: this https URL
https://arxiv.org/abs/2404.13953
Multi-object tracking (MOT) is a critical and challenging task in computer vision, particularly in situations involving objects with similar appearances but diverse movements, as seen in team sports. Current methods, largely reliant on object detection and appearance, often fail to track targets accurately in such complex scenarios. This limitation is further exacerbated by the lack of comprehensive and diverse datasets covering the full view of sports pitches. Addressing these issues, we introduce TeamTrack, a pioneering benchmark dataset specifically designed for MOT in sports. TeamTrack is an extensive collection of full-pitch video data from various sports, including soccer, basketball, and handball. Furthermore, we perform a comprehensive analysis and benchmarking effort to underscore TeamTrack's utility and potential impact. Our work signifies a crucial step forward, promising to elevate the precision and effectiveness of MOT in complex, dynamic settings such as team sports. The dataset, project code, and competition are released at: this https URL.
https://arxiv.org/abs/2404.13868
Video-based eye trackers capture the iris biometric and enable authentication to secure user identity. However, biometric authentication is susceptible to spoofing another user's identity through physical or digital manipulation. The current standard for identifying physical spoofing attacks on eye-tracking sensors is liveness detection, which classifies gaze data as real or fake and is sufficient to detect physical presentation attacks. However, such defenses cannot detect a spoofing attack when real eye image inputs are digitally manipulated to swap in another person's iris pattern. We propose IrisSwap as a novel attack on gaze-based liveness detection. IrisSwap allows attackers to segment and digitally swap in a victim's iris pattern to fool iris authentication. Both offline and online attacks produce gaze data that deceives current state-of-the-art defense models at rates of up to 58%, motivating the development of more advanced authentication methods for eye trackers.
https://arxiv.org/abs/2404.13827
We address the challenging task of identifying, segmenting, and tracking hand-held objects, which is crucial for applications such as human action segmentation and performance evaluation. This task is particularly challenging due to heavy occlusion, rapid motion, and the transitory nature of objects being hand-held, where an object may be held, released, and subsequently picked up again. To tackle these challenges, we have developed a novel transformer-based architecture called HOIST-Former. HOIST-Former is adept at spatially and temporally segmenting hands and objects by iteratively pooling features from each other, ensuring that the processes of identification, segmentation, and tracking of hand-held objects depend on the hands' positions and their contextual appearance. We further refine HOIST-Former with a contact loss that focuses on areas where hands are in contact with objects. Moreover, we also contribute an in-the-wild video dataset called HOIST, which comprises 4,125 videos complete with bounding boxes, segmentation masks, and tracking IDs for hand-held objects. Through experiments on the HOIST dataset and two additional public datasets, we demonstrate the efficacy of HOIST-Former in segmenting and tracking hand-held objects.
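The abstract describes the contact loss only at a high level; one plausible reading, sketched here purely as an assumption, is a per-pixel mask loss up-weighted in hand-object contact regions.

```python
import torch
import torch.nn.functional as F

def contact_weighted_mask_loss(pred_logits, gt_mask, contact_mask, w=5.0):
    """BCE mask loss up-weighted where hands touch objects.

    pred_logits, gt_mask, contact_mask: (B, H, W); contact_mask marks pixels
    where hand and object masks meet. The weighting form is our assumption;
    the paper states only that the loss focuses on contact areas.
    """
    per_pixel = F.binary_cross_entropy_with_logits(
        pred_logits, gt_mask, reduction="none")
    weights = 1.0 + w * contact_mask
    return (weights * per_pixel).sum() / weights.sum()

pred = torch.randn(2, 64, 64)
gt = (torch.rand(2, 64, 64) > 0.5).float()
contact = (torch.rand(2, 64, 64) > 0.9).float()
print(contact_weighted_mask_loss(pred, gt, contact).item())
```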
https://arxiv.org/abs/2404.13819
This paper introduces a safe force/position tracking control strategy for Free-Floating Mobile Manipulator Systems (MMSs) engaging in compliant contact with planar surfaces. The strategy uniquely integrates a Control Barrier Function (CBF) to manage operational limitations and safety concerns. It effectively addresses safety-critical aspects at both the kinematic and dynamic levels, such as manipulator joint limits, system velocity constraints, and inherent system dynamic uncertainties. The proposed strategy remains robust to uncertainties in the MMS dynamic model, external disturbances, and variations in the contact stiffness model. Its low computational demand ensures easy implementation on onboard computing systems and supports real-time operation. Simulation results verify the strategy's efficacy, reflecting enhanced system performance and safety.
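The CBF mechanism itself is compact. The sketch below filters a nominal command through a barrier-constrained quadratic program for a 1D single integrator with a position limit, using cvxpy; the paper's controller handles full MMS kinematics, dynamics, and uncertainty on top of this idea.

```python
# Minimal control-barrier-function QP: track a nominal command while enforcing
# a position limit x <= x_max for a 1D single integrator (x_dot = u). This
# illustrates only the CBF mechanism, not the paper's full MMS controller.
import cvxpy as cp

def cbf_filter(x, u_nom, x_max=1.0, alpha=5.0):
    # Barrier h(x) = x_max - x >= 0; safety condition: h_dot + alpha * h >= 0.
    # For x_dot = u this reads: -u + alpha * (x_max - x) >= 0.
    u = cp.Variable()
    objective = cp.Minimize(cp.square(u - u_nom))  # stay close to nominal input
    constraints = [-u + alpha * (x_max - x) >= 0]
    cp.Problem(objective, constraints).solve()
    return u.value

print(cbf_filter(x=0.5, u_nom=0.2))   # far from limit: nominal passes through
print(cbf_filter(x=0.99, u_nom=5.0))  # near limit: command clipped to ~0.05
```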
https://arxiv.org/abs/2404.13626
We introduce EC-SLAM, a real-time dense RGB-D simultaneous localization and mapping (SLAM) system utilizing Neural Radiance Fields (NeRF). Although recent NeRF-based SLAM systems have demonstrated encouraging outcomes, they have yet to fully exploit NeRF's capability to constrain pose optimization. By employing an effectively constrained global bundle adjustment (BA) strategy, our system makes use of NeRF's implicit loop closure correction capability, improving tracking accuracy by reinforcing the constraints on the keyframes most pertinent to the optimized current frame. In addition, by implementing a feature-based and uniform sampling strategy that minimizes the number of ineffective constraint points for pose optimization, we mitigate the effects of random sampling in NeRF. EC-SLAM represents the map with sparse parametric encodings and a truncated signed distance field (TSDF) to facilitate efficient fusion, resulting in fewer model parameters and faster convergence. A comprehensive evaluation on the Replica, ScanNet, and TUM datasets showcases cutting-edge performance, including enhanced reconstruction accuracy from precise pose estimation, a 21 Hz run time, and tracking precision improvements of up to 50%. The source code is available at this https URL.
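For readers unfamiliar with the map representation, the sketch below computes the truncated signed distance along one camera ray, the quantity a TSDF stores; full fusion across frames and the sparse parametric encoding are beyond this toy.

```python
import numpy as np

def tsdf_along_ray(sample_depths, surface_depth, trunc=0.1):
    """Truncated signed distance for samples along one camera ray.

    Positive in free space in front of the surface, negative behind it,
    clamped to [-trunc, trunc] so only a band around the surface is modeled.
    """
    sdf = surface_depth - sample_depths
    return np.clip(sdf, -trunc, trunc) / trunc  # normalized to [-1, 1]

samples = np.linspace(0.0, 2.0, 9)
print(tsdf_along_ray(samples, surface_depth=1.0))
# +1 far in front, 0 at the surface crossing, -1 past the truncation band
```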
https://arxiv.org/abs/2404.13346
Parkinson's disease ranks as the second most prevalent neurodegenerative disorder globally. This research aims to develop a system leveraging Mixed Reality capabilities for tracking and assessing eye movements. In this paper, we present a medical scenario and outline the development of an application designed to capture eye-tracking signals through Mixed Reality technology for the evaluation of neurodegenerative diseases. Additionally, we introduce a pipeline for extracting clinically relevant features from eye-gaze analysis, describing the capabilities of the proposed system from a medical perspective. The study involved a cohort of healthy control individuals and patients suffering from Parkinson's disease, showcasing the feasibility and potential of the proposed technology for non-intrusive monitoring of eye movement patterns for the diagnosis of neurodegenerative diseases. Clinical relevance - Developing a non-invasive biomarker for Parkinson's disease is urgently needed to accurately detect the disease's onset. This would allow for the timely introduction of neuroprotective treatment at the earliest stage and enable the continuous monitoring of intervention outcomes. The ability to detect subtle changes in eye movements allows for early diagnosis, offering a critical window for intervention before more pronounced symptoms emerge. Eye tracking provides objective and quantifiable biomarkers, ensuring reliable assessments of disease progression and cognitive function. The eye gaze analysis using Mixed Reality glasses is wireless, facilitating convenient assessments in both home and hospital settings. The approach offers the advantage of utilizing hardware that requires no additional specialized attachments, enabling examinations through personal eyewear.
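A typical clinically relevant feature extractor of the kind such a pipeline could include is velocity-threshold saccade detection (I-VT). The sketch below is a generic example under an assumed sampling rate and threshold, not the paper's exact feature set.

```python
import numpy as np

def saccade_features(gaze_deg, fs=120.0, vel_thresh=30.0):
    """Velocity-threshold (I-VT) saccade detection on a gaze trace.

    gaze_deg: (N, 2) horizontal/vertical gaze angles in degrees; fs: sampling
    rate in Hz; vel_thresh: deg/s, a common default (both are assumptions).
    Returns saccade count and peak velocity, two features often compared
    between Parkinson's patients and healthy controls.
    """
    vel = np.linalg.norm(np.diff(gaze_deg, axis=0), axis=1) * fs  # deg/s
    is_sacc = vel > vel_thresh
    # Count rising edges = number of distinct saccades.
    n_saccades = int(np.sum(is_sacc[1:] & ~is_sacc[:-1]) + is_sacc[0])
    return {"n_saccades": n_saccades,
            "peak_velocity": float(vel.max()) if len(vel) else 0.0}

gaze = np.zeros((120, 2))
gaze[60:, 0] = 10.0            # one 10-degree rightward jump mid-trace
print(saccade_features(gaze))  # {'n_saccades': 1, 'peak_velocity': 1200.0}
```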
https://arxiv.org/abs/2404.12984
With the current demand for automation in the agro-food industry, accurately detecting and localizing relevant objects in 3D is essential for successful robotic operations. However, this is challenging due to the presence of occlusions. Multi-view perception approaches allow robots to overcome occlusions, but a tracking component is needed to associate the objects detected by the robot across multiple viewpoints. Multi-object tracking (MOT) algorithms can be categorized into two-stage and single-stage methods. Two-stage methods tend to be simpler to adapt and implement for custom applications, while single-stage methods offer a more complex end-to-end approach that can yield better results under occlusion at the cost of more training data. The potential advantage of single-stage over two-stage methods depends on the complexity of the sequence of viewpoints that a robot needs to process. In this work, we compare a 3D two-stage MOT algorithm, 3D-SORT, against a 3D single-stage MOT algorithm, MOT-DETR, on three types of sequences with varying levels of complexity. The sequences represent simpler and more complex motions that a robot arm can perform in a tomato greenhouse. Our experiments in a tomato greenhouse show that the single-stage algorithm consistently yields better tracking accuracy, especially in the more challenging sequences where objects are fully occluded or not visible across several viewpoints.
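The core of a two-stage method is the association step: assign new 3D detections to existing tracks, e.g. by Hungarian matching on centroid distances as in SORT-style trackers. A minimal sketch:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centroids, det_centroids, max_dist=0.05):
    """Match existing tracks to new 3D detections (SORT-style second stage).

    Returns (matches, unmatched_tracks, unmatched_detections); a pair is kept
    only if centroids lie within max_dist metres of each other.
    """
    cost = np.linalg.norm(
        track_centroids[:, None, :] - det_centroids[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)          # Hungarian assignment
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
    matched_r = {r for r, _ in matches}
    matched_c = {c for _, c in matches}
    return (matches,
            [r for r in range(len(track_centroids)) if r not in matched_r],
            [c for c in range(len(det_centroids)) if c not in matched_c])

tracks = np.array([[0.0, 0.0, 1.0], [0.2, 0.0, 1.1]])
dets = np.array([[0.01, 0.0, 1.0], [0.9, 0.4, 1.0]])
print(associate(tracks, dets))  # ([(0, 0)], [1], [1])
```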
https://arxiv.org/abs/2404.12963
We present RetailOpt, a novel opt-in, easy-to-deploy system for tracking customer movements in indoor retail environments. The system utilizes information presently accessible to customers through smartphones and retail apps: motion data, store map, and purchase records. The approach eliminates the need for additional hardware installations/maintenance and ensures customers maintain full control of their data. Specifically, RetailOpt first employs inertial navigation to recover relative trajectories from smartphone motion data. The store map and purchase records are then cross-referenced to identify a list of visited shelves, providing anchors to localize the relative trajectories in a store through continuous and discrete optimization. We demonstrate the effectiveness of our system through systematic experiments in five diverse environments. The proposed system, if successful, would produce accurate customer movement data, essential for a broad range of retail applications, including customer behavior analysis and in-store navigation. The potential application could also extend to other domains such as entertainment and assistive technologies.
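The continuous part of anchoring a relative trajectory to visited shelves can be illustrated as fitting a 2D rigid transform to point correspondences (Kabsch/Procrustes); this is a simplification of the paper's joint continuous and discrete optimization.

```python
import numpy as np

def fit_rigid_2d(traj_pts, anchor_pts):
    """Least-squares rotation + translation mapping traj_pts onto anchor_pts
    (Kabsch/Procrustes). traj_pts, anchor_pts: (N, 2), already placed in
    correspondence, e.g. trajectory samples paired with visited shelves.
    """
    mu_t, mu_a = traj_pts.mean(0), anchor_pts.mean(0)
    H = (traj_pts - mu_t).T @ (anchor_pts - mu_a)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflection
    R = Vt.T @ np.diag([1.0, d]) @ U.T
    t = mu_a - R @ mu_t
    return R, t

# A rotated, shifted copy of a small path recovers the transform exactly.
traj = np.array([[0.0, 0], [1, 0], [1, 1]])
R_true = np.array([[0.0, -1], [1, 0]])           # 90-degree rotation
anchors = traj @ R_true.T + np.array([3.0, 2])
R, t = fit_rigid_2d(traj, anchors)
print(np.allclose(R, R_true), np.round(t, 3))    # True [3. 2.]
```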
https://arxiv.org/abs/2404.12548
With the development of remote sensing technology in recent decades, spaceborne sensors with sub-meter and meter spatial resolution (Worldview and PlanetScope) have achieved image quality sufficient to generate 3D geospatial data via a stereo matching pipeline. These achievements have significantly increased the accessibility of 3D data, necessitating methods to adapt these 3D geospatial data for analyzing human and natural environments. This dissertation explores several novel approaches based on stereo and multi-view satellite image-derived 3D geospatial data to address remote sensing application issues in built-up area modeling and natural environment monitoring, including 3D building model reconstruction, glacier dynamics tracking, and lake algae monitoring. Specifically, the dissertation introduces four novel approaches that address the spatial and temporal challenges of satellite-derived 3D data. The first study advances LoD-2 building modeling from satellite-derived orthophotos and DSMs with a novel model-driven workflow that generates rectangular 3D building geometry models. Second, to extend the building reconstruction framework to dense urban areas and non-rectangular buildings, we implement deep learning for unit-level segmentation and introduce a gradient-based circle reconstruction for circular buildings, developing a polygon-composition technique for advanced LoD-2 building reconstruction. Our third study utilizes high-spatiotemporal-resolution PlanetScope satellite imagery for glacier tracking at the 3D level in mid-latitude regions. Finally, we propose a term, the "Algal Behavior Function", to refine the quantification of chlorophyll-a concentrations from satellite imagery in water-quality monitoring, addressing algae fluctuations and timing discrepancies between satellite observations and field measurements, thus enhancing the precision of underwater algae volume estimates. Overall, this dissertation demonstrates the extensive potential of satellite photogrammetry in addressing urban and environmental challenges. It further showcases innovative analytical methodologies that enhance the applicability of stereo and multi-view very-high-resolution satellite-derived 3D data. (See full abstract in the document)
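The gradient-based circle reconstruction is not detailed in the abstract; as a generic stand-in for fitting a circular building footprint to boundary points, here is an algebraic least-squares (Kasa) circle fit.

```python
import numpy as np

def fit_circle(points):
    """Algebraic least-squares (Kasa) circle fit to 2D footprint points.

    Solves x^2 + y^2 + a*x + b*y + c = 0 linearly; a generic stand-in for
    fitting circular building footprints, not the dissertation's method.
    """
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    rhs = -(x**2 + y**2)
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    cx, cy = -a / 2, -b / 2
    r = np.sqrt(cx**2 + cy**2 - c)
    return (cx, cy), r

theta = np.linspace(0, 2 * np.pi, 50, endpoint=False)
pts = np.stack([10 + 3 * np.cos(theta), -4 + 3 * np.sin(theta)], axis=1)
noisy = pts + np.random.default_rng(0).normal(0, 0.05, pts.shape)
(cx, cy), r = fit_circle(noisy)
print(round(cx, 2), round(cy, 2), round(r, 2))  # ~10.0 -4.0 3.0
```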
https://arxiv.org/abs/2404.12487
Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze, especially when attempting to predict 3D information from 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an Inverse Rendering (IR) problem: we optimize over the latent space of pre-trained 3D object representations via a differentiable rendering pipeline and retrieve the latents that best represent object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Beyond offering an alternate take on tracking, our method enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both datasets are completely unseen by our method and require no fine-tuning. Videos and code are available at this https URL.
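The test-time loop reduces to gradient descent on an image loss through a differentiable renderer. The sketch below uses a toy differentiable "renderer" so the loop runs end to end; the paper optimizes latents of pre-trained 3D object representations instead.

```python
# Skeleton of the test-time inverse-rendering loop described above: descend an
# image loss over an object latent through a differentiable renderer. The
# tanh "renderer" is a stand-in so the loop runs end to end.
import torch

def render(latent):                        # stand-in differentiable renderer
    return torch.tanh(latent).repeat(4)    # pretend 8-"pixel" image

target = render(torch.tensor([0.8, -0.3])).detach()     # observed image

latent = torch.zeros(2, requires_grad=True)             # shape/appearance code
opt = torch.optim.Adam([latent], lr=0.05)
for step in range(400):
    loss = torch.mean((render(latent) - target) ** 2)   # photometric loss
    opt.zero_grad()
    loss.backward()
    opt.step()
print(latent.detach())  # converges toward [0.8, -0.3]
```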
https://arxiv.org/abs/2404.12359
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features. In this paper, we ask whether the fine-tuning performance of extremely simple, small-scale ViTs can also benefit from this pre-training paradigm, a question considerably less studied than the well-established methodology of designing lightweight architectures with sophisticated components. By carefully adapting various typical MIM pre-training methods to this lightweight regime and comparing them with contrastive learning (CL) pre-training on various downstream image classification and dense prediction tasks, we systematically observe that MIM and CL behave differently depending on the scale of the downstream fine-tuning data. Furthermore, we analyze the frozen features under linear-probing evaluation, as well as the layer representation similarities and attention maps across the obtained models; these clearly show that MIM pre-training learns inferior representations in the higher layers, leading to unsatisfactory fine-tuning performance on data-insufficient downstream tasks. This finding naturally guides the choice of appropriate distillation strategies during pre-training to remedy the deterioration. Extensive experiments on various vision tasks demonstrate the effectiveness of our observation-analysis-solution flow. In particular, our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K. It also enables SOTA performance on the ADE20K semantic segmentation task (42.8% mIoU) and the LaSOT visual tracking task (66.1% AUC) in the lightweight regime. The latter even surpasses all current SOTA lightweight CPU-realtime trackers.
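For concreteness, the MIM objective being adapted can be written in a few lines: mask a random subset of patch tokens, reconstruct them, and take the loss on masked positions only. The model is an identity stand-in in this generic sketch.

```python
import torch

def mim_loss(patches, mask_ratio=0.6):
    """Generic masked-image-modeling objective on patch tokens.

    patches: (B, N, D) patch embeddings (pixels or features). A random subset
    is masked; an encoder-decoder (identity stand-in here) must reconstruct
    it, and the loss is taken on masked positions only.
    """
    B, N, D = patches.shape
    mask = torch.rand(B, N) < mask_ratio              # True = masked
    corrupted = patches.clone()
    corrupted[mask] = 0.0                             # replace with mask token
    pred = corrupted                                  # stand-in for model(corrupted)
    return ((pred - patches)[mask] ** 2).mean()

x = torch.randn(2, 16, 8)
print(mim_loss(x))  # trivial model's loss = mean energy of masked patches
```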
https://arxiv.org/abs/2404.12210
Event-based eye tracking has shown great promise thanks to the high temporal resolution and low redundancy provided by the event camera. However, the diversity and abruptness of eye movement patterns, including blinking, fixation, saccades, and smooth pursuit, pose significant challenges for eye localization. To achieve a stable event-based eye-tracking system, this paper proposes a bidirectional long-term sequence modeling and time-varying state selection mechanism that fully utilizes contextual temporal information in response to the variability of eye movements. Specifically, we propose the MambaPupil network, which consists of a multi-layer convolutional encoder that extracts features from the event representations, a bidirectional Gated Recurrent Unit (GRU), and a Linear Time-Varying State Space Module (LTV-SSM) that selectively captures contextual correlations from the forward and backward temporal relationships. Furthermore, Bina-rep is utilized as a compact event representation, and a tailor-made data augmentation, Event-Cutout, is proposed to enhance the model's robustness by applying spatial random masking to the event image. Evaluation on the ThreeET-plus benchmark shows the superior performance of MambaPupil, which secured 1st place in the CVPR'2024 AIS Event-based Eye Tracking challenge.
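The abstract gives enough to sketch Event-Cutout: spatial random masking of the event representation. The rectangle sizing below is an assumption; the paper states only that spatial random masking is applied.

```python
import torch

def event_cutout(event_frame, max_frac=0.3, generator=None):
    """Spatial random masking of an event representation (Event-Cutout).

    Zeroes a random rectangle covering up to max_frac of each side, forcing
    the model to rely on surrounding context. Rectangle sizing is our
    assumption, not a parameter stated in the paper.
    """
    C, H, W = event_frame.shape
    g = generator
    h = int(torch.randint(1, max(2, int(H * max_frac)), (1,), generator=g))
    w = int(torch.randint(1, max(2, int(W * max_frac)), (1,), generator=g))
    y = int(torch.randint(0, H - h + 1, (1,), generator=g))
    x = int(torch.randint(0, W - w + 1, (1,), generator=g))
    out = event_frame.clone()
    out[:, y:y + h, x:x + w] = 0.0
    return out

frame = torch.ones(2, 60, 80)       # e.g. a 2-channel polarity histogram
aug = event_cutout(frame)
print((aug == 0).float().mean())    # fraction of pixels cut out
```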
https://arxiv.org/abs/2404.12083
The new trend in multi-object tracking is to track objects of interest using natural language. However, the scarcity of paired prompt-instance data hinders progress. To address this challenge, we propose a high-quality yet low-cost data generation method based on Unreal Engine 5 and construct a brand-new benchmark dataset, Refer-UE-City, which primarily includes scenes from intersection surveillance videos detailing the appearance and actions of people and vehicles. Specifically, it provides 14 videos with a total of 714 expressions and is comparable in scale to the Refer-KITTI dataset. Additionally, we propose a multi-level semantic-guided multi-object tracking framework, MLS-Track, in which the interaction between the model and the text is enhanced layer by layer through a Semantic Guidance Module (SGM) and a Semantic Correlation Branch (SCB). Extensive experiments on the Refer-UE-City and Refer-KITTI datasets demonstrate the effectiveness of our proposed framework, which achieves state-of-the-art performance. Code and datasets will be made available.
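Layer-by-layer model-text interaction is typically realized as cross-attention from visual tokens to prompt embeddings. The block below is one plausible generic form of such semantic guidance, not the paper's exact SGM.

```python
import torch
import torch.nn as nn

class SemanticGuidanceBlock(nn.Module):
    """Generic cross-attention from visual tokens to text tokens.

    One plausible form of the layer-wise model-text interaction described
    above (not the paper's exact SGM): visual features query the prompt
    embedding so that language steers which objects the tracker keeps.
    """
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, text):
        # visual: (B, N, D) image tokens; text: (B, T, D) prompt tokens
        guided, _ = self.attn(query=visual, key=text, value=text)
        return self.norm(visual + guided)   # residual semantic guidance

block = SemanticGuidanceBlock()
v, t = torch.randn(1, 100, 256), torch.randn(1, 12, 256)
print(block(v, t).shape)  # torch.Size([1, 100, 256])
```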
https://arxiv.org/abs/2404.12031