We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in similar scaling curves to those seen in language models, albeit with a different rate. More details at this https URL
https://arxiv.org/abs/2501.05453
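The pre-training objective above is plain next-token prediction applied to visual token sequences. A minimal sketch of the per-clip loss under teacher forcing, with a toy 4-token vocabulary and hand-set logits standing in for a transformer's outputs (all values here are invented for illustration):

```python
import math

def next_token_nll(logits_seq, tokens):
    """Average negative log-likelihood of predicting token t+1 from the
    logits emitted at position t (teacher forcing)."""
    total = 0.0
    for t in range(len(tokens) - 1):
        logits = logits_seq[t]
        m = max(logits)                                   # stable softmax
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[tokens[t + 1]] - log_z)
    return total / (len(tokens) - 1)

# toy clip of 3 visual tokens over a 4-token vocabulary;
# the logits at each position favour the true next token
tokens = [0, 2, 1]
logits_seq = [[0.0, 0.0, 3.0, 0.0],   # position 0 predicts token 2
              [0.0, 3.0, 0.0, 0.0]]   # position 1 predicts token 1
loss = next_token_nll(logits_seq, tokens)
```

Pre-training minimizes this quantity over the ~1 trillion-token corpus; here the loss is already low (~0.14 nats) because the toy logits point at the correct next tokens.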
Late gadolinium enhancement MRI (LGE MRI) is the gold standard for detecting myocardial scars after myocardial infarction (MI). LGE MRI requires the injection of a contrast agent, which carries potential side effects and increases scanning time and patient discomfort. To address these issues, we propose a novel framework that combines cardiac motion observed in cine MRI with image texture information to segment the myocardium and scar tissue in the left ventricle. Cardiac motion tracking can be formulated as a full cardiac image cycle registration problem, which can be solved via deep neural networks. Experimental results show that the proposed method achieves scar segmentation from non-contrast cine images with accuracy comparable to LGE MRI, demonstrating its potential as an alternative to contrast-enhanced techniques for scar detection.
https://arxiv.org/abs/2501.05241
The new era of large-scale data collection and analysis presents an opportunity for diagnosing and understanding the causes of health inequities. In this study, we describe a framework for systematically analyzing health disparities using causal inference. The framework is illustrated by investigating racial and ethnic disparities in intensive care unit (ICU) outcomes between majority and minority groups in Australia (Indigenous vs. Non-Indigenous) and the United States (African-American vs. White). We demonstrate that commonly used statistical measures for quantifying inequity are insufficient, and focus on attributing the observed disparity to the causal mechanisms that generate it. We find that minority patients are younger at admission, have worse chronic health, are more likely to be admitted for urgent and non-elective reasons, and have higher illness severity. At the same time, however, we find a protective direct effect of belonging to a minority group, with minority patients showing improved survival compared to their majority counterparts, with all other variables kept equal. We demonstrate that this protective effect is related to the increased probability of being admitted to ICU, with minority patients having an increased risk of ICU admission. We also find that minority patients, while showing improved survival, are more likely to be readmitted to ICU. Thus, due to worse access to primary health care, minority patients are more likely to end up in ICU for preventable conditions, causing a reduction in the mortality rates and creating an effect that appears to be protective. Since the baseline risk of ICU admission may serve as a proxy for lack of access to primary care, we developed the Indigenous Intensive Care Equity (IICE) Radar, a monitoring system for tracking the over-utilization of ICU resources by the Indigenous population of Australia across geographical areas.
https://arxiv.org/abs/2501.05197
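The attribution described above can be made concrete with a natural-effect decomposition. Below is a toy linear structural model (all coefficients invented for illustration, not fitted to the study's data) in which a harmful effect mediated through illness severity coexists with a protective direct effect of minority status:

```python
# Toy linear SCM: group G -> severity S -> mortality Y, plus a direct G -> Y path.
def severity(g):
    return 0.3 + 0.4 * g              # minority patients (g=1) arrive sicker

def mortality(g, s):
    return 0.25 + 0.5 * s - 0.10 * g  # protective direct effect of g=1

base = mortality(0, severity(0))
total = mortality(1, severity(1)) - base
# natural direct effect: switch group, hold severity at the majority level
nde = mortality(1, severity(0)) - base
# natural indirect effect: minority group, shift severity to its own level
nie = mortality(1, severity(1)) - mortality(1, severity(0))
assert abs(total - (nde + nie)) < 1e-12   # the decomposition is exact here
```

With these numbers the total disparity (+0.10) splits into a protective direct effect (-0.10) and a larger harmful mediated effect (+0.20), mirroring the pattern the study reports.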
Reinforcement learning has demonstrated immense success in modelling complex physics-driven systems, providing end-to-end trainable solutions that interact with a simulated or real environment and maximize a scalar reward signal. In this work, building upon previous work, we propose a multi-agent reinforcement learning approach with assignment constraints for reconstructing particle tracks in pixelated particle detectors. Our approach collaboratively optimizes a parametrized policy, which functions as a heuristic for a multidimensional assignment problem, by jointly minimizing the total amount of particle scattering over the reconstructed tracks in a readout frame. To satisfy the constraints guaranteeing a unique assignment of particle hits, we propose a safety layer that solves a linear assignment problem for every joint action. Further, to enforce cost margins, which increase the distance of the local policies' predictions to the decision boundaries of the optimizer mappings, we recommend an additional component in the blackbox gradient estimation that forces the policy toward solutions with lower total assignment costs. We empirically show the effectiveness of our approach, compared to multiple single- and multi-agent baselines, on simulated data generated for a particle detector developed for proton imaging. We further demonstrate the effectiveness of constraints with cost margins for both optimization and generalization, reflected in wider regions of high reconstruction performance as well as reduced predictive instabilities. Our results form the basis for further developments in RL-based tracking, offering both enhanced performance with constrained policies and greater flexibility in optimizing tracking algorithms through the option of individual and team rewards.
https://arxiv.org/abs/2501.05113
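The safety layer above enforces a unique hit-to-track assignment by solving a linear assignment problem over every joint action. For a toy frame, brute force over permutations is enough to show the projection (a real implementation would use the Hungarian algorithm, e.g. `scipy.optimize.linear_sum_assignment`):

```python
from itertools import permutations

def safety_layer(scores):
    """Project the agents' (possibly conflicting) preferences onto a unique
    assignment by maximizing the total score over all permutations."""
    n = len(scores)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        s = sum(scores[agent][hit] for agent, hit in enumerate(perm))
        if s > best:
            best, best_perm = s, perm
    return list(best_perm)

# both agents prefer hit 0; the layer resolves the conflict globally
scores = [[0.9, 0.5],
          [0.8, 0.1]]
assignment = safety_layer(scores)   # [1, 0]: agent 0 yields hit 0 to agent 1
```

Agent 0 alone would pick hit 0 (0.9), but the joint total 0.5 + 0.8 beats 0.9 + 0.1, so the layer returns the unique assignment with the higher team score.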
Oceanographers rely on visual analysis to interpret model simulations, identify events and phenomena, and track dynamic ocean processes. The ever-increasing resolution and complexity of ocean data, due to its dynamic nature and multivariate relationships, demand a scalable and adaptable visualization tool for interactive exploration. We introduce pyParaOcean, a scalable and interactive visualization system designed specifically for ocean data analysis. pyParaOcean offers specialized modules for common oceanographic analysis tasks, including eddy identification and salinity movement tracking. These modules seamlessly integrate with ParaView as filters, ensuring a user-friendly and easy-to-use system while leveraging the parallelization capabilities of ParaView and a plethora of inbuilt general-purpose visualization functionalities. The creation of an auxiliary dataset stored as a Cinema database helps address I/O and network bandwidth bottlenecks while supporting the generation of quick overview visualizations. We present a case study on the Bay of Bengal (BoB) to demonstrate the utility of the system and scaling studies to evaluate its efficiency.
https://arxiv.org/abs/2501.05009
Autonomous vessels potentially enhance safety and reliability of seaborne trade. To facilitate the development of autonomous vessels, high-fidelity simulations are required to model realistic interactions with other vessels. However, modeling realistic interactive maritime traffic is challenging due to the unstructured environment, coarsely specified traffic rules, and largely varying vessel types. Currently, there is no standard for simulating interactive maritime environments in order to rigorously benchmark autonomous vessel algorithms. In this paper, we introduce the first intelligent sailing model (ISM), which simulates rule-compliant vessels for navigation on the open sea. An ISM vessel reacts to other traffic participants according to maritime traffic rules while at the same time solving a motion planning task characterized by waypoints. In particular, the ISM monitors the applicable rules, generates rule-compliant waypoints accordingly, and utilizes a model predictive control for tracking the waypoints. We evaluate the ISM in two environments: interactive traffic with only ISM vessels and mixed traffic where some vessel trajectories are from recorded real-world maritime traffic data or handcrafted for criticality. Our results show that simulations with many ISM vessels of different vessel types are rule-compliant and scalable. We tested 4,049 critical traffic scenarios. For interactive traffic with ISM vessels, no collisions occurred while goal-reaching rates of about 97 percent were achieved. We believe that our ISM can serve as a standard for challenging and realistic maritime traffic simulation to accelerate autonomous vessel development.
https://arxiv.org/abs/2501.04988
This paper presents the design of a Proportional-Integral-Derivative (PID) controller with optimized parameters for a two-degree-of-freedom robotic arm. A genetic algorithm (GA) is proposed to optimize the controller parameters, addressing the difficulty of determining PID controller parameters for highly nonlinear systems like robotic arms with traditional methods. The GA-optimized PID controller significantly improves control accuracy and performance over traditional control methods. Simulation results demonstrate that the robotic arm system operates with high precision and stability. Additionally, the shortened trajectory tracking response time enhances the feasibility of applying this control algorithm in real-world scenarios. This research not only confirms the suitability of PID-GA for robotic arms and similar systems but also opens new avenues for applying this algorithm to real physical systems.
https://arxiv.org/abs/2501.04759
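The GA-based tuning can be sketched end to end on a toy plant. The first-order plant, population size, and mutation scale below are arbitrary stand-ins for illustration, not the paper's two-degree-of-freedom arm model:

```python
import random

def simulate(kp, ki, kd, setpoint=1.0, dt=0.02, steps=200):
    """Integrated squared tracking error of a discrete PID loop
    on the toy plant x' = -x + u."""
    x, integ, prev_err, cost = 0.0, 0.0, setpoint, 0.0
    for _ in range(steps):
        err = setpoint - x
        integ += err * dt
        deriv = (err - prev_err) / dt
        prev_err = err
        x += (-x + kp * err + ki * integ + kd * deriv) * dt
        if not -1e6 < x < 1e6:          # diverged: prohibitively bad fitness
            return float("inf")
        cost += err * err * dt
    return cost

def ga_tune(pop_size=20, gens=30, seed=0):
    """Elitist GA over (kp, ki, kd): keep the best quarter, refill the
    population with averaged, mutated parents."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0.0, 10.0) for _ in range(3)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda g: simulate(*g))
        elite = pop[: pop_size // 4]
        while len(elite) < pop_size:
            a, b = rng.sample(pop[: pop_size // 4], 2)
            child = [(x + y) / 2 + rng.gauss(0.0, 0.3) for x, y in zip(a, b)]
            elite.append([min(10.0, max(0.0, v)) for v in child])
        pop = elite
    return min(pop, key=lambda g: simulate(*g))

best = ga_tune()
```

A plain proportional controller (kp = 1, ki = kd = 0) leaves a steady-state error on this plant, so the evolved gains should beat it comfortably.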
Cylindrical manipulators are extensively used in industrial automation, especially in emerging technologies like 3D printing, which represents a significant future trend. However, controlling the trajectory of nonlinear models with system uncertainties remains a critical challenge, often leading to reduced accuracy and reliability. To address this, the study develops an Adaptive Sliding Mode Controller (ASMC) integrated with Neural Networks (NNs) to improve trajectory tracking for cylindrical manipulators. The ASMC leverages the robustness of sliding mode control and the adaptability of neural networks to handle uncertainties and dynamic variations effectively. Simulation results validate that the proposed ASMC-NN achieves high trajectory tracking accuracy, fast response time, and enhanced reliability, making it a promising solution for applications in 3D printing and beyond.
https://arxiv.org/abs/2501.04754
Video-based vehicle detection and counting play a critical role in managing transport infrastructure. Traditional image-based counting methods usually involve two main steps: initial detection and subsequent tracking, which are applied to all video frames, leading to a significant increase in computational complexity. To address this issue, this work presents an alternative and more efficient method for vehicle detection and counting. The proposed approach eliminates the need for a tracking step and focuses solely on detecting vehicles in key video frames, thereby increasing its efficiency. To achieve this, we developed a system that combines YOLO, for vehicle detection, with Visual Rhythm, a way to create time-spatial images that allows us to focus on frames that contain useful information. Additionally, this method can be used for counting in any application involving unidirectional moving targets to be detected and identified. Experimental analysis using real videos shows that the proposed method achieves mean counting accuracy around 99.15% over a set of videos, with a processing speed three times faster than tracking-based approaches.
https://arxiv.org/abs/2501.04534
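The Visual Rhythm step above collapses each frame's counting line into one row of a time-spatial image, so the detector only has to run on frames where something crosses the line. A toy sketch with the line further collapsed to a single mean intensity per frame (signal and threshold invented):

```python
def key_frames(rhythm, threshold=0.5):
    """Return the frames where the counting-line intensity rises above the
    threshold: one key frame per vehicle entering the line."""
    frames, above = [], False
    for t, v in enumerate(rhythm):
        if v > threshold and not above:
            frames.append(t)            # rising edge: run YOLO on this frame
        above = v > threshold
    return frames

# toy rhythm: two vehicles cross the counting line around frames 3 and 8
rhythm = [0.1, 0.1, 0.2, 0.9, 0.8, 0.2, 0.1, 0.1, 0.7, 0.9, 0.2]
count = len(key_frames(rhythm))         # 2 vehicles, detector ran on 2 frames
```

The detector fires on 2 of 11 frames instead of all of them, which is where the speed-up over tracking-based pipelines comes from.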
Despite widespread adoption of deep learning models to address a variety of computer vision tasks, planetary science has yet to see extensive utilization of such tools to address its unique problems. On Titan, the largest moon of Saturn, tracking seasonal trends and weather patterns of clouds provides crucial insights into one of the most complex climates in the Solar System, yet much of the available image data are still analyzed in a conventional way. In this work, we apply a Mask R-CNN trained via transfer learning to perform instance segmentation of clouds in Titan images acquired by the Cassini spacecraft - a previously unexplored approach to a big data problem in planetary science. We demonstrate that an automated technique can provide quantitative measures for clouds, such as areas and centroids, that may otherwise be prohibitively time-intensive to produce by human mapping. Furthermore, despite Titan specific challenges, our approach yields accuracy comparable to contemporary cloud identification studies on Earth and other worlds. We compare the efficiencies of human-driven versus algorithmic approaches, showing that transfer learning provides speed-ups that may open new horizons for data investigation for Titan. Moreover, we suggest that such approaches have broad potential for application to similar problems in planetary science where they are currently under-utilized. Future planned missions to the planets and remote sensing initiatives for the Earth promise to provide a deluge of image data in the coming years that will benefit strongly from leveraging machine learning approaches to perform the analysis.
https://arxiv.org/abs/2501.04459
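The per-cloud measures mentioned above (areas and centroids) follow directly from each predicted instance mask. A minimal sketch on a hand-made binary mask (the Mask R-CNN itself is not reproduced here):

```python
def mask_stats(mask):
    """Pixel area and (row, col) centroid of one binary instance mask."""
    pts = [(r, c) for r, row in enumerate(mask)
                  for c, v in enumerate(row) if v]
    area = len(pts)
    centroid = (sum(r for r, _ in pts) / area,
                sum(c for _, c in pts) / area)
    return area, centroid

# 3x4 toy mask for a single detected cloud
mask = [[0, 1, 1, 0],
        [0, 1, 1, 0],
        [0, 0, 0, 0]]
area, centroid = mask_stats(mask)       # area 4, centroid (0.5, 1.5)
```

Run per instance, this yields the quantitative cloud catalog that would otherwise require time-intensive human mapping.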
Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
https://arxiv.org/abs/2501.04336
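The topologically structured semantic graph above can be pictured as activity-zone nodes joined by layout-adjacency edges, which an LLM (or plain graph search) can then query for layout-aware reasoning. A minimal sketch with invented zone names:

```python
from collections import deque

# hypothetical clustered activity zones (nodes) and layout adjacency (edges)
graph = {
    "sink":    ["stove", "counter"],
    "stove":   ["sink"],
    "counter": ["sink", "table"],
    "table":   ["counter"],
}

def zone_hops(graph, src, dst):
    """Layout-aware reasoning: minimum number of zone transitions
    between two activity zones (BFS over the semantic graph)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # zones not connected in the observed layout

hops = zone_hops(graph, "stove", "table")   # stove -> sink -> counter -> table
```

Queries like "how far is the stove from the table?" become graph traversals rather than searches over raw frames, which is what keeps long videos inside the model's context window.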
Deep Anterior Lamellar Keratoplasty (DALK) is a partial-thickness corneal transplant procedure used to treat corneal stromal diseases. A crucial step in this procedure is the precise separation of the deep stroma from Descemet's membrane (DM) using the Big Bubble technique. To simplify the tasks of needle insertion and pneumo-dissection in this technique, we previously developed an Optical Coherence Tomography (OCT)-guided, eye-mountable robot that uses real-time tracking of corneal layers from M-mode OCT signals for control. However, signal noise and instability during manipulation of the OCT fiber sensor-integrated needle have hindered the performance of conventional deep-learning segmentation methods, resulting in rough and inaccurate detection of corneal layers. To address these challenges, we have developed a topology-based deep-learning segmentation method that integrates a topological loss function with a modified network architecture. This approach effectively reduces the effects of noise and improves segmentation speed, precision, and stability. Validation using in vivo, ex vivo, and hybrid rabbit eye datasets demonstrates that our method outperforms traditional loss-based techniques, providing fast, accurate, and robust segmentation of the epithelium and DM to guide surgery.
https://arxiv.org/abs/2501.04735
Gesture recognition is a perceptual user interface based on computer vision (CV) technology that allows a computer to interpret human motions as commands, letting users communicate with a computer hands-free and making the mouse and keyboard superfluous. Gesture recognition's main weakness is lighting conditions: because gesture control is based on computer vision, it relies heavily on cameras. These cameras are used to interpret gestures in 2D and 3D, so the extracted information can vary depending on the light source, and such systems cannot work in a dark environment. A simple night-vision camera can be used for motion capture, as such cameras also emit infrared light that is invisible to humans but clearly visible to a camera without an infrared filter; this largely overcomes the limitation of systems that cannot work in the dark. The video stream from the camera is fed into a Raspberry Pi running a Python program that uses the OpenCV module to detect, isolate, and track the path of a dynamic gesture; a machine learning algorithm then recognizes the pattern drawn and controls the GPIOs of the Raspberry Pi accordingly to perform activities.
https://arxiv.org/abs/2501.04002
Tracking and acquiring simultaneous optical images of randomly moving targets obscured by scattering media remains a challenging problem of importance to many applications that require precise object localization and identification. In this work we develop an end-to-end neuromorphic optical engineering and computational approach to demonstrate how to track and image normally invisible objects by combining an event detecting camera with a multistage neuromorphic deep learning strategy. Photons emerging from dense scattering media are detected by the event camera and converted to pixel-wise asynchronous spike trains - a first step in isolating object-specific information from the dominant uninformative background. Spiking data is fed into a deep spiking neural network (SNN) engine where object tracking and image reconstruction are performed by two separate yet interconnected modules running in parallel in discrete time steps over the event duration. Through benchtop experiments we demonstrate tracking and imaging randomly moving objects in dense turbid media as well as image reconstruction of spatially stationary but optically dynamic objects. Standardized character sets serve as representative proxies for geometrically complex objects, underscoring the method's generality. The results highlight the advantages of a fully neuromorphic approach in addressing a major imaging challenge with high computational efficiency and low power consumption.
https://arxiv.org/abs/2501.03874
In this paper, we present a novel synergistic framework for learning shape estimation and a shape-aware whole-body control policy for tendon-driven continuum robots. Our approach leverages the interaction between two Augmented Neural Ordinary Differential Equations (ANODEs) -- the Shape-NODE and Control-NODE -- to achieve continuous shape estimation and shape-aware control. The Shape-NODE integrates prior knowledge from Cosserat rod theory, allowing it to adapt and account for model mismatches, while the Control-NODE uses this shape information to optimize a whole-body control policy, trained in a Model Predictive Control (MPC) fashion. This unified framework effectively overcomes limitations of existing data-driven methods, such as poor shape awareness and challenges in capturing complex nonlinear dynamics. Extensive evaluations in both simulation and real-world environments demonstrate the framework's robust performance in shape estimation, trajectory tracking, and obstacle avoidance. The proposed method consistently outperforms state-of-the-art end-to-end, Neural-ODE, and Recurrent Neural Network (RNN) models, particularly in terms of tracking accuracy and generalization capabilities.
https://arxiv.org/abs/2501.03859
Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process, such as camera manipulation or content editing, remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using less than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.
https://arxiv.org/abs/2501.03847
The development of general robotic systems capable of manipulating in unstructured environments is a significant challenge. While Vision-Language Models (VLM) excel in high-level commonsense reasoning, they lack the fine-grained 3D spatial understanding required for precise manipulation tasks. Fine-tuning VLM on robotic datasets to create Vision-Language-Action Models (VLA) is a potential solution, but it is hindered by high data collection costs and generalization issues. To address these challenges, we propose a novel object-centric representation that bridges the gap between VLM's high-level reasoning and the low-level precision required for manipulation. Our key insight is that an object's canonical space, defined by its functional affordances, provides a structured and semantically meaningful way to describe interaction primitives, such as points and directions. These primitives act as a bridge, translating VLM's commonsense reasoning into actionable 3D spatial constraints. In this context, we introduce a dual closed-loop, open-vocabulary robotic manipulation system: one loop for high-level planning through primitive resampling, interaction rendering and VLM checking, and another for low-level execution via 6D pose tracking. This design ensures robust, real-time control without requiring VLM fine-tuning. Extensive experiments demonstrate strong zero-shot generalization across diverse robotic manipulation tasks, highlighting the potential of this approach for automating large-scale simulation data generation.
https://arxiv.org/abs/2501.03841
RGB-T tracking leverages the complementary strengths of RGB and thermal infrared (TIR) modalities to address challenging scenarios such as low illumination and adverse weather. However, existing methods often fail to effectively integrate temporal information and perform efficient cross-modal interactions, which constrain their adaptability to dynamic targets. In this paper, we propose BTMTrack, a novel framework for RGB-T tracking. The core of our approach lies in the dual-template backbone network and the Temporal-Modal Candidate Elimination (TMCE) strategy. The dual-template backbone effectively integrates temporal information, while the TMCE strategy focuses the model on target-relevant tokens by evaluating temporal and modal correlations, reducing computational overhead and avoiding irrelevant background noise. Building upon this foundation, we propose the Temporal Dual Template Bridging (TDTB) module, which facilitates precise cross-modal fusion through dynamically filtered tokens. This approach further strengthens the interaction between templates and the search region. Extensive experiments conducted on three benchmark datasets demonstrate the effectiveness of BTMTrack. Our method achieves state-of-the-art performance, with a 72.3% precision rate on the LasHeR test set and competitive results on RGBT210 and RGBT234 datasets.
https://arxiv.org/abs/2501.03616
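The TMCE strategy above scores candidate tokens by their temporal and modal correlation with the target and drops the rest before further computation. A toy sketch of that elimination (scores and token names invented; the real criterion operates on backbone features):

```python
def tmce(tokens, temporal_corr, modal_corr, keep):
    """Keep the `keep` tokens with the highest combined temporal + modal
    correlation to the target template; discard background tokens."""
    ranked = sorted(range(len(tokens)),
                    key=lambda i: temporal_corr[i] + modal_corr[i],
                    reverse=True)
    kept = sorted(ranked[:keep])            # preserve spatial order
    return [tokens[i] for i in kept]

tokens        = ["t0", "t1", "t2", "t3", "t4"]
temporal_corr = [0.9, 0.1, 0.8, 0.2, 0.1]
modal_corr    = [0.8, 0.2, 0.7, 0.1, 0.3]
kept = tmce(tokens, temporal_corr, modal_corr, keep=2)   # ["t0", "t2"]
```

Only the target-relevant tokens survive, which is where the reduced computational overhead and suppressed background noise come from.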
One of the pivotal challenges in a multi-robot system is how to attend to accuracy and efficiency while ensuring safety. Prior art either cannot strictly guarantee collision-free operation for an arbitrarily large number of robots, or yields considerably conservative results. Smoothness of the avoidance trajectory also needs further optimization. This paper proposes an acceleration-actuated simultaneous obstacle avoidance and trajectory tracking method for arbitrarily large teams of robots that provides a non-conservative collision avoidance strategy and approaches for deadlock avoidance. We propose two ways of deadlock resolution; one incorporates an auxiliary velocity vector into the error function of the trajectory tracking module, which is proven to have no influence on the global convergence of the tracking error. Furthermore, unlike traditional methods that address conflicts only after a deadlock occurs, our decision-making mechanism avoids near-zero velocities, which is much safer and more efficient in crowded environments. Extensive comparisons show that the proposed method is superior to existing studies when deployed in a large-scale robot system, with minimal invasiveness.
https://arxiv.org/abs/2501.03585
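The first deadlock-resolution idea above adds an auxiliary velocity to the tracking error so a robot sidesteps a neighbour instead of stalling at near-zero velocity. A 2D velocity-level sketch with invented gains (the paper's full acceleration-actuated controller and convergence proof are not reproduced):

```python
def command_velocity(pos, goal, others, k=1.0, r_act=1.0, v_aux=0.5):
    """Tracking term k*(goal - pos) plus a perpendicular auxiliary term that
    rotates the robot around nearby neighbours to break symmetric deadlocks."""
    vx = k * (goal[0] - pos[0])
    vy = k * (goal[1] - pos[1])
    for ox, oy in others:
        dx, dy = pos[0] - ox, pos[1] - oy
        if (dx * dx + dy * dy) ** 0.5 < r_act:
            vx += -v_aux * dy       # 90-degree rotation of the offset:
            vy += v_aux * dx        # pushes sideways, never exactly to zero
    return vx, vy

# a neighbour sits directly on the straight-line path to the goal
v_near = command_velocity((0.0, 0.0), (1.0, 0.0), [(0.5, 0.0)])  # gains a lateral component
v_free = command_velocity((0.0, 0.0), (1.0, 0.0), [(5.0, 5.0)])  # pure tracking: (1.0, 0.0)
```

With the neighbour on the path the command acquires a nonzero lateral component, so a symmetric head-on configuration never settles at zero velocity; far from neighbours the auxiliary term vanishes and pure tracking is recovered.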
In this paper, we propose ProTracker, a novel framework for robust and accurate long-term dense tracking of arbitrary points in videos. The key idea of our method is to incorporate probabilistic integration that refines multiple predictions from both optical flow and semantic features for robust short-term and long-term tracking. Specifically, we integrate optical flow estimations in a probabilistic manner, producing smooth and accurate trajectories by maximizing the likelihood of each prediction. To effectively re-localize challenging points that disappear and reappear due to occlusion, we further incorporate long-term feature correspondence into our flow predictions for continuous trajectory generation. Extensive experiments show that ProTracker achieves state-of-the-art performance among unsupervised and self-supervised approaches, and even outperforms supervised methods on several benchmarks. Our code and model will be publicly available upon publication.
https://arxiv.org/abs/2501.03220
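The probabilistic integration above can be illustrated with the simplest case: fusing independent Gaussian position estimates, one from optical flow and one from a semantic-feature match, by maximizing the joint likelihood, which reduces to inverse-variance weighting (all numbers invented):

```python
def fuse(predictions):
    """Maximum-likelihood fusion of independent Gaussian predictions,
    each given as (mean, variance): inverse-variance weighting."""
    weights = [1.0 / var for _, var in predictions]
    mean = sum(m * w for (m, _), w in zip(predictions, weights)) / sum(weights)
    return mean, 1.0 / sum(weights)

flow     = (10.0, 1.0)   # confident optical-flow estimate of the point's x
semantic = (14.0, 4.0)   # noisier long-term feature correspondence
mean, var = fuse([flow, semantic])   # pulled toward the confident estimate
```

The fused variance (0.8) is smaller than either input's, which is what smooths the trajectory: each extra prediction tightens the estimate rather than replacing it.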