Autoregressive sequence models, such as Transformer-based vision-language-action (VLA) policies, can be tremendously effective for capturing complex and generalizable robotic behaviors. However, such models require us to choose a tokenization of our continuous action signals, which determines how the discrete symbols predicted by the model map to continuous robot actions. We find that current approaches to robot action tokenization, based on simple per-dimension, per-timestep binning schemes, typically perform poorly when learning dexterous skills from high-frequency robot data. To address this challenge, we propose a new compression-based tokenization scheme for robot actions, based on the discrete cosine transform. Our tokenization approach, Frequency-space Action Sequence Tokenization (FAST), enables us to train autoregressive VLAs for highly dexterous and high-frequency tasks where standard discretization methods fail completely. Based on FAST, we release FAST+, a universal robot action tokenizer, trained on 1M real robot action trajectories. It can be used as a black-box tokenizer for a wide range of robot action sequences, with diverse action spaces and control frequencies. Finally, we show that, when combined with the pi0 VLA, our method can scale to training on 10k hours of robot data and match the performance of diffusion VLAs, while reducing training time by up to 5x.
https://arxiv.org/abs/2501.09747
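The core idea of a DCT-based action tokenizer (compress an action chunk along the time axis, then quantize the surviving low-frequency coefficients into integer tokens) can be sketched in a few lines. This is a schematic of the transform-and-quantize step only, with made-up hyperparameters (`keep`, `quant_step`); FAST's full pipeline includes additional steps not shown here.

```python
import numpy as np
from scipy.fftpack import dct, idct

def tokenize_chunk(actions, keep=8, quant_step=0.05):
    """Per-dimension DCT over the time axis, low-pass truncation, then
    uniform quantization of the surviving coefficients into integer tokens."""
    coeffs = dct(np.asarray(actions, float), axis=0, norm="ortho")
    return np.round(coeffs[:keep] / quant_step).astype(int)

def detokenize_chunk(tokens, horizon, quant_step=0.05):
    """Dequantize, zero-pad the high frequencies, and invert the DCT."""
    coeffs = np.zeros((horizon, tokens.shape[1]))
    coeffs[: tokens.shape[0]] = tokens * quant_step
    return idct(coeffs, axis=0, norm="ortho")

# A smooth 50-step, 2-DoF action chunk compresses to 8x2 integer tokens
# yet reconstructs closely -- the property per-timestep binning lacks.
t = np.linspace(0.0, 1.0, 50)
chunk = np.stack([np.sin(np.pi * t), np.cos(np.pi * t)], axis=1)
tokens = tokenize_chunk(chunk)
recon = detokenize_chunk(tokens, horizon=50)
print(tokens.shape, float(np.abs(recon - chunk).max()))
```

Because smooth high-frequency trajectories concentrate their energy in a few DCT coefficients, 100 continuous values collapse into 16 tokens here, instead of the 100 tokens a per-timestep binning scheme would emit.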
With the number of people with disabilities (PWD) increasing worldwide each year, the demand for mobility support to enable independent living and social integration is also growing. Wheelchairs commonly support the mobility of PWD in both indoor and outdoor environments. However, current powered wheelchairs (PWC) often fail to meet the needs of PWD, who may find it difficult to operate them. Furthermore, existing research on robotic wheelchairs typically focuses either on full autonomy or enhanced manual control, which can lead to reduced efficiency and user trust. To address these issues, this paper proposes a Robot Operating System (ROS)-based smart wheelchair, called CoNav Chair, that incorporates a shared control navigation algorithm and obstacle avoidance to support PWD while fostering efficiency and trust between the robot and the user. Our design consists of hardware and software components. Experimental results conducted in a typical indoor social environment demonstrate the performance and effectiveness of the smart wheelchair hardware and software design. This integrated design promotes trust and autonomy, which are crucial for the acceptance of assistive mobility technologies in the built environment.
https://arxiv.org/abs/2501.09680
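Shared control, blending the user's input with an autonomous planner rather than choosing full autonomy or pure manual control, can be illustrated with a simple linear-arbitration rule. The distance thresholds below are assumed values and the scheme is a generic one, not CoNav Chair's actual control law.

```python
import numpy as np

def shared_control(user_cmd, planner_cmd, min_obstacle_dist,
                   d_safe=0.5, d_free=2.0):
    """Blend user and planner velocity commands: the closer the nearest
    obstacle, the more authority shifts to the planner (generic linear
    arbitration; thresholds are illustrative assumptions)."""
    alpha = np.clip((min_obstacle_dist - d_safe) / (d_free - d_safe), 0.0, 1.0)
    return alpha * np.asarray(user_cmd) + (1 - alpha) * np.asarray(planner_cmd)

# Far from obstacles the user dominates; near them the planner takes over.
print(shared_control([1.0, 0.0], [0.2, 0.3], min_obstacle_dist=3.0))  # user cmd
print(shared_control([1.0, 0.0], [0.2, 0.3], min_obstacle_dist=0.4))  # planner cmd
```

Keeping the user in the loop at safe distances while guaranteeing avoidance near obstacles is one common way to balance the efficiency and trust goals the paper highlights.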
Autonomous docking remains one of the most challenging maneuvers in marine robotics, requiring precise control and robust perception in confined spaces. This paper presents a novel approach integrating Model Predictive Path Integral (MPPI) control with real-time LiDAR-based dock detection for autonomous surface vessel docking. Our framework uniquely combines probabilistic trajectory optimization with a multi-objective cost function that simultaneously considers docking precision, safety constraints, and motion efficiency. The MPPI controller generates optimal trajectories by intelligently sampling control sequences and evaluating their costs based on dynamic clearance requirements, orientation alignment, and target position objectives. We introduce an adaptive dock detection pipeline that processes LiDAR point clouds to extract critical geometric features, enabling real-time updates of docking parameters. The proposed method is extensively validated in a physics-based simulation environment that incorporates realistic sensor noise, vessel dynamics, and environmental constraints. Results demonstrate successful docking from various initial positions while maintaining safe clearances and smooth motion characteristics.
https://arxiv.org/abs/2501.09668
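A minimal MPPI step, sample noisy control sequences, roll out the dynamics, and average them weighted by exponentiated cost, might look as follows. The point-mass dynamics, quadratic cost, and sampling parameters here are toy stand-ins for the paper's vessel model and multi-objective docking cost.

```python
import numpy as np

def mppi_step(x0, dynamics, cost, horizon=20, samples=256,
              sigma=0.4, lam=1.0, rng=None):
    """One MPPI update: sample noisy control sequences, roll out the
    dynamics, and return the exponentially cost-weighted average sequence."""
    if rng is None:
        rng = np.random.default_rng(0)
    u_nom = np.zeros((horizon, 2))
    noise = rng.normal(0.0, sigma, size=(samples, horizon, 2))
    costs = np.empty(samples)
    for i in range(samples):
        x, c = np.array(x0, dtype=float), 0.0
        for t in range(horizon):
            x = dynamics(x, u_nom[t] + noise[i, t])
            c += cost(x)
        costs[i] = c
    w = np.exp(-(costs - costs.min()) / lam)   # softmin weights over rollouts
    w /= w.sum()
    return u_nom + np.tensordot(w, noise, axes=1)

# Toy 2-D point "vessel" steering toward a dock at (5, 0).
dock = np.array([5.0, 0.0])
dynamics = lambda x, u: x + 0.1 * u            # x_{t+1} = x_t + dt * u
cost = lambda x: float(np.sum((x - dock) ** 2))
u_opt = mppi_step(np.zeros(2), dynamics, cost)
print(u_opt.shape)                             # (20, 2)
```

In the paper's setting the cost would additionally encode clearance and orientation-alignment terms; MPPI accommodates them without any change to the sampling machinery, which is one of its main attractions for docking.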
Online motion planning is a challenging problem for intelligent robots moving in dense environments with dynamic obstacles, e.g., crowds. In this work, we propose a novel approach for optimal and safe online motion planning with minimal information about dynamic obstacles. Specifically, our approach requires only the current position of the obstacles and their maximum speed, but it does not need any information about their exact trajectories or dynamic model. The proposed methodology combines Monte Carlo Tree Search (MCTS), for online optimal planning via model simulations, with Velocity Obstacles (VO), for obstacle avoidance. We perform experiments in a cluttered simulated environment with walls, and up to 40 dynamic obstacles moving with random velocities and directions. With an ablation study, we show the key contribution of VO in scaling up the efficiency of MCTS, selecting the safest and most rewarding actions in the tree of simulations. Moreover, we show the superiority of our methodology with respect to state-of-the-art planners, including Non-linear Model Predictive Control (NMPC), in terms of improved collision rate, computational and task performance.
https://arxiv.org/abs/2501.09649
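The VO component prunes candidate velocities that would lead to collision under the obstacle's current velocity, exactly the minimal information the paper assumes. A basic feasibility test (assuming a single combined safety radius and a finite time horizon, both illustrative) is:

```python
import numpy as np

def violates_vo(v_robot, p_robot, p_obs, v_obs, radius=0.6, tau=5.0):
    """Velocity-Obstacle test: True if holding v_robot brings the robot
    within `radius` of the obstacle within the next `tau` seconds,
    assuming the obstacle keeps its current velocity."""
    p = np.asarray(p_obs, float) - np.asarray(p_robot, float)
    v = np.asarray(v_robot, float) - np.asarray(v_obs, float)
    vv = float(v @ v)
    # Time of closest approach along the relative motion, clamped to [0, tau].
    t_star = 0.0 if vv < 1e-9 else float(np.clip((p @ v) / vv, 0.0, tau))
    return float(np.linalg.norm(p - v * t_star)) < radius

# Heading straight at an obstacle 3 m away violates VO; veering off does not.
print(violates_vo([1.0, 0.0], [0, 0], [3, 0], [0, 0]))   # True
print(violates_vo([0.0, 1.0], [0, 0], [3, 0], [0, 0]))   # False
```

Inside MCTS, a check like this can mask out unsafe actions before node expansion, which is how VO shrinks the effective branching factor of the simulation tree.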
SLAM is a foundational technique with broad applications in robotics and AR/VR. SLAM simulations evaluate new concepts, but testing on resource-constrained devices, such as VR HMDs, faces challenges: high computational cost and restricted sensor data access. This work proposes a sparse framework using mesh geometry projections as features, which improves efficiency and circumvents direct sensor data access, advancing SLAM research as we demonstrate in VR and through numerical evaluation.
https://arxiv.org/abs/2501.09600
Group theory has been used in machine learning to provide a theoretically grounded approach for incorporating known symmetry transformations in tasks from robotics to protein modeling. In these applications, equivariant neural networks use known symmetry groups with predefined representations to learn over geometric input data. We propose MatrixNet, a neural network architecture that learns matrix representations of group element inputs instead of using predefined representations. MatrixNet achieves higher sample efficiency and generalization over several standard baselines in prediction tasks over several finite groups and the Artin braid group. We also show that MatrixNet respects group relations, allowing generalization to group elements of greater word length than those in the training set.
https://arxiv.org/abs/2501.09571
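The mechanism, mapping each generator of a group to a matrix and embedding a word as the product of its generators' matrices, can be illustrated with fixed permutation matrices for the symmetric group S3. MatrixNet learns such matrices rather than using predefined ones like these; the point of the sketch is only how a matrix embedding automatically respects group relations.

```python
import numpy as np

# Hand-picked "representations" of the two generators of S3, purely to
# illustrate the word-to-matrix-product mechanism MatrixNet learns.
GEN = {
    "s1": np.array([[0, 1, 0], [1, 0, 0], [0, 0, 1]], float),  # swap (1 2)
    "s2": np.array([[1, 0, 0], [0, 0, 1], [0, 1, 0]], float),  # swap (2 3)
}

def embed(word):
    """Matrix embedding of a group word, e.g. ["s1", "s2", "s1"]."""
    m = np.eye(3)
    for g in word:
        m = m @ GEN[g]
    return m

# The embedding respects group relations: s1 s1 = e and s1 s2 s1 = s2 s1 s2
# (the braid-like relation), so equal group elements get equal embeddings
# regardless of word length or spelling.
print(np.allclose(embed(["s1", "s1"]), np.eye(3)))
print(np.allclose(embed(["s1", "s2", "s1"]), embed(["s2", "s1", "s2"])))
```

This relation-preservation is exactly what lets a model of this form extrapolate to longer words than it was trained on: longer words reduce to the same matrix products.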
This article presents a comparative analysis of mobile robot trajectories computed by various ROS-based SLAM systems. To this end, we developed a prototype of a mobile robot with common sensors: a 2D lidar, a monocular camera, and a ZED stereo camera. We then conducted experiments in a typical office environment, collected data from all sensors, and ran all tested SLAM systems on the acquired dataset. We studied the following SLAM systems: (a) 2D lidar-based: GMapping, Hector SLAM, Cartographer; (b) monocular camera-based: Large Scale Direct monocular SLAM (LSD SLAM), ORB SLAM, Direct Sparse Odometry (DSO); and (c) stereo camera-based: ZEDfu, Real-Time Appearance-Based Mapping (RTAB map), ORB SLAM, Stereo Parallel Tracking and Mapping (S-PTAM). Since all SLAM methods were tested on the same dataset, we compared results for the different SLAM systems with appropriate metrics, demonstrating encouraging results for the lidar-based Cartographer SLAM, monocular ORB SLAM, and stereo RTAB Map methods.
https://arxiv.org/abs/2501.09490
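Comparing trajectories from different SLAM systems against ground truth typically comes down to error metrics such as the Absolute Trajectory Error. A simplified ATE, with translation-only alignment instead of the full Umeyama similarity fit usually applied in such benchmarks, can be computed as:

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute Trajectory Error (RMSE) after translation-only alignment:
    remove each trajectory's centroid, then take the RMSE of the
    per-pose differences (a simplified stand-in for Umeyama alignment)."""
    est, gt = np.asarray(est, float), np.asarray(gt, float)
    est_c = est - est.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum((est_c - gt_c) ** 2, axis=1))))

gt = np.array([[0, 0], [1, 0], [2, 0], [3, 0]])
est = gt + np.array([0.5, 0.0])   # a constant offset vanishes after alignment
print(ate_rmse(est, gt))          # 0.0
```

Because all systems in the study ran on one dataset, a single metric like this makes the lidar, monocular, and stereo pipelines directly comparable.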
How are robots becoming smarter at interacting with their surroundings? Recent advances have reshaped how robots use tactile sensing to perceive and engage with the world. Tactile sensing is a game-changer, allowing robots to embed sensorimotor control strategies to interact with complex environments and skillfully handle heterogeneous objects. Such control frameworks plan contact-driven motions while staying responsive to sudden changes. We review the latest methods for building perception and control systems in tactile robotics while offering practical guidelines for their design and implementation. We also address key challenges to shape the future of intelligent robots.
https://arxiv.org/abs/2501.09468
Stereo matching is a key technique for metric depth estimation in computer vision and robotics. Real-world challenges like occlusion and non-texture hinder accurate disparity estimation from binocular matching cues. Recently, monocular relative depth estimation has shown remarkable generalization using vision foundation models. Thus, to facilitate robust stereo matching with monocular depth cues, we incorporate a robust monocular relative depth model into the recurrent stereo-matching framework, building a new framework for depth foundation model-based stereo-matching, DEFOM-Stereo. In the feature extraction stage, we construct the combined context and matching feature encoder by integrating features from conventional CNNs and DEFOM. In the update stage, we use the depth predicted by DEFOM to initialize the recurrent disparity and introduce a scale update module to refine the disparity at the correct scale. DEFOM-Stereo is verified to have comparable performance on the Scene Flow dataset with state-of-the-art (SOTA) methods and notably shows much stronger zero-shot generalization. Moreover, DEFOM-Stereo achieves SOTA performance on the KITTI 2012, KITTI 2015, Middlebury, and ETH3D benchmarks, ranking 1st on many metrics. In the joint evaluation under the robust vision challenge, our model simultaneously outperforms previous models on the individual benchmarks. Both results demonstrate the outstanding capabilities of the proposed model.
https://arxiv.org/abs/2501.09466
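Because monocular foundation-model depth is only relative, using it to initialize a metric disparity field requires a scale correction. A closed-form least-squares version of such a scale fit (illustrative of the problem, not the paper's learned scale update module) is:

```python
import numpy as np

def fit_disparity_scale(mono_inv_depth, sparse_disp, mask):
    """Least-squares scale s minimizing ||s * d_mono - d_stereo||^2 over
    valid pixels -- the kind of correction needed before a scale-ambiguous
    mono prediction can seed a metric disparity estimate."""
    d = mono_inv_depth[mask]
    g = sparse_disp[mask]
    return float(d @ g) / float(d @ d)

mono = np.array([1.0, 2.0, 4.0])
stereo = np.array([2.0, 4.0, 8.0])    # metric disparity is 2x the relative cue
mask = np.ones(3, dtype=bool)
print(fit_disparity_scale(mono, stereo, mask))   # 2.0
```

Once the scale is resolved, the recurrent updates only need to refine residual disparity structure, which is why a good monocular prior helps zero-shot generalization so much.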
Industrial robotics demands significant energy to operate, making energy-reduction methodologies increasingly important. Strategies for planning minimum-energy trajectories typically involve solving nonlinear optimal control problems (OCPs), which rarely cope with real-time requirements. In this paper, we propose a paradigm for generating near minimum-energy trajectories for manipulators by learning from optimal solutions. Our paradigm leverages a residual learning approach, which embeds boundary conditions while focusing on learning only the adjustments needed to steer a standard solution to an optimal one. Compared to a computationally expensive OCP-based planner, our paradigm achieves 87.3% of the performance near the training dataset and 50.8% far from the dataset, while being two to three orders of magnitude faster.
https://arxiv.org/abs/2501.09450
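The residual idea, predicting only the adjustment on top of a standard solution that already satisfies the boundary conditions, can be sketched as follows. The `residual_net` lambda is a hypothetical stand-in for the trained model, and the endpoint-vanishing bump is one simple way (assumed here, not taken from the paper) to keep boundary conditions embedded.

```python
import numpy as np

def minimum_jerk(q0, qf, n=50):
    """Standard (not energy-optimal) point-to-point reference profile."""
    s = np.linspace(0.0, 1.0, n)
    return q0 + (qf - q0) * (10 * s**3 - 15 * s**4 + 6 * s**5)

def near_optimal_trajectory(q0, qf, residual_net, n=50):
    """Residual paradigm: the learned model outputs only the correction
    steering the standard profile toward the energy-optimal one; the
    s*(1-s) bump vanishes at both endpoints, preserving the boundary
    conditions by construction."""
    s = np.linspace(0.0, 1.0, n)
    return minimum_jerk(q0, qf, n) + s * (1 - s) * residual_net(s)

# Hypothetical stand-in for the trained residual network.
traj = near_optimal_trajectory(0.0, 1.0, lambda s: 0.3 * np.sin(np.pi * s))
print(traj[0], traj[-1])   # boundary conditions preserved: 0.0 1.0
```

Learning only this small correction, rather than the whole trajectory, is what lets the method stay close to OCP quality near the training distribution while degrading gracefully away from it.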
In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at this https URL.
https://arxiv.org/abs/2501.09327
As robotic technology rapidly develops, robots are being employed in an increasing number of fields. However, due to the complexity of deployment environments or the prevalence of ambiguous-condition objects, the practical application of robotics still faces many challenges, leading to frequent errors. Traditional methods and some LLM-based approaches, although improved, still require substantial human intervention and struggle with autonomous error correction in complex scenarios. In this work, we propose RoboReflect, a novel framework leveraging large vision-language models (LVLMs) to enable self-reflection and autonomous error correction in robotic grasping tasks. RoboReflect allows robots to automatically adjust their strategies based on unsuccessful attempts until successful execution is achieved. The corrected strategies are saved in a memory for future task execution. We evaluate RoboReflect through extensive testing on eight common objects prone to ambiguous conditions of three categories. The results demonstrate that RoboReflect not only outperforms existing grasp pose estimation methods like AnyGrasp and high-level action planning techniques using GPT-4V but also significantly enhances the robot's ability to adapt and correct errors independently. These findings underscore the critical importance of autonomous self-reflection in robotic systems while effectively addressing the challenges posed by ambiguous environments.
https://arxiv.org/abs/2501.09307
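The reflect-retry-memorize loop described above can be sketched schematically. `execute` and `reflect` below are toy stand-ins for the robot interface and the LVLM query; only the control flow (retry on failure, cache on success, reuse the cache next time) mirrors the described framework.

```python
def reflect_and_retry(initial_strategy, execute, reflect, memory, obj,
                      max_attempts=3):
    """Self-reflection loop (schematic): try a grasp strategy, on failure
    ask the LVLM for a revision, and cache whatever finally works for
    future tasks on the same object."""
    strategy = memory.get(obj, initial_strategy)    # reuse cached knowledge
    for _ in range(max_attempts):
        if execute(obj, strategy):
            memory[obj] = strategy                  # remember the success
            return strategy
        strategy = reflect(obj, strategy)           # LVLM proposes a fix
    return None

# Toy world: a transparent glass that only yields to a top-down grasp.
memory = {}
execute = lambda obj, s: s == "top-down"
reflect = lambda obj, s: "top-down"
print(reflect_and_retry("side-on", execute, reflect, memory, "glass"), memory)
```

The memory lookup at the top is what turns one-off corrections into reusable knowledge: the second encounter with the same ambiguous object skips the failed attempt entirely.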
Building autonomous mobile robots (AMRs) with optimized efficiency and adaptive capabilities, able to respond to changing task demands and dynamic environments, is a strongly desired goal for advancing construction robotics. Such robots can play a critical role in enabling automation, reducing operational carbon footprints, and supporting modular construction processes. Inspired by the adaptive autonomy of living organisms, we introduce interoception, which centers on the robot's internal state representation, as a foundation for developing self-reflection and conscious learning to enable continual learning and adaptability in robotic agents. In this paper, we factorize internal state variables and mathematical properties as "cognitive dissonance" in shared control paradigms, where human interventions occasionally occur. We offer a new perspective on how interoception can help build adaptive motion planning in AMRs by integrating the legacy of heuristic costs from grid/graph-based algorithms with recent advances in neuroscience and reinforcement learning. Declarative and procedural knowledge extracted from human semantic inputs is encoded into a hypergraph model that overlaps with the spatial configuration of onsite layout for path planning. In addition, we design a velocity-replay module using an encoder-decoder architecture with few-shot learning to enable robots to replicate velocity profiles in contextualized scenarios for multi-robot synchronization and handover collaboration. These "cached" knowledge representations are demonstrated in simulated environments for multi-robot motion planning and stacking tasks. The insights from this study pave the way toward artificial general intelligence in AMRs, fostering their progression from complexity to competence in construction automation.
https://arxiv.org/abs/2501.09290
Vision-based tactile sensors have drawn increasing interest in the robotics community. However, traditional lens-based designs impose minimum thickness constraints on these sensors, limiting their applicability in space-restricted settings. In this paper, we propose ThinTact, a novel lensless vision-based tactile sensor with a sensing field of over 200 mm² and a thickness of less than 10 mm. ThinTact utilizes the mask-based lensless imaging technique to map the contact information to CMOS signals. To ensure real-time tactile sensing, we propose a real-time lensless reconstruction algorithm that leverages a frequency-spatial-domain joint filter based on the discrete cosine transform (DCT). This algorithm achieves computation significantly faster than existing optimization-based methods. Additionally, to improve the sensing quality, we develop a mask optimization method based on the genetic algorithm and the corresponding system matrix calibration method. We evaluate the performance of our proposed lensless reconstruction and tactile sensing through qualitative and quantitative experiments. Furthermore, we demonstrate ThinTact's practical applicability in diverse applications, including texture recognition and contact-rich object manipulation. The paper will appear in the IEEE Transactions on Robotics: this https URL. Video: this https URL
https://arxiv.org/abs/2501.09273
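The speed advantage of a DCT-based reconstruction comes from replacing iterative optimization with closed-form transforms. A bare-bones DCT-domain low-pass step, schematic only, since the paper's joint filter also operates in the spatial domain and uses a calibrated system matrix, looks like:

```python
import numpy as np
from scipy.fftpack import dctn, idctn

def dct_lowpass_reconstruct(img, keep=8):
    """Frequency-domain half of a DCT-based reconstruction: keep only the
    low-frequency block of the 2-D DCT and invert. Two transforms and an
    elementwise mask -- no iterative solver in the loop."""
    coeffs = dctn(img, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:keep, :keep] = 1.0
    return idctn(coeffs * mask, norm="ortho")

# A smooth synthetic "contact" image survives aggressive truncation.
img = np.outer(np.linspace(0, 1, 32), np.linspace(0, 1, 32))
recon = dct_lowpass_reconstruct(img)
print(recon.shape)
```

Since smooth contact signals are low-frequency, truncating to an 8x8 coefficient block loses little, and the fixed transform cost is what makes real-time rates feasible on the sensor's compute budget.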
The construction industry has long explored robotics and computer vision, yet their deployment on construction sites remains very limited. These technologies have the potential to revolutionize traditional workflows by enhancing accuracy, efficiency, and safety in construction management. Ground robots equipped with advanced vision systems could automate tasks such as monitoring mechanical, electrical, and plumbing (MEP) systems. The present research evaluates the applicability of open-vocabulary vision-language models compared to fine-tuned, lightweight, closed-set object detectors for detecting MEP components using a mobile ground robotic platform. A dataset collected with cameras mounted on a ground robot was manually annotated and analyzed to compare model performance. The results demonstrate that, despite the versatility of vision-language models, fine-tuned lightweight models still largely outperform them in specialized environments and for domain-specific tasks.
https://arxiv.org/abs/2501.09267
Owing to recent advances in machine learning and the ability to harvest large amounts of data during robotic-assisted surgeries, surgical data science is ripe for foundational work. We present a large dataset of surgical videos and their accompanying labels for this purpose. We describe how the data was collected and some of its unique attributes. Multiple example problems are outlined. Although the dataset was curated for a particular set of scientific challenges (in an accompanying paper), it is general enough to be used for a broad range of machine learning questions. Our hope is that this dataset exposes the larger machine learning community to the challenging problems within surgical data science, and becomes a touchstone for future research. The videos are available at this https URL, the labels at this https URL, and a validation set for the tool detection problem at this https URL.
https://arxiv.org/abs/2501.09209
This paper presents a modular framework for motion planning using movement primitives. Central to the approach is Contraction Theory, a modular stability tool for nonlinear dynamical systems. The approach extends prior methods by achieving parallel and sequential combinations of both discrete and rhythmic movements, while enabling independent modulation of each movement. This modular framework enables a divide-and-conquer strategy to simplify the programming of complex robot motion planning. Simulation examples illustrate the flexibility and versatility of the framework, highlighting its potential to address diverse challenges in robot motion planning.
https://arxiv.org/abs/2501.09198
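A parallel combination of one discrete and one rhythmic primitive, each with its own independently tunable parameters, can be illustrated as a toy sum. This sketch omits the Contraction-Theory stability analysis that is the actual core of the framework; it only shows the kind of composed motion the framework produces.

```python
import numpy as np

def combined_primitive(t, goal=1.0, amp=0.2, omega=6.0, tau=0.5):
    """Parallel combination of a discrete primitive (exponential
    convergence to a goal) and a rhythmic primitive (steady oscillation).
    `goal`, `amp`, `omega`, and `tau` modulate each component
    independently; values here are arbitrary for illustration."""
    discrete = goal * (1.0 - np.exp(-t / tau))   # point-to-point component
    rhythmic = amp * np.sin(omega * t)           # oscillatory component
    return discrete + rhythmic

t = np.linspace(0.0, 5.0, 500)
y = combined_primitive(t)
print(y[0], y[-1])   # starts at 0, ends oscillating about the goal
```

The divide-and-conquer appeal is visible even in the toy: reaching and oscillation are programmed separately and merely summed, rather than designed as one monolithic trajectory.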
This work introduces a novel Retention Layer mechanism for Transformer-based architectures, addressing their inherent lack of intrinsic retention capabilities. Unlike human cognition, which can encode and dynamically recall symbolic templates, Generative Pretrained Transformers rely solely on fixed pretrained weights and ephemeral context windows, limiting their adaptability. The proposed Retention Layer incorporates a persistent memory module capable of real-time data population, dynamic recall, and guided output generation. This enhancement allows models to store, update, and reuse observed patterns across sessions, enabling incremental learning and bridging the gap between static pretraining and dynamic, context-sensitive adaptation. The Retention Layer design parallels social learning processes, encompassing attention, retention, reproduction, and motivation stages. Technically, it integrates a memory attention mechanism and episodic buffers to manage memory scalability, mitigate overfitting, and ensure efficient recall. Applications span adaptive personal assistants, real-time fraud detection, autonomous robotics, content moderation, and healthcare diagnostics. In each domain, the retention mechanism enables systems to learn incrementally, personalize outputs, and respond to evolving real-world challenges effectively. By emulating key aspects of human learning, this retention-enhanced architecture fosters a more fluid and responsive AI paradigm, paving the way for dynamic, session-aware models that extend the capabilities of traditional Transformers into domains requiring continual adaptation.
https://arxiv.org/abs/2501.09166
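The write/recall cycle of such a persistent memory can be sketched as dot-product attention over an episodic buffer. This is a schematic of the described mechanism with assumed capacity handling (a simple FIFO cap), not the paper's implementation.

```python
import numpy as np

class RetentionLayer:
    """Schematic persistent-memory module: store (key, value) episodes at
    run time and recall by softmax attention over the stored keys."""
    def __init__(self, dim, capacity=512):
        self.keys = np.empty((0, dim))
        self.values = np.empty((0, dim))
        self.capacity = capacity

    def write(self, key, value):
        # Append the episode; drop the oldest entries past capacity (an
        # assumed FIFO policy standing in for the paper's buffer management).
        self.keys = np.vstack([self.keys, key])[-self.capacity:]
        self.values = np.vstack([self.values, value])[-self.capacity:]

    def recall(self, query):
        scores = self.keys @ np.asarray(query)   # dot-product attention
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return w @ self.values                   # weighted readout

mem = RetentionLayer(dim=2)
mem.write(np.array([1.0, 0.0]), np.array([5.0, 5.0]))
mem.write(np.array([0.0, 1.0]), np.array([-5.0, -5.0]))
print(mem.recall(np.array([10.0, 0.0])))   # nearest stored episode dominates
```

Because `write` happens at inference time, the module accumulates session-specific patterns that the frozen pretrained weights alone could never hold, which is the gap the Retention Layer is meant to bridge.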
Turn-taking is a fundamental aspect of conversation, but current Human-Robot Interaction (HRI) systems often rely on simplistic, silence-based models, leading to unnatural pauses and interruptions. This paper investigates, for the first time, the application of general turn-taking models, specifically TurnGPT and Voice Activity Projection (VAP), to improve conversational dynamics in HRI. These models are trained on human-human dialogue data using self-supervised learning objectives, without requiring domain-specific fine-tuning. We propose methods for using these models in tandem to predict when a robot should begin preparing responses, take turns, and handle potential interruptions. We evaluated the proposed system in a within-subject study against a traditional baseline system, using the Furhat robot with 39 adults in a conversational setting, in combination with a large language model for autonomous response generation. The results show that participants significantly prefer the proposed system, and it significantly reduces response delays and interruptions.
https://arxiv.org/abs/2501.08946
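Using the two models in tandem might reduce to a policy like the following, where the probability inputs and thresholds are hypothetical stand-ins for the TurnGPT and VAP outputs, not values from the paper:

```python
def turn_decision(turngpt_p_end, vap_p_user_continues,
                  prep_threshold=0.4, take_threshold=0.7):
    """Tandem turn-taking policy (schematic): start preparing a response
    when the text model predicts the turn is likely ending, and actually
    speak only once the voice-activity projection also agrees the user
    will not continue."""
    if turngpt_p_end >= take_threshold and vap_p_user_continues < 0.5:
        return "take-turn"
    if turngpt_p_end >= prep_threshold:
        return "prepare-response"
    return "wait"

print(turn_decision(0.8, 0.2))   # take-turn
print(turn_decision(0.5, 0.6))   # prepare-response
print(turn_decision(0.1, 0.9))   # wait
```

Separating "prepare" from "take" is what lets the robot hide response-generation latency, the delay reduction the study measures, without committing to speak before the user has actually yielded the floor.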
Mobile robot fleets are currently used in different scenarios such as medical environments or logistics. The management of these systems poses challenges ranging from the control of each robot's movement to the allocation of the tasks to be performed. The Task Allocation (TA) problem is a key topic in the proper management of mobile robot fleets, as it governs the minimization of energy consumption and of the number of robots required. Solutions in this area are essential to achieving the economic and environmental sustainability of robot fleets, mainly in industrial applications such as warehouse logistics. The minimization of energy consumption casts TA as an optimization problem, which has been treated in recent studies. This work focuses on the analysis of current trends in solving the TA problem for mobile robot fleets. The main TA optimization algorithms are presented, including novel methods based on Artificial Intelligence (AI). Additionally, this work showcases the most important results extracted from simulations, including the frameworks utilized for their development. Finally, conclusions are drawn from the analysis, highlighting gaps that must be addressed in future work.
https://arxiv.org/abs/2501.08726
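When per-robot, per-task energy costs can be estimated, the simplest form of TA reduces to a classic assignment problem. A minimal example using the Hungarian algorithm, with travel distance as an assumed proxy for energy (real TA formulations in the surveyed work are richer, with multi-task schedules and battery constraints):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = energy proxy for robot i doing task j (here: travel distance).
robots = np.array([[0.0, 0.0], [5.0, 0.0]])
tasks = np.array([[4.0, 0.0], [1.0, 0.0]])
cost = np.linalg.norm(robots[:, None, :] - tasks[None, :, :], axis=2)

# Hungarian algorithm: minimum-total-cost one-to-one assignment.
rows, cols = linear_sum_assignment(cost)
print(list(zip(rows, cols)), cost[rows, cols].sum())
```

The greedy pairing (each robot to its nearest task) and the optimal one coincide here, but in general they do not, which is precisely why TA is treated as a global optimization problem rather than a per-robot decision.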