Abstract
Temporally locating and classifying fine-grained sub-task segments in long, untrimmed videos is crucial for safe human-robot collaboration. Unlike generic activity recognition, collaborative manipulation requires sub-task labels that are directly robot-executable. We present RoboSubtaskNet, a multi-stage human-to-robot sub-task segmentation framework that couples attention-enhanced I3D features (RGB plus optical flow) with a modified MS-TCN employing a Fibonacci dilation schedule to better capture short-horizon transitions such as reach-pick-place. The network is trained with a composite objective comprising cross-entropy and temporal regularizers (truncated MSE and a transition-aware term) to reduce over-segmentation and encourage valid sub-task progressions. To close the gap between vision benchmarks and control, we introduce RoboSubtask, a dataset of healthcare and industrial demonstrations annotated at the sub-task level and designed for deterministic mapping to manipulator primitives. Empirically, RoboSubtaskNet outperforms MS-TCN and MS-TCN++ on GTEA and on our RoboSubtask benchmark (boundary-sensitive and sequence metrics), while remaining competitive on the long-horizon Breakfast benchmark. Specifically, RoboSubtaskNet attains F1@50 = 79.5%, Edit = 88.6%, Acc = 78.9% on GTEA; F1@50 = 30.4%, Edit = 52.0%, Acc = 53.5% on Breakfast; and F1@50 = 94.2%, Edit = 95.6%, Acc = 92.2% on RoboSubtask. We further validate the full perception-to-execution pipeline on a 7-DoF Kinova Gen3 manipulator, achieving reliable end-to-end behavior in physical trials (overall task success approximately 91.25%). These results demonstrate a practical path from sub-task-level video understanding to deployed robotic manipulation in real-world settings.
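The Fibonacci dilation schedule mentioned above can be illustrated with a minimal sketch. The function names, layer count (10), and kernel size (3) are illustrative assumptions, not values taken from the paper; the point is only to contrast how the receptive field grows under Fibonacci dilations versus MS-TCN's exponentially doubling dilations:

```python
# Sketch comparing a Fibonacci dilation schedule with MS-TCN's exponential
# schedule. Layer count and kernel size are illustrative assumptions.

def fibonacci_dilations(num_layers: int) -> list[int]:
    """Dilation per layer: 1, 1, 2, 3, 5, 8, ..."""
    dils = []
    a, b = 1, 1
    for _ in range(num_layers):
        dils.append(a)
        a, b = b, a + b
    return dils

def exponential_dilations(num_layers: int) -> list[int]:
    """MS-TCN-style dilation per layer: 1, 2, 4, 8, ..."""
    return [2 ** i for i in range(num_layers)]

def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    """Receptive field (in frames) of a stack of dilated 1-D convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)

if __name__ == "__main__":
    fib = fibonacci_dilations(10)    # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
    exp = exponential_dilations(10)  # [1, 2, 4, ..., 512]
    print(receptive_field(3, fib))   # 287 frames
    print(receptive_field(3, exp))   # 2047 frames
```

Under these assumptions the Fibonacci stack spans roughly 287 frames while the exponential stack spans 2047, which matches the abstract's motivation: more layers operate at small dilations, favoring short-horizon transitions over very long temporal context.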
URL
https://arxiv.org/abs/2602.10015