One goal of dexterous robotic grasping is to allow robots to handle objects with the same level of flexibility and adaptability as humans. However, it remains a challenging task to generate an optimal grasping strategy for dexterous hands, especially when it comes to delicate manipulation and accurate adjustment of the desired grasping poses for objects of varying shapes and sizes. In this paper, we propose a novel dexterous grasp generation scheme called \textbf{\textit{GrainGrasp}} that provides fine-grained contact guidance for each fingertip. In particular, we employ a generative model to predict separate contact maps for each fingertip on the object point cloud, effectively capturing the specifics of finger-object interactions. In addition, we develop a new dexterous grasping optimization algorithm that relies solely on the point cloud as input, eliminating the need for complete mesh information of the object. By leveraging the contact maps of the individual fingertips, the proposed optimization algorithm can generate precise and determinable strategies for human-like object grasping. Experimental results confirm the efficiency of the proposed scheme. Our code is available at this https URL
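A minimal sketch of the contact-map-guided idea, assuming per-fingertip contact maps are given as probabilities over the object point cloud (the objective below is an illustrative attraction loss, not the paper's exact algorithm):

```python
# Toy sketch: each fingertip is pulled toward the object points that its own
# predicted contact map scores highly; fingertip positions are optimized by
# gradient descent. All tensors here are random stand-ins.
import torch

torch.manual_seed(0)
num_points, num_fingers = 2048, 5
obj_pts = torch.rand(num_points, 3)                                           # object point cloud
contact_maps = torch.softmax(torch.randn(num_fingers, num_points), dim=1)    # per-fingertip contact maps (hypothetical)
fingertips = torch.rand(num_fingers, 3, requires_grad=True)                  # initial fingertip positions

opt = torch.optim.Adam([fingertips], lr=1e-2)
for step in range(200):
    # squared distance from every fingertip to every object point, shape (F, N)
    d2 = ((fingertips[:, None, :] - obj_pts[None, :, :]) ** 2).sum(-1)
    # contact-weighted attraction: fingertips move toward their high-contact regions
    loss = (contact_maps * d2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final loss:", loss.item())
```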
https://arxiv.org/abs/2405.09310
Graph neural networks have proved to be an efficient machine learning technique in real-life applications. Handwriting recognition is one such useful area, where both offline and online handwriting recognition are required. The chain code as a feature extraction technique has shown significant results in the literature, and we have been able to use chain codes with graph neural networks. To the best of our knowledge, this work presents for the first time a novel combination of handwritten trajectory features as chain codes and graph neural networks. Handwritten trajectories for offline handwritten text are evaluated using recovery of the drawing order, whereas online handwritten trajectories are used directly with chain codes. Our results show that the present combination surpasses previous results and minimizes the error rate in only a few epochs.
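For reference, a minimal sketch of the chain-code feature itself: an 8-direction Freeman chain code computed from a sequence of pen positions (the paper's exact feature pipeline and how the codes feed the graph neural network may differ):

```python
# Map consecutive trajectory points to 8-direction Freeman chain-code symbols.
import math

DIRECTIONS = {  # (dx, dy) -> chain-code symbol, counter-clockwise from east
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}

def chain_code(points):
    """Return the chain-code symbols for a list of integer (x, y) pen positions."""
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dx, dy = x1 - x0, y1 - y0
        step = (int(math.copysign(1, dx)) if dx else 0,
                int(math.copysign(1, dy)) if dy else 0)
        if step != (0, 0):
            codes.append(DIRECTIONS[step])
    return codes

print(chain_code([(0, 0), (1, 0), (2, 1), (2, 2), (1, 2)]))  # -> [0, 1, 2, 4]
```

The resulting symbol sequence could then serve as node or edge attributes of a trajectory graph.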
https://arxiv.org/abs/2405.09247
Current speaker diarization systems rely on an external voice activity detection (VAD) model prior to speaker embedding extraction on the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs as well as or better than comparable supervised VAD systems. Subsequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embedding simultaneously, alleviating the need for, and computational overhead of, an external VAD model. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline using ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy achieves state-of-the-art performance on the AMI, VoxConverse and DIHARD III diarization benchmarks.
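A hedged sketch of the downstream step: treating frame-level attention scores from an embedding extractor as VAD logits and turning them into speech segments. The attention values and threshold below are simulated stand-ins; ECAPA2 internals are not shown.

```python
import numpy as np

def frames_to_segments(speech_mask, frame_shift=0.01):
    """Merge a boolean per-frame speech mask into (start, end) segments in seconds."""
    segments, start = [], None
    for i, is_speech in enumerate(speech_mask):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None
    if start is not None:
        segments.append((start * frame_shift, len(speech_mask) * frame_shift))
    return segments

attention_logits = np.random.randn(500)      # stand-in for per-frame attention scores
speech_mask = attention_logits > 0.5         # simple threshold (hypothetical)
print(frames_to_segments(speech_mask)[:3])
```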
https://arxiv.org/abs/2405.09142
Internal Language Model (LM)-based methods use permutation language modeling (PLM) to address the error correction caused by conditional independence in external LM-based methods. However, the random permutations introduced by human interference cause fitting oscillations during model training, and the Iterative Refinement (IR) operation used to improve multimodal information decoupling also introduces additional overhead. To address these issues, this paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP) to enhance the location-context-image interaction capability, improving autoregressive generalization with the internal LM. First, we propose Implicit Permutation Neurons (IPN) to generate adaptive attention masks that dynamically exploit token dependencies. The adaptive masks increase the diversity of the training data and prevent the model from depending on a specific order, reducing the training overhead of PLM while avoiding fitting oscillations. Second, we develop a Cross-modal Hierarchical Attention mechanism (CHA) to couple context and image features. This establishes rich positional semantic dependencies between context and image while avoiding IR. Extensive experimental results show that the proposed HAAP achieves state-of-the-art (SOTA) performance in terms of accuracy, complexity, and latency on several datasets.
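To ground the permutation-mask idea, here is a minimal sketch of the attention mask used in permutation language modeling; HAAP's IPN generates such masks adaptively, whereas this illustration simply samples a random factorization order:

```python
import torch

def plm_attention_mask(perm):
    """mask[i, j] = True if position i may attend to position j,
    i.e. j precedes i in the factorization order `perm`."""
    n = len(perm)
    rank = torch.empty(n, dtype=torch.long)
    rank[perm] = torch.arange(n)           # rank[p] = place of token p in the order
    return rank.unsqueeze(1) > rank.unsqueeze(0)

perm = torch.randperm(6)                   # a random factorization order
print(perm)
print(plm_attention_mask(perm).int())
```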
https://arxiv.org/abs/2405.09125
Humans use collaborative robots as tools for accomplishing various tasks. The interaction between humans and robots happens in tight, shared workspaces. However, these machines must be safe to operate alongside humans to minimize the risk of accidental collisions. Ensuring safety imposes many constraints, such as reduced torque and velocity limits during operation, thus increasing the time needed to accomplish many tasks. However, for applications such as using collaborative robots as haptic interfaces with intermittent contacts for virtual reality, speed limitations result in poor user experiences. This research aims to improve the efficiency of a collaborative robot while improving the safety of the human user. We used Gaussian process models to predict human hand motion and developed strategies for human intention detection based on hand motion and gaze, in order to improve the robot's task time and the human's safety in a virtual environment. We then studied the effect of prediction. Comparison results show that the prediction models improved the robot's task time by 3\% and safety by 17\%. When used alongside gaze, prediction with Gaussian process models improved the robot's task time by 2\% and safety by 13\%.
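A minimal sketch of the prediction ingredient (not the study's exact setup): a Gaussian process regressor fitted to recent time stamps extrapolates the hand position a short horizon ahead, with an uncertainty estimate the robot could act on. The trajectory here is simulated.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

t = np.linspace(0, 1, 50)[:, None]                         # observed time stamps (s)
hand_xy = np.c_[0.3 * t.ravel(), np.sin(2 * t.ravel())]    # simulated hand trajectory (m)

gp = GaussianProcessRegressor(kernel=RBF(0.2) + WhiteKernel(1e-3), normalize_y=True)
gp.fit(t, hand_xy)

t_future = np.array([[1.1], [1.2]])                        # look-ahead horizon
pred, std = gp.predict(t_future, return_std=True)          # predicted positions and uncertainty
print(pred, std)
```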
https://arxiv.org/abs/2405.09109
This study developed an explainable AI for ship collision avoidance. First, a critic network composed of sub-task critic networks was proposed to individually evaluate each sub-task in collision avoidance, clarifying the AI's decision-making processes. Additionally, an attempt was made to discern behavioral intentions through a Q-value analysis and an attention mechanism. The former interprets intentions by examining the increment of the Q-value resulting from the AI's actions, while the latter incorporates the significance of other ships in the collision-avoidance decision-making process into the learning objective. The AI's behavioral intentions in collision avoidance were visualized by combining the perceived collision danger with the degree of attention paid to other ships. The proposed method was evaluated through a numerical experiment. The developed AI was confirmed to be able to safely avoid collisions under various congestion levels, and its decision-making process was rendered comprehensible to humans. The proposed method not only facilitates the understanding of DRL-based controllers/systems in the ship collision avoidance task but also extends to any task comprising sub-tasks.
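A toy sketch of the Q-increment reading of intentions, under the assumption that the increment is measured against a baseline action; the sub-task names, the baseline, and the critic values below are illustrative stand-ins, not the paper's networks.

```python
import numpy as np

subtasks = ["keep course", "avoid ship A", "avoid ship B"]

def subtask_q(state, action):
    """Stand-in for the sub-task critic networks: one Q-value per sub-task."""
    rng = np.random.default_rng(abs(hash((state, action))) % (2 ** 32))
    return rng.normal(size=len(subtasks))

state, chosen, baseline = "s0", "turn starboard", "keep heading"
q_increment = subtask_q(state, chosen) - subtask_q(state, baseline)
for name, dq in zip(subtasks, q_increment):
    # a large positive increment suggests which sub-task motivated the chosen action
    print(f"{name}: dQ = {dq:+.2f}")
```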
https://arxiv.org/abs/2405.09081
Although face analysis has achieved remarkable improvements in the past few years, designing a multi-task face analysis model is still challenging. Most face analysis tasks are studied as separate problems and do not benefit from the synergy among related tasks. In this work, we propose a novel task-adaptive multi-task face analysis method named Q-Face, which simultaneously performs multiple face analysis tasks with a unified model. We fuse the features from multiple layers of a large-scale pre-trained model so that the whole model can use both local and global facial information to support multiple tasks. Furthermore, we design a task-adaptive module that performs cross-attention between a set of query vectors and the fused multi-stage features and adaptively extracts the desired features for each face analysis task. Extensive experiments show that our method can perform multiple tasks simultaneously and achieves state-of-the-art performance on facial expression recognition, action unit detection, face attribute analysis, age estimation, and face pose estimation. Compared to conventional methods, our method opens up new possibilities for multi-task face analysis and shows potential in both accuracy and efficiency.
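A hedged sketch of the task-adaptive idea: a set of learnable query vectors cross-attends to fused multi-stage backbone features and yields one task-specific feature per query. Dimensions, head count, and the fused-token shape are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TaskQueries(nn.Module):
    def __init__(self, num_tasks=5, dim=256, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_tasks, dim))       # one query per face task
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, fused_feats):                                     # (B, N, dim) fused multi-stage tokens
        q = self.queries.unsqueeze(0).expand(fused_feats.size(0), -1, -1)
        task_feats, _ = self.attn(q, fused_feats, fused_feats)          # cross-attention
        return task_feats                                               # (B, num_tasks, dim), one per task head

feats = torch.randn(2, 196, 256)                                        # e.g. tokens gathered from several layers
print(TaskQueries()(feats).shape)                                       # torch.Size([2, 5, 256])
```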
https://arxiv.org/abs/2405.09059
The detection and tracking of small targets in passive optical remote sensing (PORS) has broad applications. However, most previously proposed methods seldom utilize the abundant temporal features formed by target motion, resulting in poor detection and tracking performance for low signal-to-clutter ratio (SCR) targets. In this article, we analyze the difficulty of realizing effective detection based on spatial features and its feasibility based on temporal features. According to this analysis, we use multiple frames as a detection unit and propose a detection method based on temporal energy selective scaling (TESS). Specifically, we investigate the composition of the intensity temporal profiles (ITPs) formed by the pixels of a multi-frame detection unit. For a target-present pixel, the target passing through the pixel introduces a weak transient disturbance in the ITP and changes its statistical properties. We use a well-designed function to amplify the transient disturbance, suppress the background and noise components, and output the trajectory of the target on the multi-frame detection unit. Subsequently, to resolve the conflict between detection rate and false alarm rate introduced by traditional threshold segmentation, we associate the temporal and spatial features of the output trajectory and propose a trajectory extraction method based on the 3D Hough transform. Finally, we model the trajectory of the target and propose a trajectory-based multi-target tracking method. Experiments in multiple scenarios demonstrate that our proposed methods outperform various state-of-the-art detection and tracking methods.
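An illustrative stand-in for the temporal-profile step (not the paper's exact TESS function): for each pixel's intensity temporal profile, brief deviations from a robust per-pixel baseline are amplified relative to the noise scale, so a target transiting the pixel stands out against background and noise. The frame stack below is simulated.

```python
import numpy as np

frames = np.random.normal(100.0, 2.0, size=(32, 64, 64))    # multi-frame detection unit (T, H, W)
frames[12:15, 30, 40] += 12.0                                # weak transient from a moving target

baseline = np.median(frames, axis=0)                         # per-pixel background level
scale = np.median(np.abs(frames - baseline), axis=0) + 1e-6  # per-pixel noise scale (MAD)
energy = ((frames - baseline) / scale) ** 2                  # amplified transient energy

t, y, x = np.unravel_index(energy.argmax(), energy.shape)
print("strongest disturbance at frame", t, "pixel", (y, x))
```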
https://arxiv.org/abs/2405.09054
"How does the person in the bounding box feel?" Achieving human-level recognition of the apparent emotion of a person in real world situations remains an unsolved task in computer vision. Facial expressions are not enough: body pose, contextual knowledge, and commonsense reasoning all contribute to how humans perform this emotional theory of mind task. In this paper, we examine two major approaches enabled by recent large vision language models: 1) image captioning followed by a language-only LLM, and 2) vision language models, under zero-shot and fine-tuned setups. We evaluate the methods on the Emotions in Context (EMOTIC) dataset and demonstrate that a vision language model, fine-tuned even on a small dataset, can significantly outperform traditional baselines. The results of this work aim to help robots and agents perform emotionally sensitive decision-making and interaction in the future.
https://arxiv.org/abs/2405.08992
Current datasets for long-form video understanding often fall short of providing genuine long-form comprehension challenges, as many tasks derived from these datasets can be successfully tackled by analyzing just one or a few random frames from a video. To address this issue, we present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding. This paper details our innovative approach for creating a question-answer dataset, utilizing advanced LLMs with human-in-the-loop and building upon human-generated raw data. Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene. Additionally, we evaluate recent video-centric LLMs, both open-source and proprietary, on the test split of our dataset. The findings reveal that even state-of-the-art video-centric LLMs significantly lag behind human performance in these tasks, highlighting the complexity and challenge inherent in video understanding. The dataset is available at this https URL
https://arxiv.org/abs/2405.08813
Navigating the complex landscape of news articles involves understanding the various actors or entities involved, referred to as news stakeholders. These stakeholders, ranging from policymakers to opposition figures, citizens, and more, play pivotal roles in shaping news narratives. Recognizing their stakeholder types, reflecting their roles, political alignments, social standing, and more, is paramount for a nuanced comprehension of news content. Despite existing works focusing on salient entity extraction, coverage variations, and political affiliations through social media data, the automated detection of stakeholder roles within news content remains an underexplored domain. In this paper, we bridge this gap by introducing an effective approach to classify stakeholder types in news articles. Our method involves transforming the stakeholder classification problem into a natural language inference task, utilizing contextual information from news articles and external knowledge to enhance the accuracy of stakeholder type detection. Moreover, our proposed model showcases efficacy in zero-shot settings, further extending its applicability to diverse news contexts.
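To make the reformulation concrete, here is a sketch of casting stakeholder typing as natural language inference via a zero-shot classification pipeline; the model name, candidate labels, hypothesis template, and example text are illustrative, not the paper's exact configuration.

```python
from transformers import pipeline

# An NLI-based zero-shot classifier: each candidate label is turned into a
# hypothesis and scored for entailment against the article context.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

context = ("The finance minister defended the new budget in parliament, "
           "while union leaders announced a strike against the cuts.")
labels = ["policymaker", "opposition figure", "citizen", "business actor"]

result = classifier(
    context,
    candidate_labels=labels,
    hypothesis_template="The stakeholder described here is a {}.",
)
print(result["labels"][0], round(result["scores"][0], 3))
```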
https://arxiv.org/abs/2405.08751
Non-prehensile manipulation enables fast interactions with objects by circumventing the need to grasp and ungrasp as well as handling objects that cannot be grasped through force closure. Current approaches to non-prehensile manipulation focus on static contacts, avoiding the underactuation that comes with sliding. However, the ability to control sliding contact, essentially removing the no-slip constraint, opens up new possibilities in dynamic manipulation. In this paper, we explore a challenging dynamic non-prehensile manipulation task that requires the consideration of the full spectrum of hybrid contact modes. We leverage recent methods in contact-implicit MPC to handle the multi-modal planning aspect of the task. We demonstrate, with careful consideration of integration between the simple model used for MPC and the low-level tracking controller, how contact-implicit MPC can be adapted to dynamic tasks. Surprisingly, despite the known inaccuracies of frictional rigid contact models, our method is able to react to these inaccuracies while still quickly performing the task. Moreover, we do not use common aids such as reference trajectories or motion primitives, highlighting the generality of our approach. To the best of our knowledge, this is the first application of contact-implicit MPC to a dynamic manipulation task in three dimensions.
https://arxiv.org/abs/2405.08731
Low-resource information extraction remains an ongoing challenge due to the inherent information scarcity within limited training examples. Existing data augmentation methods, considered potential solutions, struggle to strike a balance between weak augmentation (e.g., synonym augmentation) and drastic augmentation (e.g., conditional generation without proper guidance). This paper introduces a novel paradigm that employs targeted augmentation and back validation to produce augmented examples with enhanced diversity, polarity, accuracy, and coherence. Extensive experimental results demonstrate the effectiveness of the proposed paradigm. Furthermore, identified limitations are discussed, shedding light on areas for future improvement.
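A toy sketch of the back-validation idea, assuming a simple keep-or-drop rule: an augmented sentence is retained only if an extractor still recovers the same label from it, filtering out augmentations that drifted too far. The extractor, sentences, and rule below are hypothetical illustrations.

```python
import re

def toy_extractor(sentence):
    """Stand-in for a trained extraction model: returns the capitalized span before 'founded'."""
    m = re.search(r"([A-Z]\w*(?: [A-Z]\w*)*) founded", sentence)
    return m.group(1) if m else None

original = "Acme Robotics founded a research lab in Berlin."
augmentations = [
    "Acme Robotics founded a new research lab in Berlin.",  # targeted, label-preserving
    "A research lab in Berlin was announced.",              # drastic, label lost
]

gold = toy_extractor(original)
validated = [s for s in augmentations if toy_extractor(s) == gold]
print(validated)
```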
https://arxiv.org/abs/2405.08729
Ensuring safety and adapting to the user's behavior are of paramount importance in physical human-robot interaction. Thus, incorporating elastic actuators in the robot's mechanical design has become popular, since it offers intrinsic compliance and additionally provides a coarse estimate of the interaction force by measuring the deformation of the elastic components. While observer-based methods have been shown to improve these estimates, they rely on accurate models of the system, which are challenging to obtain in complex operating environments. In this work, we overcome this issue by learning the unknown dynamics components using Gaussian process (GP) regression. By employing the learned model in a Bayesian filtering framework, we improve the estimation accuracy and additionally obtain an observer that explicitly considers local model uncertainty in the confidence measure of the state estimate. Furthermore, we derive guaranteed estimation error bounds, thus facilitating the use in safety-critical applications. We demonstrate the effectiveness of the proposed approach experimentally in a human-exoskeleton interaction scenario.
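A simplified sketch of the idea, with assumptions throughout: a GP learns the unknown force component from elastic-element deformation, and its predictive variance widens the measurement noise used when fusing the force estimate, so local model uncertainty shows up in the state confidence. Data, kernel choices, and the one-step fusion are illustrative, not the paper's filter.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
deflection = rng.uniform(-0.02, 0.02, size=(80, 1))                  # elastic deflection (rad)
true_force = 150.0 * deflection.ravel() + 3.0 * np.sin(60 * deflection.ravel())
measured = true_force + rng.normal(0, 0.2, size=80)                  # noisy force measurements

gp = GaussianProcessRegressor(kernel=RBF(0.01) + WhiteKernel(0.04), normalize_y=True)
gp.fit(deflection, measured)

x = np.array([[0.015]])
force_mean, force_std = gp.predict(x, return_std=True)               # learned model + local uncertainty

# variance-weighted fusion of a prior force estimate with the GP prediction
prior_mean, prior_var = 2.0, 1.0
meas_var = force_std[0] ** 2
gain = prior_var / (prior_var + meas_var)
posterior_mean = prior_mean + gain * (force_mean[0] - prior_mean)
posterior_var = (1 - gain) * prior_var
print(posterior_mean, posterior_var)
```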
https://arxiv.org/abs/2405.08711
Addressing multi-label action recognition in videos represents a significant challenge for robotic applications in dynamic environments, especially when the robot is required to cooperate with humans in tasks that involve objects. Existing methods still struggle to recognize unseen actions or require extensive training data. To overcome these problems, we propose Dual-VCLIP, a unified approach for zero-shot multi-label action recognition. Dual-VCLIP enhances VCLIP, a zero-shot action recognition method, with the DualCoOp method for multi-label image classification. The strength of our method is that at training time it only learns two prompts, making it much simpler than other methods. We validate our method on the Charades dataset, which includes a majority of object-based actions, demonstrating that -- despite its simplicity -- our method performs favorably with respect to existing methods on the complete dataset and achieves promising performance when tested on unseen actions. Our contribution emphasizes the impact of verb-object class-splits during robots' training for new cooperative tasks, highlighting their influence on performance and giving insights into mitigating biases.
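A hedged sketch of the "only two learned prompts" idea in a DualCoOp-style setup: a positive and a negative context vector are added to each class embedding, and a pooled video feature is scored against both to decide label presence. Dimensions, the scoring rule, and the class count are illustrative, not Dual-VCLIP's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPrompts(nn.Module):
    def __init__(self, num_classes=157, dim=512):
        super().__init__()
        self.pos_ctx = nn.Parameter(torch.randn(dim))                    # learned "present" context
        self.neg_ctx = nn.Parameter(torch.randn(dim))                    # learned "absent" context
        self.class_embed = nn.Parameter(torch.randn(num_classes, dim))   # class text embeddings (frozen in practice)

    def forward(self, video_feat):                                       # (B, dim) pooled video feature
        pos = F.normalize(self.class_embed + self.pos_ctx, dim=-1)
        neg = F.normalize(self.class_embed + self.neg_ctx, dim=-1)
        v = F.normalize(video_feat, dim=-1)
        logits = torch.stack([v @ neg.T, v @ pos.T], dim=-1)             # (B, C, 2)
        return logits.softmax(dim=-1)[..., 1]                            # probability each action is present

print(DualPrompts()(torch.randn(4, 512)).shape)                          # torch.Size([4, 157])
```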
https://arxiv.org/abs/2405.08695
The nature of diversity in real-world environments necessitates that neural network models expand from closed-category settings to accommodate novel emerging categories. In this paper, we study open-vocabulary object detection (OVD), which facilitates the detection of novel object classes under the supervision of only base annotations and open-vocabulary knowledge. However, we find that the inadequacy of neighboring relationships between regions during the alignment process inevitably constrains the performance of recent distillation-based OVD strategies. To this end, we propose Neighboring Region Attention Alignment (NRAA), which performs alignment within the attention mechanism of a set of neighboring regions to boost open-vocabulary inference. Specifically, for a given proposal region, we randomly explore the neighboring boxes and apply our proposed neighboring region attention (NRA) mechanism to extract relationship information. This interaction information is then seamlessly provided to the distillation procedure to assist the alignment between the detector and the pre-trained vision-language models (VLMs). Extensive experiments validate that our proposed model exhibits superior performance on open-vocabulary benchmarks.
https://arxiv.org/abs/2405.08593
Task and Motion Planning (TAMP) algorithms solve long-horizon robotics tasks by integrating task planning with motion planning; the task planner proposes a sequence of actions towards a goal state and the motion planner verifies whether this action sequence is geometrically feasible for the robot. However, state-of-the-art TAMP algorithms do not scale well with the difficulty of the task and require an impractical amount of time to solve relatively small problems. We propose Constraints and Streams for Task and Motion Planning (COAST), a probabilistically-complete, sampling-based TAMP algorithm that combines stream-based motion planning with an efficient, constrained task planning strategy. We validate COAST on three challenging TAMP domains and demonstrate that our method outperforms baselines in terms of cumulative task planning time by an order of magnitude. You can find more supplementary materials on our project \href{this https URL}{website}.
https://arxiv.org/abs/2405.08572
In visual tasks, large teacher models capture essential features and deep information, enhancing performance. However, distilling this information into smaller student models often leads to performance loss due to structural differences and capacity limitations. To tackle this, we propose a distillation framework based on graph knowledge, including a multi-level feature alignment strategy and an attention-guided mechanism to provide a targeted learning trajectory for the student model. We emphasize spectral embedding (SE) as a key technique in our distillation process, which merges the student's feature space with relational knowledge and structural complexities similar to those of the teacher network. This method captures the teacher's understanding in a graph-based representation, enabling the student model to more accurately mimic the complex structural dependencies present in the teacher model. Compared to methods that focus only on specific distillation areas, our strategy not only considers the key features within the teacher model but also captures the relationships and interactions among feature sets, encoding this complex information into a graph structure so that the dynamic relationships among features can be understood and utilized from a global perspective. Experiments show that our method outperforms previous feature distillation methods on the CIFAR-100, MS-COCO, and Pascal VOC datasets, proving its efficiency and applicability.
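A hedged sketch of a spectral-embedding alignment term under simplifying assumptions: each batch of features induces a cosine-affinity graph, whose leading Laplacian eigenvectors give a spectral embedding; comparing the spanned subspaces via projection matrices (which removes sign and rotation ambiguity) measures how well the student preserves the teacher's relational structure. This is an illustration, not the paper's loss.

```python
import numpy as np

def spectral_subspace(feats, k=4):
    """Projection onto the k-dimensional spectral embedding of a feature-affinity graph."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    w = np.clip(f @ f.T, 0, None)            # cosine affinity graph over the batch
    d = np.diag(w.sum(axis=1))
    lap = d - w                               # unnormalized graph Laplacian
    _, vecs = np.linalg.eigh(lap)
    u = vecs[:, 1:k + 1]                      # skip the trivial constant eigenvector
    return u @ u.T                            # projection matrix of the embedding subspace

teacher = np.random.randn(32, 512)            # teacher features for one batch (stand-in)
student = np.random.randn(32, 128)            # student features for the same batch (stand-in)
loss = np.mean((spectral_subspace(teacher) - spectral_subspace(student)) ** 2)
print(loss)
```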
https://arxiv.org/abs/2405.08547
Conversation requires a substantial amount of coordination between dialogue participants, from managing turn taking to negotiating mutual understanding. Part of this coordination effort surfaces as the reuse of linguistic behaviour across speakers, a process often referred to as alignment. While the presence of linguistic alignment is well documented in the literature, several questions remain open, including the extent to which patterns of reuse across speakers have an impact on the emergence of labelling conventions for novel referents. In this study, we put forward a methodology for automatically detecting shared lemmatised constructions -- expressions with a common lexical core used by both speakers within a dialogue -- and apply it to a referential communication corpus where participants aim to identify novel objects for which no established labels exist. Our analyses uncover the usage patterns of shared constructions in interaction and reveal that features such as their frequency and the amount of different constructions used for a referent are associated with the degree of object labelling convergence the participants exhibit after social interaction. More generally, the present study shows that automatically detected shared constructions offer a useful level of analysis to investigate the dynamics of reference negotiation in dialogue.
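A minimal sketch of the detection idea (a real pipeline would lemmatise with an NLP toolkit rather than lowercase-and-split): word n-grams produced by both speakers within a dialogue serve as a rough proxy for shared constructions around a common lexical core. The dialogue snippets are invented for illustration.

```python
from collections import Counter

def ngrams(utterances, n=2):
    """Count word n-grams across a speaker's utterances."""
    grams = Counter()
    for utt in utterances:
        toks = utt.lower().split()
        grams.update(zip(*(toks[i:] for i in range(n))))
    return grams

speaker_a = ["it looks like a tiny rocket", "the tiny rocket thing again"]
speaker_b = ["yes the tiny rocket", "put the tiny rocket on the left"]

shared = ngrams(speaker_a) & ngrams(speaker_b)   # intersection keeps the minimum count per n-gram
print(shared.most_common(3))
```

Frequencies of such shared items, and how many different ones are used per referent, are the kinds of features the analysis relates to labelling convergence.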
https://arxiv.org/abs/2405.08546
The IoT and Business Process Management (BPM) communities co-exist in many shared application domains, such as manufacturing and healthcare. The IoT community has a strong focus on hardware, connectivity and data; the BPM community focuses mainly on finding, controlling, and enhancing the structured interactions among the IoT devices in processes. While the field of Process Mining deals with the extraction of process models and process analytics from process event logs, the data produced by IoT sensors often is at a lower granularity than these process-level events. The fundamental questions about extracting and abstracting process-related data from streams of IoT sensor values are: (1) Which sensor values can be clustered together as part of process events?, (2) Which sensor values signify the start and end of such events?, (3) Which sensor values are related but not essential? This work proposes a framework to semi-automatically perform a set of structured steps to convert low-level IoT sensor data into higher-level process events that are suitable for process mining. The framework is meant to provide a generic sequence of abstract steps to guide the event extraction, abstraction, and correlation, with variation points for plugging in specific analysis techniques and algorithms for each step. To assess the completeness of the framework, we present a set of challenges, how they can be tackled through the framework, and an example on how to instantiate the framework in a real-world demonstration from the field of smart manufacturing. Based on this framework, future research can be conducted in a structured manner through refining and improving individual steps.
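An illustrative instantiation of the extraction and abstraction steps, with assumed thresholds and names: consecutive above-threshold power readings are grouped into a single "machine running" process event with start and end timestamps, yielding event-log rows suitable for process mining.

```python
power = [1, 2, 40, 42, 41, 39, 2, 1, 38, 40, 2]      # raw sensor stream (one value per second)

def to_events(values, threshold=10, activity="machine running", case_id="job-1"):
    """Abstract a low-level sensor stream into higher-level process events."""
    events, start = [], None
    for t, v in enumerate(values):
        if v > threshold and start is None:
            start = t                                 # sensor values signalling the event start
        elif v <= threshold and start is not None:
            events.append({"case": case_id, "activity": activity,
                           "start": start, "end": t})  # ...and its end
            start = None
    if start is not None:
        events.append({"case": case_id, "activity": activity,
                       "start": start, "end": len(values)})
    return events

print(to_events(power))
```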
https://arxiv.org/abs/2405.08528