Gaze following and social gaze prediction are fundamental tasks providing insights into human communication behaviors, intent, and social interactions. Most previous approaches addressed these tasks separately, either by designing highly specialized social gaze models that do not generalize to other social gaze tasks or by treating social gaze inference as an ad-hoc post-processing step of the gaze following task. Furthermore, the vast majority of gaze following approaches have proposed static models that can handle only one person at a time, therefore failing to take advantage of social interactions and temporal dynamics. In this paper, we address these limitations and introduce a novel framework to jointly predict the gaze target and social gaze label for all people in the scene. The framework comprises: (i) a temporal, transformer-based architecture that, in addition to image tokens, handles person-specific tokens capturing the gaze information related to each individual; (ii) a new dataset, VSGaze, that unifies annotation types across multiple gaze following and social gaze datasets. We show that our model trained on VSGaze can address all tasks jointly, and achieves state-of-the-art results for multi-person gaze following and social gaze prediction.
凝视跟随与社交凝视预测是理解人类交流行为、意图和社交互动的基础任务。以往的方法大多将这两类任务分开处理:要么设计高度专门化、无法泛化到其他社交凝视任务的模型,要么将社交凝视推断视为凝视跟随任务的临时后处理。此外,绝大多数凝视跟随方法提出的都是静态模型,每次只能处理一个人,因而无法利用社交互动和时间动态。在本文中,我们针对这些局限,提出了一个新框架,可联合预测场景中所有人的凝视目标和社交凝视标签。该框架包括:(i)一个基于Transformer的时序架构,除图像标记(token)外,还处理捕捉每个个体凝视信息的个人专属标记;(ii)一个新数据集VSGaze,它统一了多个凝视跟随与社交凝视数据集的标注类型。我们证明,在VSGaze上训练的模型可以联合处理所有任务,并在多人凝视跟随和社交凝视预测上取得了最先进的结果。
https://arxiv.org/abs/2403.10511
Knowledge Measures (KMs) aim at quantifying the amount of knowledge/information that a knowledge base carries. On the other hand, Belief Change (BC) is the process of changing beliefs (in our case, in terms of contraction, expansion and revision) taking into account a new piece of knowledge, which may possibly contradict the current belief. We propose a new quantitative BC framework based on KMs, defining belief change operators that try to minimise, from an information-theoretic point of view, the surprise that the changed belief carries. To this end, we introduce the principle of minimal surprise. In particular, our contributions are (i) a general information-theoretic approach to KMs for which [1] is a special case; (ii) KM-based BC operators that satisfy the so-called AGM postulates; and (iii) a characterisation of any BC operator that satisfies the AGM postulates as a KM-based BC operator, i.e., any BC operator satisfying the AGM postulates can be encoded within our quantitative BC framework. We also introduce quantitative measures that account for the information loss of contraction, the information gain of expansion and the information change of revision. We also take a succinct look at the problem of iterated revision, which concerns the application of a sequence of revision operations in our framework, and illustrate how one may build from our KM-based contraction operator one that does not satisfy the (in)famous recovery postulate, focusing on the so-called severe withdrawal model as an illustrative example.
知识度量(KMs)旨在量化一个知识库所承载的知识/信息量。另一方面,信念变化(BC)是在考虑一条新知识(它可能与当前信念相矛盾)的情况下改变信念的过程(在本文中指收缩、扩张和修正)。我们提出了一个基于KMs的新的定量BC框架:通过定义信念变化算子,从信息论的角度最小化变化后的信念所携带的"惊奇"。为此,我们引入了最小惊奇原则。具体而言,我们的贡献包括:(i)一种关于KMs的一般信息论方法,[1]是其特例;(ii)满足所谓AGM公设的基于KM的BC算子;(iii)对任何满足AGM公设的BC算子给出基于KM的刻画,即任何满足AGM公设的BC算子都可以在我们的定量BC框架内编码。我们还引入了刻画收缩的信息损失、扩张的信息增益和修正的信息变化的定量度量。我们还简要探讨了迭代修正问题,即在我们的框架中应用一系列修正操作;并以所谓的严格撤回(severe withdrawal)模型为例,说明如何从我们基于KM的收缩算子出发,构造不满足那条(声名狼藉的)恢复公设的算子。
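The abstract does not spell out a concrete knowledge measure, but the model-counting flavour of information-theoretic KMs can be sketched in a few lines. This is a hedged illustration only: the measure, the formula representation, and all names below are assumptions for the sketch, not the paper's definitions.

```python
from itertools import product
from math import log2

def models(kb, n):
    """Enumerate truth assignments over n variables satisfying every formula
    in kb. Each formula is a Python function mapping an assignment to bool."""
    return [a for a in product([False, True], repeat=n) if all(f(a) for f in kb)]

def info(kb, n):
    """Toy model-counting knowledge measure: fewer models => more information.
    An empty KB (all 2**n assignments are models) carries zero information."""
    return log2(2 ** n / len(models(kb, n)))

# Two variables p, q; formulas as predicates over the assignment (p, q).
p = lambda a: a[0]
p_or_q = lambda a: a[0] or a[1]

kb = [p_or_q]
expanded = kb + [p]        # expansion: add a new piece of knowledge
print(info(kb, 2))         # log2(4/3) ≈ 0.415
print(info(expanded, 2))   # log2(4/2) = 1.0
```

Under this toy measure, expansion can only shrink the model set, so measured information never decreases; the difference (≈ 0.585 bits here) plays the role of the "information gain of expansion" the abstract quantifies.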
https://arxiv.org/abs/2403.10502
Integrating Large Language Models (LLMs) and Vision-Language Models (VLMs) with robotic systems enables robots to process and understand complex natural language instructions and visual information. However, a fundamental challenge remains: for robots to fully capitalize on these advancements, they must have a deep understanding of their physical embodiment. The gap between AI models' cognitive capabilities and their understanding of physical embodiment leads to the following question: Can a robot autonomously understand and adapt to its physical form and functionalities through interaction with its environment? This question underscores the transition towards developing self-modeling robots that do not rely on external sensing or pre-programmed knowledge about their structure. Here, we propose a meta self-modeling approach that can deduce robot morphology through proprioception (the internal sense of position and movement). Our study introduces a 12-DoF reconfigurable legged robot, accompanied by a diverse dataset of 200k unique configurations, to systematically investigate the relationship between robotic motion and robot morphology. Utilizing a deep neural network model comprising a robot signature encoder and a configuration decoder, we demonstrate the capability of our system to accurately predict robot configurations from proprioceptive signals. This research contributes to the field of robotic self-modeling, aiming to enhance robots' understanding of their physical embodiment and adaptability in real-world scenarios.
将大型语言模型(LLMs)和视觉语言模型(VLMs)与机器人系统集成,使机器人能够处理和理解复杂的自然语言指令与视觉信息。然而,一个根本性的挑战依然存在:机器人要充分利用这些进展,必须对自身的物理形态有深入的理解。AI模型的认知能力与对物理形态的理解之间的差距引出了如下问题:机器人能否通过与环境的交互,自主理解并适应其物理形态和功能?这一问题凸显了向自我建模机器人发展的趋势——无需依赖外部传感或关于自身结构的预编程知识。在此,我们提出一种元自我建模方法,可以通过本体感觉(对位置和运动的内部感知)推断机器人形态。我们的研究引入了一台12自由度的可重构腿式机器人,并构建了包含20万种独特构型的多样化数据集,以系统地研究机器人运动与机器人形态之间的关系。利用由机器人签名编码器和构型解码器组成的深度神经网络模型,我们展示了系统从本体感觉信号准确预测机器人构型的能力。这项研究为机器人自我建模领域做出了贡献,旨在增强机器人对自身物理形态的理解及其在真实场景中的适应能力。
https://arxiv.org/abs/2403.10496
It is common to feel pressure in a competitive environment, arising from the desire to succeed in comparison with other individuals or opponents. Although we may become anxious under pressure, it can also drive us to perform at our best in order to keep up with others. Inspired by this, we propose a competitive learning framework that helps an individual robot acquire knowledge from the competition, fully stimulating its dynamic potential in the race. Specifically, competition information among competitors is introduced as an additional auxiliary signal for learning advantaged actions. We further build a Multiagent-Race environment, and extensive experiments demonstrate that robots trained in the competitive environment outperform those trained with state-of-the-art algorithms in a single-robot environment.
在竞争环境中感到压力很常见,这源于我们想在与他人或对手的比较中获得成功的愿望。尽管压力之下我们可能会焦虑,但它也能驱使我们发挥最大潜能,以便跟上他人的步伐。受此启发,我们提出了一个竞争性学习框架,帮助单个机器人从竞争中获取知识,在比赛中充分激发其动力学潜能。具体来说,我们将竞争者之间的竞争信息作为额外的辅助信号来学习优势动作。我们还构建了一个多智能体竞速(Multiagent-Race)环境并开展了大量实验,结果表明,在竞争环境中训练的机器人优于在单机器人环境中用最先进算法训练的机器人。
https://arxiv.org/abs/2403.10487
Integrated task and motion planning (TAMP) has proven to be a valuable approach to generalizable long-horizon robotic manipulation and navigation problems. However, the typical TAMP problem formulation assumes full observability and deterministic action effects. These assumptions limit the ability of the planner to gather information and make decisions that are risk-aware. We propose a strategy for TAMP with Uncertainty and Risk Awareness (TAMPURA) that is capable of efficiently solving long-horizon planning problems with initial-state and action outcome uncertainty, including problems that require information gathering and avoiding undesirable and irreversible outcomes. Our planner reasons under uncertainty at both the abstract task level and continuous controller level. Given a set of closed-loop goal-conditioned controllers operating in the primitive action space and a description of their preconditions and potential capabilities, we learn a high-level abstraction that can be solved efficiently and then refined to continuous actions for execution. We demonstrate our approach on several robotics problems where uncertainty is a crucial factor and show that reasoning under uncertainty in these problems outperforms previously proposed determinized planning, direct search, and reinforcement learning strategies. Lastly, we demonstrate our planner on two real-world robotics problems using recent advancements in probabilistic perception.
集成任务与运动规划(TAMP)已被证明是解决可泛化的长时程机器人操作与导航问题的有价值方法。然而,典型的TAMP问题形式化假设完全可观测和确定性的动作效果。这些假设限制了规划器收集信息和做出风险感知决策的能力。我们提出了一种具有不确定性与风险感知能力的TAMP策略(TAMPURA),能够高效求解存在初始状态和动作结果不确定性的长时程规划问题,包括需要信息收集以及需要避免不良和不可逆结果的问题。我们的规划器在抽象任务层和连续控制器层都进行不确定性推理。给定一组在基本动作空间中运行的闭环目标条件控制器及其前提条件和潜在能力的描述,我们学习一个可被高效求解的高层抽象,再将其细化为连续动作以供执行。我们在多个以不确定性为关键因素的机器人问题上演示了该方法,并表明在这些问题中进行不确定性推理优于此前提出的确定化规划、直接搜索和强化学习策略。最后,我们利用概率感知方面的最新进展,在两个真实世界的机器人问题上演示了我们的规划器。
https://arxiv.org/abs/2403.10454
Humans perceive and construct the world as an arrangement of simple parametric models. In particular, we can often describe man-made environments using volumetric primitives such as cuboids or cylinders. Inferring these primitives is important for attaining high-level, abstract scene descriptions. Previous approaches for primitive-based abstraction estimate shape parameters directly and are only able to reproduce simple objects. In contrast, we propose a robust estimator for primitive fitting, which meaningfully abstracts complex real-world environments using cuboids. A RANSAC estimator guided by a neural network fits these primitives to a depth map. We condition the network on previously detected parts of the scene, parsing it one-by-one. To obtain cuboids from single RGB images, we additionally optimise a depth estimation CNN end-to-end. Naively minimising point-to-primitive distances leads to large or spurious cuboids occluding parts of the scene. We thus propose an improved occlusion-aware distance metric correctly handling opaque scenes. Furthermore, we present a neural network based cuboid solver which provides more parsimonious scene abstractions while also reducing inference time. The proposed algorithm does not require labour-intensive labels, such as cuboid annotations, for training. Results on the NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.
人类将世界感知和构建为由简单参数模型组成的排列。特别地,我们常常可以用立方体或圆柱体等体积基元来描述人造环境。推断这些基元对于获得高层次、抽象的场景描述非常重要。以往基于基元的抽象方法直接估计形状参数,只能复现简单物体。相比之下,我们提出了一种鲁棒的基元拟合估计器,能够用立方体对复杂的真实世界环境进行有意义的抽象。一个由神经网络引导的RANSAC估计器将这些基元拟合到深度图上。我们将网络以先前已检测到的场景部分为条件,逐一解析场景。为了从单张RGB图像中获得立方体,我们还对一个深度估计CNN进行端到端优化。简单地最小化点到基元的距离会产生过大或虚假的立方体,遮挡场景的一部分。为此,我们提出了一种改进的遮挡感知距离度量,能正确处理不透明场景。此外,我们还提出了一个基于神经网络的立方体求解器,在降低推理时间的同时提供更简洁的场景抽象。所提算法的训练不需要诸如立方体标注之类的劳动密集型标签。在NYU Depth v2数据集上的结果表明,该算法成功地抽象了杂乱的真实世界3D场景布局。
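The abstract builds on a neural-guided RANSAC that fits cuboids to depth maps. As a hedged illustration of the generic RANSAC loop only (uniform sampling and 2D line fitting instead of the paper's guided sampling and cuboids; all names and data are invented):

```python
import random

def fit_line(p1, p2):
    # Line through two points as (a, b, c) with a*x + b*y + c = 0, unit normal.
    (x1, y1), (x2, y2) = p1, p2
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    return a / norm, b / norm, -(a * x1 + b * y1) / norm

def ransac_line(points, iters=200, thresh=0.1, seed=0):
    """Generic RANSAC loop: sample a minimal set, fit a model, score it by
    its inlier count, and keep the best-supported hypothesis."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        p1, p2 = rng.sample(points, 2)
        a, b, c = fit_line(p1, p2)
        inliers = [p for p in points if abs(a * p[0] + b * p[1] + c) < thresh]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    return best_inliers

# Ten points on the line y = x plus two gross outliers.
pts = [(i, i) for i in range(10)] + [(0.0, 5.0), (9.0, 2.0)]
print(len(ransac_line(pts)))  # 10: the outliers are rejected
```

In the paper's setting, the sampling distribution is predicted by a network conditioned on already-explained scene parts, and the scoring uses an occlusion-aware point-to-cuboid distance instead of the naive residual above.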
https://arxiv.org/abs/2403.10452
This study introduces the Hybrid Sequential Manipulation Planner (H-MaP), a novel approach that iteratively performs motion planning using contact points and waypoints for complex sequential manipulation tasks in robotics. Combining optimization-based methods for generalizability and sampling-based methods for robustness, H-MaP enhances manipulation planning through active contact mode switches and enables interactions with auxiliary objects and tools. This framework, validated by a series of diverse physical manipulation tasks and real-robot experiments, offers a scalable and adaptable solution for complex real-world applications in robotic manipulation.
本研究介绍了混合序列操作规划器(H-MaP),这是一种利用接触点和路径点迭代进行运动规划、以解决机器人领域复杂序列操作任务的新方法。H-MaP结合了基于优化的方法(取其泛化能力)与基于采样的方法(取其稳健性),通过主动的接触模式切换增强操作规划,并支持与辅助物体和工具的交互。该框架经过一系列多样的物理操作任务和真实机器人实验的验证,为机器人操作中复杂的现实应用提供了可扩展、可适应的解决方案。
https://arxiv.org/abs/2403.10436
The current societal challenges exceed the capacity of human individual or collective effort alone. As AI evolves, its role within human collectives is poised to vary from an assistive tool to a participatory member. Humans and AI possess complementary capabilities that, when synergized, can achieve a level of collective intelligence that surpasses the collective capabilities of either humans or AI in isolation. However, the interactions in human-AI systems are inherently complex, involving intricate processes and interdependencies. This review incorporates perspectives from network science to conceptualize a multilayer representation of human-AI collective intelligence, comprising a cognition layer, a physical layer, and an information layer. Within this multilayer network, humans and AI agents exhibit varying characteristics; humans differ in diversity from surface-level to deep-level attributes, while AI agents range in degrees of functionality and anthropomorphism. The interplay among these agents shapes the overall structure and dynamics of the system. We explore how agents' diversity and interactions influence the system's collective intelligence. Furthermore, we present an analysis of real-world instances of AI-enhanced collective intelligence. We conclude by addressing the potential challenges in AI-enhanced collective intelligence and offer perspectives on future developments in this field.
当前的社会挑战已超出人类个人或集体努力单独所能应对的范围。随着AI的发展,它在人类集体中的角色将从辅助工具到参与成员不等。人类与AI拥有互补的能力,二者协同可以达到超越人类或AI各自集体能力的集体智能水平。然而,人类-AI系统中的交互本质上是复杂的,涉及错综复杂的过程和相互依赖关系。本综述引入网络科学的视角,将人类-AI集体智能概念化为一个多层表示,包括认知层、物理层和信息层。在这一多层网络中,人类和AI智能体表现出不同的特征:人类在从表层到深层属性的多样性上各不相同,而AI智能体则在功能性和拟人化程度上各异。这些智能体之间的相互作用塑造了系统的整体结构和动态。我们探讨了智能体的多样性及其交互如何影响系统的集体智能。此外,我们分析了现实世界中AI增强集体智能的实例。最后,我们讨论了AI增强集体智能可能面临的挑战,并对该领域的未来发展提出了展望。
https://arxiv.org/abs/2403.10433
Current rock engineering design in drill and blast tunnelling primarily relies on engineers' observational assessments. Measure While Drilling (MWD) data, a high-resolution sensor dataset collected during tunnel excavation, is underutilised, mainly serving for geological visualisation. This study aims to automate the translation of MWD data into actionable metrics for rock engineering. It seeks to link data to specific engineering actions, thus providing critical decision support for geological challenges ahead of the tunnel face. Leveraging a large and geologically diverse dataset of 500,000 drillholes from 15 tunnels, the research introduces models for accurate rock mass quality classification in a real-world tunnelling context. Both conventional machine learning and image-based deep learning are explored to classify MWD data into Q-classes and Q-values, examples of metrics describing the stability of the rock mass, using both tabular and image data. The results indicate that the K-nearest neighbours algorithm in an ensemble with tree-based models using tabular data, effectively classifies rock mass quality. It achieves a cross-validated balanced accuracy of 0.86 in classifying rock mass into the Q-classes A, B, C, D, E1, E2, and 0.95 for a binary classification with E versus the rest. Classification using a CNN with MWD-images for each blasting round resulted in a balanced accuracy of 0.82 for binary classification. Regressing the Q-value from tabular MWD-data achieved cross-validated R2 and MSE scores of 0.80 and 0.18 for a similar ensemble model as in classification. High performance in regression and classification boosts confidence in automated rock mass assessment. Applying advanced modelling on a unique dataset demonstrates MWD data's value in improving rock mass classification accuracy and advancing data-driven rock engineering design, reducing manual intervention.
目前,钻爆法隧道施工中的岩石工程设计主要依赖工程师的观察评估。随钻测量(MWD)数据是隧道开挖过程中采集的高分辨率传感器数据集,目前利用不足,主要仅用于地质可视化。本研究旨在将MWD数据自动转化为岩石工程中可付诸行动的指标,力求把数据与具体的工程措施联系起来,从而为掌子面前方的地质挑战提供关键决策支持。利用来自15条隧道、包含50万个钻孔的大型且地质多样的数据集,本研究提出了在真实隧道施工背景下准确分类岩体质量的模型。我们同时探索了传统机器学习和基于图像的深度学习,利用表格数据和图像数据将MWD数据分类为Q等级和Q值(两种描述岩体稳定性的指标)。结果表明,使用表格数据、由K最近邻算法与树模型组成的集成方法能有效分类岩体质量:在将岩体分为A、B、C、D、E1、E2六个Q等级时,交叉验证的平衡准确率达到0.86;在E类与其余类别的二分类中达到0.95。使用CNN对每一爆破循环的MWD图像进行二分类,平衡准确率为0.82。用与分类相似的集成模型从表格MWD数据回归Q值,交叉验证的R2和MSE分别为0.80和0.18。回归和分类的高性能增强了对自动化岩体评估的信心。在这一独特数据集上应用先进建模方法,证明了MWD数据在提高岩体分类准确性、推动数据驱动的岩石工程设计以及减少人工干预方面的价值。
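As a rough illustration of the tabular classification idea at the heart of the abstract, here is plain k-nearest neighbours with majority voting on invented two-feature samples; the paper's actual features, ensemble with tree-based models, and data are not reproduced here.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Classify x by majority vote among the k nearest training samples
    (Euclidean distance on the feature vectors)."""
    nearest = sorted(zip(train_X, train_y), key=lambda t: dist(t[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy stand-ins for per-hole MWD features (e.g. penetration rate, rotation
# pressure) labelled with rock-quality classes; all values are invented.
train_X = [(2.0, 1.0), (2.2, 0.9), (1.8, 1.1),   # good rock
           (5.0, 4.0), (5.3, 4.2), (4.8, 3.9)]   # poor rock
train_y = ["A/B", "A/B", "A/B", "E", "E", "E"]

print(knn_predict(train_X, train_y, (2.1, 1.0)))  # "A/B"
print(knn_predict(train_X, train_y, (5.1, 4.1)))  # "E"
```

In practice one would use a library implementation (e.g. scikit-learn's `KNeighborsClassifier`) and combine it with tree-based models in an ensemble, as the abstract describes.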
https://arxiv.org/abs/2403.10404
Manipulating deformable objects remains a challenge within robotics due to the difficulties of state estimation, long-horizon planning, and predicting how the object will deform given an interaction. These challenges are the most pronounced with 3D deformable objects. We propose SculptDiff, a goal-conditioned diffusion-based imitation learning framework that works with point cloud state observations to directly learn clay sculpting policies for a variety of target shapes. To the best of our knowledge this is the first real-world method that successfully learns manipulation policies for 3D deformable objects. For sculpting videos and access to our dataset and hardware CAD models, see the project website: this https URL
由于状态估计困难、需要长时程规划、且难以预测物体在交互下如何变形,操作可变形物体在机器人领域仍是一大挑战。这些挑战在3D可变形物体上最为突出。我们提出了SculptDiff,一种基于目标条件扩散的模仿学习框架,它以点云状态观测为输入,直接学习针对多种目标形状的黏土雕塑策略。据我们所知,这是第一个在真实世界中成功学习3D可变形物体操作策略的方法。雕塑视频以及我们的数据集和硬件CAD模型请见项目网站:this https URL
https://arxiv.org/abs/2403.10401
The field of autonomous driving has attracted considerable interest in approaches that directly infer 3D objects in the Bird's Eye View (BEV) from multiple cameras. Some attempts have also explored utilizing 2D detectors from single images to enhance the performance of 3D detection. However, these approaches rely on a two-stage process with separate detectors, where the 2D detection results are utilized only once for token selection or query initialization. In this paper, we present a single model termed SimPB, which simultaneously detects 2D objects in the perspective view and 3D objects in the BEV space from multiple cameras. To achieve this, we introduce a hybrid decoder consisting of several multi-view 2D decoder layers and several 3D decoder layers, specifically designed for their respective detection tasks. A Dynamic Query Allocation module and an Adaptive Query Aggregation module are proposed to continuously update and refine the interaction between 2D and 3D results, in a cyclic 3D-2D-3D manner. Additionally, Query-group Attention is utilized to strengthen the interaction among 2D queries within each camera group. In the experiments, we evaluate our method on the nuScenes dataset and demonstrate promising results for both 2D and 3D detection tasks. Our code is available at: this https URL.
自动驾驶领域对从多相机直接推断鸟瞰图(BEV)中3D物体的方法产生了浓厚兴趣。一些工作还探索了利用单张图像上的2D检测器来提升3D检测性能。然而,这些方法依赖于使用独立检测器的两阶段流程,其中2D检测结果仅被使用一次,用于标记(token)选择或查询初始化。在本文中,我们提出了一个名为SimPB的单一模型,可同时从多相机检测透视图中的2D物体和BEV空间中的3D物体。为此,我们引入了一个混合解码器,由若干多视角2D解码器层和若干3D解码器层组成,分别针对各自的检测任务设计。我们提出了动态查询分配模块和自适应查询聚合模块,以循环的3D-2D-3D方式不断更新和细化2D与3D结果之间的交互。此外,我们还使用查询组注意力(Query-group Attention)来加强每个相机组内2D查询之间的交互。在实验中,我们在nuScenes数据集上评估了我们的方法,并在2D和3D检测任务上均展示了可观的结果。我们的代码见:this https URL。
https://arxiv.org/abs/2403.10353
Humans can learn a new word and infer its grammatical properties from very few examples. They have an abstract notion of linguistic properties like grammatical gender and agreement rules that can be applied to novel syntactic contexts and words. Drawing inspiration from psycholinguistics, we conduct a noun learning experiment to assess whether an LSTM and a decoder-only transformer can achieve human-like abstraction of grammatical gender in French. Language models were tasked with learning the gender of a novel noun embedding from a few examples in one grammatical agreement context and predicting agreement in another, unseen context. We find that both language models effectively generalise novel noun gender from one to two learning examples and apply the learnt gender across agreement contexts, albeit with a bias for the masculine gender category. Importantly, the few-shot updates were only applied to the embedding layers, demonstrating that models encode sufficient gender information within the word embedding space. While the generalisation behaviour of models suggests that they represent grammatical gender as an abstract category, like humans, further work is needed to explore the details of how exactly this is implemented. For a comparative perspective with human behaviour, we conducted an analogous one-shot novel noun gender learning experiment, which revealed that native French speakers, like language models, also exhibited a masculine gender bias and are not excellent one-shot learners either.
人类可以从极少的例子中学会一个新词并推断其语法属性。他们对语法性、一致性规则等语言属性有抽象的概念,可将其应用于新的句法语境和新词。借鉴心理语言学,我们开展了一项名词学习实验,评估LSTM和仅解码器Transformer能否实现与人类类似的法语语法性抽象。语言模型的任务是:在一种语法一致性语境中,从少量例子学习一个新名词嵌入的性,并在另一种未见过的语境中预测一致性。我们发现,两种语言模型都能从一到两个学习样例有效泛化新名词的性,并将学到的性应用于不同的一致性语境,尽管存在偏向阳性类别的倾向。重要的是,少样本更新仅应用于嵌入层,这表明模型在词嵌入空间中编码了足够的性信息。虽然模型的泛化行为表明它们像人类一样将语法性表示为抽象类别,但其具体实现细节仍需进一步研究。为了与人类行为进行对比,我们开展了一个类似的单样本新名词性学习实验,结果显示,法语母语者与语言模型一样表现出阳性偏向,而且同样不是出色的单样本学习者。
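The key experimental design (few-shot updates applied only to the embedding layer, with all other parameters frozen) can be caricatured with a tiny logistic model where only the novel noun's embedding vector is updated. Everything here, including the "agreement direction" vector, is invented for illustration and is not the paper's model.

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def update_embedding(emb, context, target, lr=1.0, steps=20):
    """Gradient steps on logistic loss, updating ONLY the noun embedding;
    the context vector (standing in for frozen model weights) never changes."""
    for _ in range(steps):
        p = sigmoid(sum(e * c for e, c in zip(emb, context)))
        g = p - target  # d(loss)/d(logit)
        emb = [e - lr * g * c for e, c in zip(emb, context)]
    return emb

# Frozen "feminine-agreement" direction (invented) and a fresh noun embedding.
context = [1.0, -0.5]
emb = [0.0, 0.0]  # novel noun starts uninformative
before = sigmoid(sum(e * c for e, c in zip(emb, context)))  # 0.5
emb = update_embedding(emb, context, target=1.0)            # the "learning examples"
after = sigmoid(sum(e * c for e, c in zip(emb, context)))
print(before, after)  # probability of feminine agreement rises toward 1
```

The point mirrored here is that moving only the embedding is enough to change agreement behaviour, which is the evidence the abstract cites for gender information living in the embedding space.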
https://arxiv.org/abs/2403.10338
Bagging operations, common in packaging and assisted living applications, are challenging due to a bag's complex deformable properties. To address this, we develop a robotic system for automated bagging tasks using an adaptive structure-of-interest (SOI) manipulation approach. Our method relies on real-time visual feedback to dynamically adjust manipulation without requiring prior knowledge of bag materials or dynamics. We present a robust pipeline featuring state estimation for SOIs using Gaussian Mixture Models (GMM), SOI generation via optimization-based bagging techniques, SOI motion planning with Constrained Bidirectional Rapidly-exploring Random Trees (CBiRRT), and dual-arm manipulation coordinated by Model Predictive Control (MPC). Experiments demonstrate the system's ability to achieve precise, stable bagging of various objects using adaptive coordination of the manipulators. The proposed framework advances the capability of dual-arm robots to perform more sophisticated automation of common tasks involving interactions with deformable objects.
袋装操作在包装和辅助生活应用中很常见,但袋子复杂的可变形特性使其极具挑战性。为此,我们开发了一种采用自适应感兴趣结构(SOI)操作方法的自动袋装机器人系统。我们的方法依靠实时视觉反馈动态调整操作,无需预先了解袋子的材料或动力学特性。我们提出了一个稳健的流水线,包括:使用高斯混合模型(GMM)进行SOI状态估计、通过基于优化的袋装技术生成SOI、使用约束双向快速扩展随机树(CBiRRT)进行SOI运动规划,以及由模型预测控制(MPC)协调的双臂操作。实验表明,借助机械臂的自适应协同,该系统能够对多种物体实现精确、稳定的袋装。所提框架提升了双臂机器人对涉及可变形物体交互的常见任务进行更复杂自动化的能力。
https://arxiv.org/abs/2403.10309
Classical structure-based visual localization methods offer high accuracy but face trade-offs in terms of storage, speed, and privacy. A recent innovation, keypoint scene coordinate regression (KSCR), named D2S, addresses these issues by leveraging graph attention networks to enhance keypoint relationships and predict their 3D coordinates using a simple multilayer perceptron (MLP). Camera pose is then determined via PnP+RANSAC, using established 2D-3D correspondences. While KSCR achieves competitive results, rivaling state-of-the-art image-retrieval methods like HLoc across multiple benchmarks, its performance is hindered when data samples are limited, owing to the deep learning model's reliance on extensive data. This paper addresses this challenge by introducing a pipeline for keypoint descriptor synthesis using Neural Radiance Fields (NeRF). By generating novel poses and feeding them into a trained NeRF model to create new views, our approach enhances KSCR's generalization capabilities in data-scarce environments. The proposed system can improve localization accuracy by up to 50% while requiring only a fraction of the time for data synthesis. Furthermore, its modular design allows for the integration of multiple NeRFs, offering a versatile and efficient solution for visual localization. The implementation is publicly available at: this https URL.
经典的基于结构的视觉定位方法精度高,但在存储、速度和隐私方面存在权衡。最近的一项创新——名为D2S的关键点场景坐标回归(KSCR)——利用图注意力网络增强关键点之间的关系,并用一个简单的多层感知器(MLP)预测其3D坐标,从而解决这些问题;随后利用已建立的2D-3D对应关系,通过PnP+RANSAC求解相机位姿。尽管KSCR在多个基准上取得了有竞争力的结果,可与HLoc等最先进的图像检索方法相媲美,但由于深度学习模型依赖大量数据,当数据样本有限时其性能会受限。本文针对这一挑战,提出了一个使用神经辐射场(NeRF)合成关键点描述符的流水线:生成新的位姿并将其输入训练好的NeRF模型以生成新视图,从而增强KSCR在数据稀缺环境中的泛化能力。所提系统可将定位精度提高多达50%,而数据合成只需很少的时间成本。此外,其模块化设计允许集成多个NeRF,为视觉定位提供了通用且高效的解决方案。实现已公开于:this https URL。
https://arxiv.org/abs/2403.10297
Time-series data in real-world medical settings typically exhibit long-range dependencies and are observed at non-uniform intervals. In such contexts, traditional sequence-based recurrent models struggle. To overcome this, researchers replace recurrent architectures with Neural ODE-based models to model irregularly sampled data and use Transformer-based architectures to account for long-range dependencies. Despite the success of these two approaches, both incur very high computational costs for input sequences of moderate lengths and greater. To mitigate this, we introduce the Rough Transformer, a variation of the Transformer model which operates on continuous-time representations of input sequences and incurs significantly reduced computational costs, critical for addressing long-range dependencies common in medical contexts. In particular, we propose multi-view signature attention, which uses path signatures to augment vanilla attention and to capture both local and global dependencies in input data, while remaining robust to changes in the sequence length and sampling frequency. We find that Rough Transformers consistently outperform their vanilla attention counterparts while obtaining the benefits of Neural ODE-based models using a fraction of the computational time and memory resources on synthetic and real-world time-series tasks.
真实世界医疗场景中的时间序列数据通常表现出长程依赖,且观测间隔不均匀。在这种情况下,传统的基于序列的循环模型难以胜任。为克服这一问题,研究者用基于神经常微分方程(Neural ODE)的模型替代循环架构来建模不规则采样数据,并使用基于Transformer的架构来处理长程依赖。尽管这两类方法都取得了成功,但对于中等长度及更长的输入序列,二者的计算成本都非常高。为缓解这一问题,我们提出了Rough Transformer,这是Transformer模型的一个变体,它在输入序列的连续时间表示上运行,并显著降低计算成本,这对处理医疗场景中常见的长程依赖至关重要。具体而言,我们提出了多视角签名注意力,利用路径签名增强普通注意力,捕捉输入数据中的局部和全局依赖,同时对序列长度和采样频率的变化保持稳健。我们发现,在合成和真实时间序列任务上,Rough Transformer的表现始终优于对应的普通注意力模型,并且只需一小部分计算时间和内存资源即可获得基于Neural ODE的模型的优势。
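Path signatures, the building block behind the proposed multi-view signature attention, can be sketched for the level-2 truncation of a piecewise-linear path. This is the generic construction (iterated integrals evaluated segment by segment), not the paper's attention mechanism; function and variable names are my own.

```python
def signature_level2(path):
    """Truncated (level-2) signature of a piecewise-linear d-dimensional path.
    Returns (S_i per channel, S_ij per channel pair), using the standard
    iterated-integral formulas evaluated exactly on linear segments."""
    d = len(path[0])
    inc = [[b[i] - a[i] for i in range(d)] for a, b in zip(path, path[1:])]
    s1 = [sum(dx[i] for dx in inc) for i in range(d)]
    s2 = [[0.0] * d for _ in range(d)]
    run = [0.0] * d  # cumulative increment before the current segment
    for dx in inc:
        for i in range(d):
            for j in range(d):
                s2[i][j] += run[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            run[i] += dx[i]
    return s1, s2

# A 1-D path 0 -> 1 -> 3: level 1 is the total increment (3.0) and
# S_(1,1) = (total increment)^2 / 2 = 4.5, regardless of how the path
# is sampled -- the robustness to sampling frequency the abstract exploits.
s1, s2 = signature_level2([(0.0,), (1.0,), (3.0,)])
print(s1, s2)  # [3.0] [[4.5]]
```

Resampling the same path more coarsely, e.g. `[(0.0,), (3.0,)]`, yields identical signature terms, which is why signature features remain stable under irregular sampling.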
https://arxiv.org/abs/2403.10288
The deepfake threats to society and cybersecurity have provoked significant public apprehension, driving intensified efforts within the realm of deepfake video detection. Current video-level methods, mostly based on 3D CNNs, achieve good performance but incur high computational demands. This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies. This transformation process involves sequentially masking frames at the same positions within each frame. These frames are then resized into sub-frames and reorganized into the predetermined layout, forming thumbnails. TALL is model-agnostic and has remarkable simplicity, necessitating only minimal code modifications. Furthermore, we introduce a graph reasoning block (GRB) and semantic consistency (SC) loss to strengthen TALL, culminating in TALL++. GRB enhances interactions between different semantic regions to capture semantic-level inconsistency clues. The semantic consistency loss imposes consistency constraints on semantic features to improve model generalization ability. Extensive experiments on intra-dataset, cross-dataset, diffusion-generated image detection, and deepfake generation method recognition show that TALL++ achieves results surpassing or comparable to the state-of-the-art methods, demonstrating the effectiveness of our approaches for various deepfake detection problems. The code is available at this https URL.
深度伪造对社会和网络安全的威胁引发了公众的强烈担忧,推动了深度伪造视频检测领域的研究投入。当前的视频级方法大多基于3D CNN,虽然性能良好,但计算需求很高。本文介绍了一种简洁而有效的策略——缩略图布局(TALL),它将视频片段变换为预定义的布局,以保留空间和时间依赖关系。该变换过程对各帧中相同位置进行顺序遮罩,然后将这些帧缩放为子帧,并重组到预定的布局中,形成缩略图。TALL与模型无关且非常简单,只需极少的代码修改。此外,我们还引入了图推理模块(GRB)和语义一致性(SC)损失来增强TALL,得到TALL++。GRB增强不同语义区域之间的交互,以捕捉语义级的不一致线索;语义一致性损失对语义特征施加一致性约束,以提升模型泛化能力。在数据集内、跨数据集、扩散生成图像检测以及深度伪造生成方法识别上的大量实验表明,TALL++取得了超越或可比于最先进方法的结果,证明了我们的方法在各种深度伪造检测问题上的有效性。代码见:this https URL。
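The thumbnail construction (shrink each frame to a sub-frame, then tile the sub-frames into one fixed grid so a 2D backbone sees spatial and temporal structure at once) can be sketched in a few lines. The masking step, layout sizes, and the naive stride-downsampling below are invented for illustration and differ from TALL's actual parameters.

```python
def thumbnail_layout(frames, sub_h, sub_w, rows, cols):
    """Downsample each frame to sub_h x sub_w by striding, then tile the
    sub-frames row-major into one (rows*sub_h) x (cols*sub_w) thumbnail."""
    assert len(frames) == rows * cols
    def shrink(f):
        hs, ws = len(f) // sub_h, len(f[0]) // sub_w
        return [[f[r * hs][c * ws] for c in range(sub_w)] for r in range(sub_h)]
    subs = [shrink(f) for f in frames]
    thumb = []
    for br in range(rows):           # block row of the layout
        for r in range(sub_h):       # pixel row inside each sub-frame
            row = []
            for bc in range(cols):   # block column of the layout
                row.extend(subs[br * cols + bc][r])
            thumb.append(row)
    return thumb

# Four 4x4 single-channel "frames" (frame k filled with value k),
# tiled into a 2x2 thumbnail layout.
frames = [[[k] * 4 for _ in range(4)] for k in range(4)]
thumb = thumbnail_layout(frames, sub_h=2, sub_w=2, rows=2, cols=2)
print(len(thumb), len(thumb[0]))  # 4 4
print(thumb)  # [[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 3, 3], [2, 2, 3, 3]]
```

Because the result is a single image, any off-the-shelf 2D detector can consume it, which is what makes the strategy model-agnostic.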
https://arxiv.org/abs/2403.10261
Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast, multi-modal object ReID utilizes complementary information from diverse modalities, showing great potential for practical applications. However, previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address the above issues, we propose a novel learning framework named \textbf{EDITOR} to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then, we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards, we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally, to further reduce the effect of backgrounds, we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions, which improve the feature discrimination with background suppression. As a result, our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our methods. The code is available at this https URL.
单模态目标重识别(ReID)在复杂视觉场景中保持稳健性方面面临巨大挑战。相比之下,多模态目标ReID利用来自不同模态的互补信息,展现出巨大的实际应用潜力。然而,以往的方法容易受无关背景的影响,并且通常忽略模态间差距。为解决上述问题,我们提出了一个名为EDITOR的新学习框架,用于从视觉Transformer中为多模态目标ReID选择多样化的标记(token)。我们首先用一个共享的视觉Transformer从不同输入模态提取标记化特征。然后,我们引入空间-频率标记选择(SFTS)模块,自适应地选择兼具空间和频率信息的以目标为中心的标记。随后,我们采用层次化掩码聚合(HMA)模块,促进模态内和模态间的特征交互。最后,为进一步降低背景的影响,我们提出了背景一致性约束(BCC)和以目标为中心的特征细化(OCFR),并将其形式化为两个新的损失函数,通过背景抑制提升特征判别力。因此,我们的框架能够为多模态目标ReID生成更具判别性的特征。在三个多模态ReID基准上的大量实验验证了我们方法的有效性。代码见:this https URL。
https://arxiv.org/abs/2403.10254
While text summarization is a well-known NLP task, in this paper, we introduce a novel and useful variant of it called functionality extraction from Git README files. Though this task is a text2text generation at an abstract level, it involves its own peculiarities and challenges, making existing text2text generation systems not very useful. The motivation behind this task stems from a recent surge in research and development activities around the use of large language models for code-related tasks, such as code refactoring, code summarization, etc. We also release a human-annotated dataset called FuncRead, and develop a battery of models for the task. Our exhaustive experimentation shows that small fine-tuned models beat any baseline models that can be designed using popular black-box or white-box large language models (LLMs) such as ChatGPT and Bard. Our best fine-tuned 7-billion-parameter CodeLlama model exhibits 70% and 20% gains in F1 score over ChatGPT and Bard, respectively.
尽管文本摘要是一项众所周知的NLP任务,本文引入了它的一个新颖而实用的变体:从Git README文件中提取功能。虽然这一任务在抽象层面是文本到文本生成,但它有其独特之处和挑战,使得现有的文本到文本生成系统用处不大。这一任务的动机源于最近围绕大型语言模型用于代码相关任务(如代码重构、代码摘要等)的研究与开发活动的激增。我们还发布了一个人工标注的数据集FuncRead,并为该任务开发了一系列模型。详尽的实验表明,小规模微调模型击败了使用ChatGPT、Bard等流行的黑盒或白盒大型语言模型(LLMs)所能设计的任何基线模型。我们最佳的70亿参数微调CodeLlama模型在F1得分上相比ChatGPT和Bard分别提升了70%和20%。
https://arxiv.org/abs/2403.10205
In recent advancements within the domain of Large Language Models (LLMs), there has been a notable emergence of agents capable of addressing Robotic Process Automation (RPA) challenges through enhanced cognitive capabilities and sophisticated reasoning. This development heralds a new era of scalability and human-like adaptability in goal attainment. In this context, we introduce AUTONODE (Autonomous User-interface Transformation through Online Neuro-graphic Operations and Deep Exploration). AUTONODE employs advanced neuro-graphical techniques to facilitate autonomous navigation and task execution on web interfaces, thereby obviating the necessity for predefined scripts or manual intervention. Our engine empowers agents to comprehend and implement complex workflows, adapting to dynamic web environments with unparalleled efficiency. Our methodology synergizes cognitive functionalities with robotic automation, endowing AUTONODE with the ability to learn from experience. We have integrated an exploratory module, DoRA (Discovery and mapping Operation for graph Retrieval Agent), which is instrumental in constructing a knowledge graph that the engine utilizes to optimize its actions and achieve objectives with minimal supervision. The versatility and efficacy of AUTONODE are demonstrated through a series of experiments, highlighting its proficiency in managing a diverse array of web-based tasks, ranging from data extraction to transaction processing.
在大型语言模型(LLMs)领域的最新进展中,涌现出一批能够凭借增强的认知能力和复杂推理来应对机器人流程自动化(RPA)挑战的智能体。这一发展预示着目标达成将进入一个兼具可扩展性和类人适应性的新时代。在此背景下,我们介绍AUTONODE(通过在线神经图操作与深度探索实现的自主用户界面变换)。AUTONODE采用先进的神经图技术,在网页界面上实现自主导航和任务执行,从而无需预定义脚本或人工干预。我们的引擎使智能体能够理解并执行复杂的工作流,以无与伦比的效率适应动态网页环境。我们的方法论将认知功能与机器人自动化相结合,赋予AUTONODE从经验中学习的能力。我们集成了一个探索模块DoRA(Discovery and mapping Operation for graph Retrieval Agent),它用于构建知识图,引擎借助该知识图以最少的监督优化其行动并达成目标。一系列实验展示了AUTONODE的多样性与有效性,凸显其在从数据提取到交易处理等各类网页任务上的能力。
https://arxiv.org/abs/2403.10171
User Interface (UI) understanding has been an increasingly popular topic over the last few years. So far, the focus has been almost exclusively on web and mobile applications. In this paper, we introduce the harder task of computer UI understanding. With the goal of enabling research in this field, we have generated a dataset with a set of videos where a user is performing a sequence of actions and each image shows the desktop contents at that time point. We also present a framework that is composed of a synthetic sample generation pipeline to augment the dataset with relevant characteristics, and a contrastive learning method to classify images in the videos. We take advantage of the natural conditional, tree-like, relationship of the images' characteristics to regularize the learning of the representations by dealing with multiple partial tasks simultaneously. Experimental results show that the proposed framework outperforms previously proposed hierarchical multi-label contrastive losses in fine-grain UI classification.
用户界面(UI)理解近年来日益热门,但迄今为止的研究几乎只关注Web和移动应用。本文提出了一个更难的任务:计算机(桌面)UI理解。为了推动该领域的研究,我们生成了一个数据集,其中包含一组用户执行一系列操作的视频,每张图像展示该时刻的桌面内容。我们还提出了一个框架,由一个用于以相关特征扩充数据集的合成样本生成流水线和一个用于对视频中图像进行分类的对比学习方法组成。我们利用图像特征之间自然的条件树状关系,通过同时处理多个部分任务来对表示学习进行正则化。实验结果表明,所提框架在细粒度UI分类上优于此前提出的层次多标签对比损失。
https://arxiv.org/abs/2403.10170