We introduce Perception Encoder (PE), a state-of-the-art encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network. To draw them out, we introduce two alignment methods: language alignment for multimodal language modeling, and spatial alignment for dense prediction. Together with the core contrastive checkpoint, our PE family of models achieves state-of-the-art performance on a wide variety of tasks, including zero-shot image and video classification and retrieval; document, image, and video Q&A; and spatial tasks such as detection, depth estimation, and tracking. To foster further research, we are releasing our models, code, and a novel dataset of synthetically and human-annotated videos.
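For readers unfamiliar with the objective, a minimal NumPy sketch of a symmetric contrastive (CLIP-style) loss of the kind the abstract refers to is below; the batch size, embedding dimension, and temperature are illustrative, not PE's actual training configuration.

```python
# Sketch of a symmetric contrastive (InfoNCE) objective over paired
# image/text embeddings. Illustrative only; not PE's training code.
import numpy as np

def contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss for a batch of matched image/text pairs."""
    # L2-normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # (B, B); matches on the diagonal
    labels = np.arange(len(logits))

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()             # diagonal = positives

    # Average the image->text and text->image directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned, mutually orthogonal embeddings the loss approaches zero; mismatched batches score higher.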
https://arxiv.org/abs/2504.13181
Object 6D pose estimation is a critical challenge in robotics, particularly for manipulation tasks. While prior research combining visual and tactile (visuotactile) information has shown promise, these approaches often struggle with generalization due to the limited availability of visuotactile data. In this paper, we introduce ViTa-Zero, a zero-shot visuotactile pose estimation framework. Our key innovation lies in leveraging a visual model as its backbone and performing feasibility checking and test-time optimization based on physical constraints derived from tactile and proprioceptive observations. Specifically, we model the gripper-object interaction as a spring-mass system, where tactile sensors induce attractive forces, and proprioception generates repulsive forces. We validate our framework through experiments on a real-world robot setup, demonstrating its effectiveness across representative visual backbones and manipulation scenarios, including grasping, object picking, and bimanual handover. Compared to visual-only models, our approach overcomes several drastic failure modes when tracking the in-hand object pose. In our experiments, our approach shows an average increase of 55% in AUC of ADD-S and 60% in ADD, along with an 80% lower position error compared to FoundationPose.
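The spring-mass intuition can be sketched as a toy test-time optimization; this is an illustration, not the authors' implementation, and the force constants, step size, and spherical gripper model are invented for the example.

```python
# Toy sketch of the spring-mass idea: tactile contact points pull the pose
# estimate toward them (attraction), while the gripper body pushes the
# estimate out if it interpenetrates (repulsion). All constants illustrative.
import numpy as np

def refine_position(pos, contact_pts, gripper_center, gripper_radius,
                    k_attr=1.0, k_rep=5.0, steps=100, lr=0.05):
    """Gradient-style refinement of a 3D object position estimate."""
    pos = pos.astype(float).copy()
    for _ in range(steps):
        # Attractive spring forces from each tactile contact point.
        f = k_attr * (contact_pts - pos).sum(axis=0)
        # Repulsive force if the estimate penetrates the (spherical) gripper.
        d = pos - gripper_center
        dist = np.linalg.norm(d)
        if dist < gripper_radius:
            f += k_rep * (gripper_radius - dist) * d / (dist + 1e-9)
        pos += lr * f  # one explicit integration step
    return pos
```

With a single contact point and no interpenetration, the estimate simply relaxes onto the contact.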
https://arxiv.org/abs/2504.13179
Creating a photorealistic scene and human reconstruction from a single monocular in-the-wild video figures prominently in the perception of a human-centric 3D world. Recent neural rendering advances have enabled holistic human-scene reconstruction but require pre-calibrated camera and human poses, and days of training time. In this work, we introduce a novel unified framework that simultaneously performs camera tracking, human pose estimation and human-scene reconstruction in an online fashion. 3D Gaussian Splatting is utilized to learn Gaussian primitives for humans and scenes efficiently, and reconstruction-based camera tracking and human pose estimation modules are designed to enable holistic understanding and effective disentanglement of pose and appearance. Specifically, we design a human deformation module to faithfully reconstruct details and enhance generalizability to out-of-distribution poses. To accurately learn the spatial correlation between human and scene, we introduce occlusion-aware human silhouette rendering and monocular geometric priors, which further improve reconstruction quality. Experiments on the EMDB and NeuMan datasets demonstrate superior or on-par performance with existing methods in camera tracking, human pose estimation, novel view synthesis and runtime. Our project page is at this https URL.
https://arxiv.org/abs/2504.13167
Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework. Our code, model, and benchmark will be released.
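As a rough illustration of the adaptation signal, a pinhole reprojection loss between a predicted world-frame pointmap and observed 2D tracks might look like the following sketch; the camera convention and array shapes are assumptions, not the paper's exact formulation.

```python
# Sketch of a reprojection loss: project predicted world-frame 3D points
# through a pinhole camera and penalize pixel error against 2D tracks.
import numpy as np

def reprojection_loss(points_w, R, t, K, tracks_2d):
    """Mean squared pixel error between projected 3D points and 2D tracks.

    points_w : (N, 3) predicted 3D points in the world frame
    R, t     : world-to-camera rotation (3, 3) and translation (3,)
    K        : (3, 3) pinhole intrinsics
    tracks_2d: (N, 2) observed pixel locations
    """
    cam = points_w @ R.T + t          # world -> camera coordinates
    proj = cam @ K.T                  # apply intrinsics
    uv = proj[:, :2] / proj[:, 2:3]   # perspective divide
    return float(np.mean(np.sum((uv - tracks_2d) ** 2, axis=1)))
```

Minimizing this over the pointmap predictor's outputs ties the 3D predictions to 2D evidence without any 4D ground truth.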
https://arxiv.org/abs/2504.13152
Eradicating poverty is the first goal in the United Nations Sustainable Development Goals. However, aporophobia -- the societal bias against people living in poverty -- constitutes a major obstacle to designing, approving and implementing poverty-mitigation policies. This work presents an initial step towards operationalizing the concept of aporophobia to identify and track harmful beliefs and discriminative actions against poor people on social media. In close collaboration with non-profits and governmental organizations, we conduct data collection and exploration. Then we manually annotate a corpus of English tweets from five world regions for the presence of (1) direct expressions of aporophobia, and (2) statements referring to or criticizing aporophobic views or actions of others, to comprehensively characterize the social media discourse related to bias and discrimination against the poor. Based on the annotated data, we devise a taxonomy of categories of aporophobic attitudes and actions expressed through speech on social media. Finally, we train several classifiers and identify the main challenges for automatic detection of aporophobia in social networks. This work paves the way towards identifying, tracking, and mitigating aporophobic views on social media at scale.
https://arxiv.org/abs/2504.13085
This paper presents a new task-space Non-singular Terminal Super-Twisting Sliding Mode (NT-STSM) controller with adaptive gains for robust trajectory tracking of a 7-DOF robotic manipulator. The proposed approach addresses the challenges of chattering, unknown disturbances, and rotational motion tracking, making it well suited for high-DOF manipulators in dexterous manipulation tasks. A rigorous boundedness proof is provided, offering gain selection guidelines for practical implementation. Simulations and hardware experiments with external disturbances demonstrate the proposed controller's robust, accurate tracking with reduced control effort under unknown disturbances compared to other NT-STSM and conventional controllers. The results demonstrate that the proposed NT-STSM controller mitigates chattering and instability in complex motions, making it a viable solution for dexterous robotic manipulations and various industrial applications.
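To make the control structure concrete, here is a generic super-twisting law on a toy double integrator with a linear sliding surface; the paper's controller instead uses a task-space nonsingular terminal surface with adaptive gains, so the surface, gains, and plant below are illustrative only.

```python
# Generic super-twisting sliding-mode law on a double integrator.
# Not the paper's controller; gains and surface are illustrative.
import numpy as np

def simulate_stsm(e0, de0, k1=5.0, k2=10.0, c=2.0, dt=1e-3, T=5.0):
    """Regulate a double integrator to the origin; returns the final |error|."""
    e, de, u2 = e0, de0, 0.0
    for _ in range(int(T / dt)):
        s = de + c * e                                 # linear sliding surface
        u = -k1 * np.sqrt(abs(s)) * np.sign(s) + u2    # continuous term
        u2 += -k2 * np.sign(s) * dt                    # integral of switching term
        de += u * dt                                   # plant: dde = u
        e += de * dt
    return abs(e)
```

The square-root term drives the surface to zero in finite time while the integral term absorbs the discontinuity, which is what reduces chattering relative to first-order sliding mode.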
https://arxiv.org/abs/2504.13056
Achieving versatile and explosive motion with robustness against dynamic uncertainties is a challenging task. Introducing parallel compliance in quadrupedal design is believed to enhance locomotion performance, which, however, makes the control task even harder. This work aims to address this challenge by proposing a general template model and establishing an efficient motion planning and control pipeline. To start, we propose a reduced-order template model, the dual-legged actuated spring-loaded inverted pendulum with trunk rotation, which explicitly models parallel compliance by decoupling spring effects from active motor actuation. With this template model, versatile acrobatic motions, such as pronking, froggy jumping, and hop-turn, are generated by a dual-layer trajectory optimization, where the singularity-free body rotation representation is taken into consideration. Integrated with a linear singularity-free tracking controller, enhanced quadrupedal locomotion is achieved. Comparisons with the existing template model reveal the improved accuracy and generalization of our model. Hardware experiments with a rigid quadruped and a newly designed compliant quadruped demonstrate that i) the template model enables generating versatile dynamic motion; ii) parallel elasticity enhances explosive motion. For example, the maximal pronking distance, hop-turn yaw angle, and froggy jumping distance increase at least by 25%, 15% and 25%, respectively; iii) parallel elasticity improves the robustness against dynamic uncertainties, including modelling errors and external disturbances. For example, the allowable support surface height variation increases by 100% for robust froggy jumping.
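For intuition, the classical actuated-SLIP stance-phase dynamics (mass m, leg length r, rest length r_0, stiffness k, leg angle \theta measured from the vertical, and actuator force F along the leg) can be written as below; the paper's template additionally models a second leg and trunk rotation, which this sketch omits.

```latex
m\left(\ddot{r} - r\,\dot{\theta}^{2}\right) = k\,(r_{0} - r) + F - m g \cos\theta,
\qquad
m\left(r\,\ddot{\theta} + 2\,\dot{r}\,\dot{\theta}\right) = m g \sin\theta .
```

Decoupling the passive spring term k(r_0 - r) from the active input F is what lets the template expose parallel compliance explicitly, as the abstract describes.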
https://arxiv.org/abs/2504.12854
The significant achievements of pre-trained models leveraging large volumes of data in the field of NLP and 2D vision inspire us to explore the potential of extensive data pre-training for 3D perception in autonomous driving. Toward this goal, this paper proposes to utilize massive unlabeled data from heterogeneous datasets to pre-train 3D perception models. We introduce a self-supervised pre-training framework that learns effective 3D representations from scratch on unlabeled data, combined with a prompt adapter based domain adaptation strategy to reduce dataset bias. The approach significantly improves model performance on downstream tasks such as 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and shows a steady performance increase as the training data volume scales up, demonstrating the potential to continually benefit 3D perception models for autonomous driving. We will release the source code to inspire further investigations in the community.
https://arxiv.org/abs/2504.12709
The development of artificial intelligence towards real-time interaction with the environment is a key aspect of embodied intelligence and robotics. Inverse dynamics is a fundamental robotics problem, which maps from joint space to torque space of robotic systems. Traditional methods for solving it rely on direct physical modeling of robots, which is difficult or even impossible due to nonlinearity and external disturbance. Recently, data-based model-learning algorithms have been adopted to address this issue. However, they often require manual parameter tuning and high computational costs. Neuromorphic computing is inherently suitable to process spatiotemporal features in robot motion control at extremely low costs. However, current research is still in its infancy: existing works control only low-degree-of-freedom systems and lack performance quantification and comparison. In this paper, we propose a neuromorphic control framework to control 7-degree-of-freedom robotic manipulators. We use a Spiking Neural Network to leverage the spatiotemporal continuity of the motion data to improve control accuracy, and eliminate manual parameter tuning. We validated the algorithm on two robotic platforms, reducing torque prediction error by at least 60% and successfully performing a target position tracking task. This work advances embodied neuromorphic control one step forward, from proof of concept to applications in complex real-world tasks.
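As background, the leaky integrate-and-fire (LIF) neuron underlying most spiking networks can be simulated in a few lines; the time constant, threshold, and input current here are illustrative and unrelated to the paper's architecture.

```python
# Minimal leaky integrate-and-fire neuron: the membrane potential leaks
# toward the input current and emits a spike on each threshold crossing.
def lif_spikes(current, tau=20e-3, v_th=1.0, v_reset=0.0, dt=1e-3):
    """Simulate one LIF neuron on an input current trace; returns spike indices."""
    v, spikes = 0.0, []
    for i, inp in enumerate(current):
        v += dt / tau * (-v + inp)   # leaky membrane integration
        if v >= v_th:                # threshold crossing -> spike, then reset
            spikes.append(i)
            v = v_reset
    return spikes
```

A constant supra-threshold input produces a regular spike train whose rate encodes the input magnitude, which is how such networks carry continuous motion signals.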
https://arxiv.org/abs/2504.12702
Arabidopsis is a widely used model plant to gain basic knowledge on plant physiology and development. Live imaging is an important technique to visualize and quantify elemental processes in plant development. To uncover novel theories underlying plant growth and cell division, accurate cell tracking on live imaging is of utmost importance. The commonly used cell tracking software, TrackMate, adopts a tracking-by-detection fashion, which applies Laplacian of Gaussian (LoG) for blob detection and a Linear Assignment Problem (LAP) tracker for tracking. However, these do not perform well when cells are densely arranged. To alleviate the problems mentioned above, we propose an accurate tracking method based on a genetic algorithm (GA) using knowledge of Arabidopsis root cellular patterns and spatial relationships among volumes. Our method can be described as a coarse-to-fine method, in which we first conduct relatively easy line-level tracking of cell nuclei, then perform more complex nuclear tracking based on the known linear arrangement of cell files and the spatial relationships between nuclei. Our method has been evaluated on a long-time live imaging dataset of Arabidopsis root tips, and with minor manual rectification, it accurately tracks nuclei. To the best of our knowledge, this research represents the first successful attempt to address a long-standing problem in the field of time-lapse microscopy in the root meristem by proposing an accurate tracking method for Arabidopsis root nuclei.
https://arxiv.org/abs/2504.12676
This paper presents a novel autonomous drone-based smoke plume tracking system capable of navigating and tracking plumes in highly unsteady atmospheric conditions. The system integrates advanced hardware and software and a comprehensive simulation environment to ensure robust performance in controlled and real-world settings. The quadrotor, equipped with a high-resolution imaging system and an advanced onboard computing unit, performs precise maneuvers while accurately detecting and tracking dynamic smoke plumes under fluctuating conditions. Our software implements a two-phase flight operation, i.e., descending into the smoke plume upon detection and continuously monitoring the smoke movement during in-plume tracking. Leveraging Proportional-Integral-Derivative (PID) control and a Proximal Policy Optimization-based Deep Reinforcement Learning (DRL) controller enables adaptation to plume dynamics. Unreal Engine simulation evaluates performance under various smoke-wind scenarios, from steady flow to complex, unsteady fluctuations, showing that while the PID controller performs adequately in simpler scenarios, the DRL-based controller excels in more challenging environments. Field tests corroborate these findings. This system opens new possibilities for drone-based monitoring in areas like wildfire management and air quality assessment. The successful integration of DRL for real-time decision-making advances autonomous drone control for dynamic environments.
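The PID half of the comparison is standard; a minimal sketch follows, with gains and plant chosen for illustration rather than taken from the drone system.

```python
# Minimal PID controller. Gains, time step, and the toy plant used in the
# usage example are illustrative, not the paper's flight configuration.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, None

    def step(self, setpoint, measurement):
        err = setpoint - measurement
        self.integral += err * self.dt                     # accumulate I term
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv
```

Driving a toy single-integrator plant (x' = u) with this controller converges to the setpoint, which is the kind of baseline the DRL controller is compared against.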
https://arxiv.org/abs/2504.12664
Provenance is the chronology of things, resonating with the fundamental pursuit to uncover origins, trace connections, and situate entities within the flow of space and time. As artificial intelligence advances towards autonomous agents capable of interactive collaboration on complex tasks, the provenance of generated content becomes entangled in the interplay of collective creation, where contributions are continuously revised, extended or overwritten. In a multi-agent generative chain, content undergoes successive transformations, often leaving little, if any, trace of prior contributions. In this study, we investigate the problem of tracking multi-agent provenance across the temporal dimension of generation. We propose a chronological system for post hoc attribution of generative history from content alone, without reliance on internal memory states or external meta-information. At its core lies the notion of symbolic chronicles, representing signed and time-stamped records, in a form analogous to the chain of custody in forensic science. The system operates through a feedback loop, whereby each generative timestep updates the chronicle of prior interactions and synchronises it with the synthetic content in the very act of generation. This research seeks to develop an accountable form of collaborative artificial intelligence within evolving cyber ecosystems.
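One plausible, hypothetical realization of a signed, time-stamped chronicle is a hash chain with per-record signatures, sketched below; the HMAC key stands in for a real signature scheme, and none of the field names come from the paper.

```python
# Toy "symbolic chronicle": each generative step appends a signed,
# time-stamped record whose payload also covers the previous record's hash,
# giving a chain-of-custody-style history. Illustrative only.
import hashlib, hmac, json, time

def append_record(chronicle, agent_id, content, key=b"demo-key"):
    prev_hash = chronicle[-1]["hash"] if chronicle else "genesis"
    body = {"agent": agent_id, "content": content,
            "timestamp": time.time(), "prev": prev_hash}
    payload = json.dumps(body, sort_keys=True).encode()
    body["hash"] = hashlib.sha256(payload).hexdigest()
    body["sig"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    chronicle.append(body)
    return chronicle

def verify(chronicle, key=b"demo-key"):
    """Recompute every hash and signature; any tampering breaks the chain."""
    prev = "genesis"
    for rec in chronicle:
        body = {k: rec[k] for k in ("agent", "content", "timestamp", "prev")}
        payload = json.dumps(body, sort_keys=True).encode()
        if rec["prev"] != prev or rec["hash"] != hashlib.sha256(payload).hexdigest():
            return False
        expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
        if not hmac.compare_digest(rec["sig"], expected):
            return False
        prev = rec["hash"]
    return True
```

Editing any earlier record invalidates every later hash, which is the property that makes post hoc attribution auditable.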
https://arxiv.org/abs/2504.12612
We propose a framework enabling mobile manipulators to reliably complete pick-and-place tasks for assembling structures from construction blocks. The picking uses an eye-in-hand visual servoing controller for object tracking with Control Barrier Functions (CBFs) to ensure fiducial markers in the blocks remain visible. An additional robot with an eye-to-hand setup ensures precise placement, critical for structural stability. We integrate human-in-the-loop capabilities for flexibility and fault correction and analyze robustness to camera pose errors, proposing adapted barrier functions to handle them. Lastly, experiments validate the framework on 6-DoF mobile arms.
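The CBF mechanism can be illustrated in one dimension: a safety filter minimally modifies a nominal command so a barrier function stays nonnegative. The single-integrator model and the barrier h(x) = x_max - x below are invented for the example; the paper's barriers instead keep fiducial markers inside the camera's field of view.

```python
# 1-D CBF safety filter for xdot = u with barrier h(x) = x_max - x.
# Enforcing hdot + alpha*h >= 0 reduces to the bound u <= alpha*(x_max - x).
def cbf_filter(x, u_nom, x_max=1.0, alpha=5.0):
    """Return the nominal command, clipped so the barrier never goes negative."""
    return min(u_nom, alpha * (x_max - x))

def simulate(x0=0.0, u_nom=2.0, dt=0.01, steps=500):
    """Drive the system with a constant nominal command through the filter."""
    x = x0
    for _ in range(steps):
        x += cbf_filter(x, u_nom) * dt
    return x
```

The state approaches the boundary x_max asymptotically but never crosses it, regardless of the nominal command.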
https://arxiv.org/abs/2504.12506
Soft robots are increasingly used in healthcare, especially for assistive care, due to their inherent safety and adaptability. Controlling soft robots is challenging due to their nonlinear dynamics and the presence of time delays, especially in applications like a soft robotic arm for patient care. This paper presents a learning-based approach to approximate the nonlinear state predictor (Smith Predictor), aiming to improve tracking performance in a two-module soft robot arm with a short inherent input delay. The method uses Kernel Recursive Least Squares Tracker (KRLST) for online learning of the system dynamics and a Legendre Delay Network (LDN) to compress past input history for efficient delay compensation. Experimental results demonstrate significant improvement in tracking performance compared to a baseline model-based non-linear controller. Statistical analysis confirms the significance of the improvements. The method is computationally efficient and adaptable online, making it suitable for real-world scenarios and highlighting its potential for enabling safer and more accurate control of soft robots in assistive care applications.
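The classical structure being approximated is the Smith predictor; a minimal discrete sketch with a known, linear first-order model and input delay is below. The paper instead learns the predictor online with KRLST and compresses the input history with an LDN, so the coefficients here are purely illustrative.

```python
# Minimal discrete Smith predictor for a first-order plant y+ = a*y + b*u
# with a known input delay. The controller acts on the *undelayed* model
# output, corrected by the measured mismatch. Illustrative coefficients.
from collections import deque

def smith_control(setpoint, y_meas, state, a=0.9, b=0.1, kp=2.0):
    """One control step; `state` = (undelayed model, delayed model, input buffer)."""
    y_model, y_model_delayed, buf = state
    # Predictor: undelayed model plus the correction from measured mismatch.
    y_pred = y_model + (y_meas - y_model_delayed)
    u = kp * (setpoint - y_pred)
    y_model = a * y_model + b * u          # undelayed model update
    buf.append(u)
    u_delayed = buf.popleft()              # input from `len(buf)` steps ago
    y_model_delayed = a * y_model_delayed + b * u_delayed
    return u, (y_model, y_model_delayed, buf)
```

With a perfect model, the proportional loop behaves as if the delay were absent (here it settles at the usual proportional-only offset, 2/3 of the setpoint for these gains).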
https://arxiv.org/abs/2504.12428
We introduce an approach for detecting and tracking detailed 3D poses of multiple people from a single monocular camera stream. Our system maintains temporally coherent predictions in crowded scenes filled with difficult poses and occlusions. Our model performs both strong per-frame detection and a learned pose update to track people from frame to frame. Rather than match detections across time, poses are updated directly from a new input image, which enables online tracking through occlusion. We train on numerous image and video datasets leveraging pseudo-labeled annotations to produce a model that matches state-of-the-art systems in 3D pose estimation accuracy while being faster and more accurate in tracking multiple people through time. Code and weights are provided at this https URL
https://arxiv.org/abs/2504.12186
Movie Audio Description (AD) aims to narrate visual content during dialogue-free segments, particularly benefiting blind and visually impaired (BVI) audiences. Compared with general video captioning, AD demands plot-relevant narration with explicit character name references, posing unique challenges in movie AD. To identify active main characters and focus on storyline-relevant regions, we propose FocusedAD, a novel framework that delivers character-centric movie audio descriptions. It includes: (i) a Character Perception Module (CPM) for tracking character regions and linking them to names; (ii) a Dynamic Prior Module (DPM) that injects contextual cues from prior ADs and subtitles via learnable soft prompts; and (iii) a Focused Caption Module (FCM) that generates narrations enriched with plot-relevant details and named characters. To overcome limitations in character identification, we also introduce an automated pipeline for building character query banks. FocusedAD achieves state-of-the-art performance on multiple benchmarks, including strong zero-shot results on MAD-eval-Named and our newly proposed Cinepile-AD dataset. Code and data will be released at this https URL.
https://arxiv.org/abs/2504.12157
Unmanned Aerial Vehicles (UAVs) are indispensable for infrastructure inspection, surveillance, and related tasks, yet they also introduce critical security challenges. This survey provides a wide-ranging examination of the anti-UAV domain, centering on three core objectives (classification, detection, and tracking) while detailing emerging methodologies such as diffusion-based data synthesis, multi-modal fusion, vision-language modeling, self-supervised learning, and reinforcement learning. We systematically evaluate state-of-the-art solutions across both single-modality and multi-sensor pipelines (spanning RGB, infrared, audio, radar, and RF) and discuss large-scale as well as adversarially oriented benchmarks. Our analysis reveals persistent gaps in real-time performance, stealth detection, and swarm-based scenarios, underscoring pressing needs for robust, adaptive anti-UAV systems. By highlighting open research directions, we aim to foster innovation and guide the development of next-generation defense strategies in an era marked by the extensive use of UAVs.
https://arxiv.org/abs/2504.11967
Stories are a fundamental aspect of human experience. Engaging deeply with stories and spotting plot holes -- inconsistencies in a storyline that break the internal logic or rules of a story's world -- requires nuanced reasoning skills, including tracking entities and events and their interplay, abstract thinking, pragmatic narrative understanding, commonsense and social reasoning, and theory of mind. As Large Language Models (LLMs) increasingly generate, interpret, and modify text, rigorously assessing their narrative consistency and deeper language understanding becomes critical. However, existing benchmarks focus mainly on surface-level comprehension. In this work, we propose plot hole detection in stories as a proxy to evaluate language understanding and reasoning in LLMs. We introduce FlawedFictionsMaker, a novel algorithm to controllably and carefully synthesize plot holes in human-written stories. Using this algorithm, we construct FlawedFictions, a benchmark for evaluating LLMs' plot hole detection abilities in stories, which is robust to contamination, with human filtering ensuring high quality. We find that state-of-the-art LLMs struggle in accurately solving FlawedFictions regardless of the reasoning effort allowed, with performance significantly degrading as story length increases. Finally, we show that LLM-based story summarization and story generation are prone to introducing plot holes, with more than 50% and 100% increases in plot hole detection rates with respect to human-written originals.
https://arxiv.org/abs/2504.11900
The World Health Organization estimates that road traffic crashes cost approximately 518 billion dollars globally each year, amounting to roughly 3% of gross domestic product for most countries. Most fatal road accidents in urban areas involve Vulnerable Road Users (VRUs). Smart-city environments offer innovative approaches to combating such accidents, drawing on cutting-edge technologies that include advanced sensors, extensive datasets, Machine Learning (ML) models, communication systems, and edge computing. This paper proposes a strategy and an implementation of a system for road monitoring and safety for smart cities, based on Computer Vision (CV) and edge computing. Promising results were obtained by implementing vision algorithms and tracking using surveillance cameras that are part of a Smart City testbed, the Aveiro Tech City Living Lab (ATCLL). The algorithm accurately detects and tracks cars, pedestrians, and bicycles, while predicting the road state and the distance between moving objects, and inferring on collision events to prevent collisions, in near real-time.
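The collision-inference step can be sketched as a closest-approach test on tracked positions; the time step, horizon, and safety radius below are illustrative, not the deployed system's parameters.

```python
# Toy collision inference: estimate velocities from two frames of tracked
# positions, then flag object pairs whose predicted separation drops below
# a safety radius within a look-ahead horizon. All thresholds illustrative.
import numpy as np

def collision_risk(p_prev, p_curr, dt=0.1, horizon=3.0, radius=2.0):
    """Return index pairs (i, j) predicted to come within `radius` meters."""
    v = (p_curr - p_prev) / dt          # finite-difference velocity estimate
    risky = []
    n = len(p_curr)
    for i in range(n):
        for j in range(i + 1, n):
            dp, dv = p_curr[i] - p_curr[j], v[i] - v[j]
            denom = float(dv @ dv)
            # Time of closest approach, clamped into [0, horizon].
            t = 0.0 if denom < 1e-9 else float(np.clip(-(dp @ dv) / denom,
                                                       0.0, horizon))
            if np.linalg.norm(dp + t * dv) < radius:
                risky.append((i, j))
    return risky
```

Two tracks heading toward each other are flagged; a distant stationary object is not.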
https://arxiv.org/abs/2504.11662
Wheat accounts for approximately 20% of the world's caloric intake, making it a vital component of global food security. Given this importance, mapping wheat fields plays a crucial role in enabling various stakeholders, including policy makers, researchers, and agricultural organizations, to make informed decisions regarding food security, supply chain management, and resource allocation. In this paper, we tackle the problem of accurately mapping wheat fields out of satellite images by introducing an improved pipeline for winter wheat segmentation, as well as presenting a case study on a decade-long analysis of wheat mapping in Lebanon. We integrate a Temporal Spatial Vision Transformer (TSViT) with Parameter-Efficient Fine Tuning (PEFT) and a novel post-processing pipeline based on the Fields of The World (FTW) framework. Our proposed pipeline addresses key challenges encountered in existing approaches, such as the clustering of small agricultural parcels in a single large field. By merging wheat segmentation with precise field boundary extraction, our method produces geometrically coherent and semantically rich maps that enable us to perform in-depth analysis such as tracking crop rotation pattern over years. Extensive evaluations demonstrate improved boundary delineation and field-level precision, establishing the potential of the proposed framework in operational agricultural monitoring and historical trend analysis. By allowing for accurate mapping of wheat fields, this work lays the foundation for a range of critical studies and future advances, including crop monitoring and yield estimation.
https://arxiv.org/abs/2504.11366