This article presents a comparative analysis of mobile robot trajectories computed by various ROS-based SLAM systems. To this end, we developed a prototype of a mobile robot equipped with common sensors: a 2D lidar, a monocular camera, and a ZED stereo camera. We then conducted experiments in a typical office environment, collected data from all sensors, and ran every tested SLAM system on the acquired dataset. We studied the following SLAM systems: (a) 2D lidar-based: GMapping, Hector SLAM, Cartographer; (b) monocular camera-based: Large Scale Direct monocular SLAM (LSD SLAM), ORB SLAM, Direct Sparse Odometry (DSO); and (c) stereo camera-based: ZEDfu, Real-Time Appearance-Based Mapping (RTAB map), ORB SLAM, Stereo Parallel Tracking and Mapping (S-PTAM). Since all SLAM methods were tested on the same dataset, we compared their results using appropriate metrics, obtaining encouraging results for the lidar-based Cartographer, monocular ORB SLAM, and stereo RTAB Map methods.
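The abstract does not name the comparison metrics; a standard choice for comparing an estimated trajectory against ground truth is the absolute trajectory error (ATE). A minimal pure-Python sketch on hypothetical toy trajectories (not the paper's data):

```python
import math

def ate_rmse(estimated, reference):
    """Root-mean-square absolute trajectory error between two
    time-aligned 2D trajectories given as lists of (x, y) poses."""
    assert len(estimated) == len(reference)
    sq_errors = [
        (ex - rx) ** 2 + (ey - ry) ** 2
        for (ex, ey), (rx, ry) in zip(estimated, reference)
    ]
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Hypothetical ground-truth and SLAM-estimated trajectories (metres).
ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
estimate     = [(0.0, 0.1), (1.0, -0.1), (2.1, 0.0), (3.0, 0.1)]

print(f"ATE RMSE: {ate_rmse(estimate, ground_truth):.3f} m")
```

In practice the estimated trajectory is first rigidly aligned to the reference (e.g. with the Umeyama method) before the residuals are computed; that step is omitted here for brevity.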
https://arxiv.org/abs/2501.09490
In soccer video analysis, player detection is essential for identifying key events and reconstructing tactical positions. The presence of numerous players and frequent occlusions, combined with copyright restrictions, severely limits the availability of datasets, leaving few options such as SoccerNet-Tracking and SportsMOT. These datasets suffer from a lack of diversity, which hinders algorithms from adapting effectively to varied soccer video contexts. To address these challenges, we developed SoccerSynth-Detection, the first synthetic dataset designed for soccer player detection. It includes a broad range of random lighting and textures, as well as simulated camera motion blur. We validated its efficacy using an object detection model (YOLOv8n) against real-world datasets (SoccerNet-Tracking and SportsMOT). In transfer tests, it matched the performance of real datasets and significantly outperformed them on images with motion blur; in pre-training tests, it demonstrated its efficacy as a pre-training dataset, significantly enhancing the algorithm's overall performance. Our work demonstrates the potential of synthetic datasets to replace real datasets for algorithm training in the field of soccer video analysis.
https://arxiv.org/abs/2501.09281
In this paper, we present an optimization-based framework for generating estimation-aware trajectories in scenarios where measurement (output) uncertainties are state-dependent and set-valued. The framework leverages the concept of regularity for set-valued output maps. Specifically, we demonstrate that, for output-regular maps, one can utilize a set-valued observability measure that is concave with respect to finite-horizon state trajectories. By maximizing this measure, optimized estimation-aware trajectories can be designed for a broad class of systems, including those with locally linearized dynamics. To illustrate the effectiveness of the proposed approach, we provide a representative example in the context of trajectory planning for vision-based estimation. We present an estimation-aware trajectory for an uncooperative target-tracking problem that uses a machine learning (ML)-based estimation module on an ego-satellite.
https://arxiv.org/abs/2501.09192
Understanding how misleading and outright false information enters news ecosystems remains a difficult challenge that requires tracking how narratives spread across thousands of fringe and mainstream news websites. To do this, we introduce a system that utilizes encoder-based large language models and zero-shot stance detection to scalably identify and track news narratives and their attitudes across over 4,000 factually unreliable, mixed-reliability, and factually reliable English-language news websites. Running our system over an 18-month period, we track the spread of 146K news stories. Using network-based inference via the NETINF algorithm, we show that the paths of news narratives and the stances of websites toward particular entities can be used to uncover slanted propaganda networks (e.g., anti-vaccine and anti-Ukraine) and to identify the most influential websites in spreading these attitudes in the broader news ecosystem. We hope that increased visibility into our distributed news ecosystem can help with the reporting and fact-checking of propaganda and disinformation.
https://arxiv.org/abs/2501.09102
Targeting the notorious cumulative drift errors in NeRF SLAM, we propose a Semantic-guided Loop Closure with Shared Latent Code, dubbed SLC$^2$-SLAM. In particular, we argue that the latent codes stored in many NeRF SLAM systems are not fully exploited, as they are only used for better reconstruction. In this paper, we propose a simple yet effective way to detect potential loops using the same latent codes as local features. To further improve loop detection performance, we use semantic information, which is also decoded from the same latent codes, to guide the aggregation of local features. Finally, with the potential loops detected, we close them with graph optimization followed by bundle adjustment to refine both the estimated poses and the reconstructed scene. To evaluate the performance of our SLC$^2$-SLAM, we conduct extensive experiments on the Replica and ScanNet datasets. Our proposed semantic-guided loop closure significantly outperforms pre-trained NetVLAD and ORB combined with Bag-of-Words, which are used in all other NeRF SLAM systems with loop closure. As a result, our SLC$^2$-SLAM also demonstrates better tracking and reconstruction performance, especially in larger scenes with more loops, such as ScanNet.
https://arxiv.org/abs/2501.08880
This paper summarizes in depth the state of the art of aerial swarms, covering both classical and new reinforcement-learning-based approaches for their management. Then, it proposes a hybrid AI system, integrating deep reinforcement learning in a multi-agent centralized swarm architecture. The proposed system is tailored to perform surveillance of a specific area, searching and tracking ground targets, for security and law enforcement applications. The swarm is governed by a central swarm controller responsible for distributing different search and tracking tasks among the cooperating UAVs. Each UAV agent is then controlled by a collection of cooperative sub-agents, whose behaviors have been trained using different deep reinforcement learning models, tailored for the different task types proposed by the swarm controller. More specifically, proximal policy optimization (PPO) algorithms were used to train the agents' behavior. In addition, several metrics to assess the performance of the swarm in this application were defined. The results obtained through simulation show that our system searches the operation area effectively, acquires the targets in a reasonable time, and is capable of tracking them continuously and consistently.
https://arxiv.org/abs/2501.08655
Localization within a known environment is a crucial capability for mobile robots. Simultaneous Localization and Mapping (SLAM) is a prominent solution to this problem. SLAM is a framework that consists of a diverse set of computational tasks ranging from real-time tracking to computation-intensive map optimization. This combination can present a challenge for resource-limited mobile robots. Previously, edge-assisted SLAM methods have demonstrated promising real-time execution capabilities by offloading heavy computations while performing real-time tracking onboard. However, the common approach of utilizing a client-server architecture for offloading is sensitive to server and network failures. In this article, we propose a novel edge-assisted SLAM framework capable of self-organizing fully distributed SLAM execution across a network of devices, or of functioning on a single device without connectivity. The architecture consists of three layers and is designed to be device-agnostic, resilient to network failures, and minimally invasive to the core SLAM system. We have implemented and demonstrated the framework for monocular ORB SLAM3 and evaluated it, in both fully distributed and standalone SLAM configurations, against the unmodified ORB SLAM3. The experiment results demonstrate that the proposed design matches the accuracy and resource utilization of the monolithic approach while enabling collaborative execution.
https://arxiv.org/abs/2501.08629
Deep learning (DL) systems present unique challenges in software engineering, especially concerning quality attributes like correctness and resource efficiency. While DL models achieve exceptional performance in specific tasks, engineering DL-based systems is still essential. The effort, cost, and potential diminishing returns of continual improvements must be carefully evaluated, as software engineers often face the critical decision of when to stop refining a system relative to its quality attributes. This experience paper explores the role of MLOps practices -- such as monitoring and experiment tracking -- in creating transparent and reproducible experimentation environments that enable teams to assess and justify the impact of design decisions on quality attributes. Furthermore, we report on experiences addressing the quality challenges by embedding domain knowledge into the design of a DL model and its integration within a larger system. The findings offer actionable insights into not only the benefits of domain knowledge and MLOps but also the strategic consideration of when to limit further optimizations in DL projects to maximize overall system quality and reliability.
https://arxiv.org/abs/2501.08402
The audio-visual benefit in speech perception, where congruent visual input enhances auditory processing, is well documented across age groups, particularly in challenging listening conditions and among individuals with varying hearing abilities. However, most studies rely on highly controlled laboratory environments with scripted stimuli. Here, we examine the audio-visual benefit using unscripted, natural speech from untrained speakers within a virtual acoustic environment. Using electroencephalography (EEG) and cortical speech tracking, we assessed neural responses across audio-visual, audio-only, visual-only, and masked-lip conditions to isolate the role of lip movements. Additionally, we analysed individual differences in the acoustic and visual features of the speakers, including pitch, jitter, and lip openness, to explore their influence on the audio-visual speech tracking benefit. Results showed a significant audio-visual enhancement in speech tracking with background noise, with the masked-lip condition performing similarly to the audio-only condition, emphasizing the importance of lip movements in adverse listening situations. Our findings reveal the feasibility of cortical speech tracking with naturalistic stimuli and underscore the impact of individual speaker characteristics on audio-visual integration in real-world listening contexts.
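Cortical speech tracking is often quantified by correlating the speech amplitude envelope with the EEG signal across a range of time lags; the abstract does not specify the estimator used, so the following is only an illustrative sketch on synthetic signals:

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def tracking_by_lag(envelope, eeg, max_lag):
    """Correlate the speech envelope with the EEG shifted by 0..max_lag
    samples; the lag with the peak correlation approximates the neural
    response delay."""
    return {
        lag: pearson(envelope[: len(envelope) - lag], eeg[lag:])
        for lag in range(max_lag + 1)
    }

# Synthetic demo: the "EEG" is the envelope delayed by 3 samples plus noise.
random.seed(0)
env = [random.gauss(0, 1) for _ in range(500)]
eeg = [0.0] * 3 + env[:-3]
eeg = [v + random.gauss(0, 0.3) for v in eeg]

scores = tracking_by_lag(env, eeg, max_lag=6)
best_lag = max(scores, key=scores.get)
print(best_lag)  # expected: 3
```

Real analyses typically use multivariate temporal response functions rather than a single lagged correlation, but the underlying idea of measuring envelope-to-EEG coupling is the same.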
https://arxiv.org/abs/2501.08124
Robotic systems are increasingly employed for industrial automation, with contact-rich tasks like polishing requiring dexterity and compliant behaviour. These tasks are difficult to model, making classical control challenging. Deep reinforcement learning (RL) offers a promising solution by enabling the learning of models and control policies directly from data. However, its application to real-world problems is limited by data inefficiency and unsafe exploration. Adaptive hybrid RL methods blend classical control and RL adaptively, combining the strengths of both: structure from control and learning from RL. This has led to improvements in data efficiency and exploration safety. However, their potential for hardware applications remains underexplored, with no evaluations on physical systems to date. Such evaluations are critical to fully assess the practicality and effectiveness of these methods in real-world settings. This work presents an experimental demonstration of the hybrid RL algorithm CHEQ for robotic polishing with variable impedance, a task requiring precise force and velocity tracking. In simulation, we show that variable impedance enhances polishing performance. We compare standalone RL with adaptive hybrid RL, demonstrating that CHEQ achieves effective learning while adhering to safety constraints. On hardware, CHEQ achieves effective polishing behaviour, requiring only eight hours of training and incurring just five failures. These results highlight the potential of adaptive hybrid RL for real-world, contact-rich tasks trained directly on hardware.
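The abstract does not spell out how CHEQ blends the two controllers; adaptive hybrid RL methods of this family typically form a convex combination of the classical control action and the RL action, with a weight driven by the agent's uncertainty. A toy sketch, where the uncertainty-to-weight mapping is our simplified stand-in rather than CHEQ's actual scheme:

```python
def blend_action(u_control, u_rl, uncertainty, u_min=0.1, u_max=1.0):
    """Hybrid action as a convex combination of a classical controller
    and an RL policy. The RL weight grows as the critic's uncertainty
    shrinks (high uncertainty -> trust the controller). The linear
    mapping below is a simplified stand-in for CHEQ's actual scheme."""
    # Clamp uncertainty to [u_min, u_max], then invert it to get the RL weight.
    u = min(max(uncertainty, u_min), u_max)
    w_rl = 1.0 - (u - u_min) / (u_max - u_min)
    blended = [w_rl * a + (1.0 - w_rl) * b for a, b in zip(u_rl, u_control)]
    return blended, w_rl

# Early in training: uncertain critic -> mostly the classical controller.
action, w = blend_action([0.5, 0.0], [0.9, 0.2], uncertainty=0.9)
print(w)  # low RL weight

# Late in training: confident critic -> mostly the RL policy.
action, w = blend_action([0.5, 0.0], [0.9, 0.2], uncertainty=0.15)
print(w)  # high RL weight
```

Because the controller dominates whenever the learner is uncertain, exploration stays close to safe behaviour, which is what makes training directly on hardware feasible.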
https://arxiv.org/abs/2501.07985
Facial expression captioning has found widespread application across various domains. Recently, the emergence of video Multimodal Large Language Models (MLLMs) has shown promise in general video understanding tasks. However, describing facial expressions within videos poses two major challenges for these models: (1) the lack of adequate datasets and benchmarks, and (2) the limited visual token capacity of video MLLMs. To address these issues, this paper introduces a new instruction-following dataset tailored for dynamic facial expression captioning. The dataset comprises 5,033 manually annotated high-quality video clips, containing over 700,000 tokens. Its purpose is to improve the capability of video MLLMs to discern subtle facial nuances. Furthermore, we propose FaceTrack-MM, which leverages a limited number of tokens to encode the main character's face. This model demonstrates superior performance in tracking faces and focusing on the facial expressions of the main characters, even in intricate multi-person scenarios. Additionally, we introduce a novel evaluation metric combining event extraction, relation classification, and the longest common subsequence (LCS) algorithm to assess the content consistency and temporal sequence consistency of generated text. Moreover, we present FEC-Bench, a benchmark designed to assess the performance of existing video MLLMs in this specific task. All data and source code will be made publicly available.
https://arxiv.org/abs/2501.07978
A significant limitation of current smartphone-based eye-tracking algorithms is their low accuracy when applied to video-type visual stimuli, as they are typically trained on static images. Moreover, the increasing demand for real-time interactive applications like games, VR, and AR on smartphones requires overcoming resource constraints such as limited computational power, battery life, and network bandwidth. Therefore, we developed two new smartphone eye-tracking techniques for video-type visuals by combining Convolutional Neural Networks (CNN) with two different Recurrent Neural Networks (RNN), namely Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). Our CNN+LSTM and CNN+GRU models achieved average Root Mean Square Errors of 0.955 cm and 1.091 cm, respectively. To address the computational constraints of smartphones, we developed an edge intelligence architecture to enhance the performance of smartphone-based eye tracking. We applied optimisation methods such as quantisation and pruning to the deep learning models for better energy, CPU, and memory usage on edge devices, focusing on real-time processing. Using model quantisation, the inference time of the CNN+LSTM and CNN+GRU models was reduced by 21.72% and 19.50%, respectively, on edge devices.
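As a rough illustration of what post-training quantisation does to model weights, here is a toy per-tensor symmetric int8 scheme in pure Python; real mobile frameworks add zero points, per-channel scales, and calibration, so this is a simplification rather than the paper's pipeline:

```python
def quantize_int8(weights):
    """Toy symmetric int8 post-training quantisation: map floats onto
    [-127, 127] with a single per-tensor scale, then dequantise to see
    the rounding error. Real frameworks (e.g. TFLite, PyTorch) use
    calibrated, often per-channel, schemes."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0
    q = [round(w / scale) for w in weights]          # stored as int8
    deq = [v * scale for v in q]                     # values seen at inference
    return q, deq, scale

w = [0.5, -1.27, 0.02, 1.0]
q, deq, s = quantize_int8(w)
print(q)  # [50, -127, 2, 100]
```

The speed-up on edge devices comes from replacing float multiplications with int8 arithmetic; the cost is the small rounding error visible in `deq`.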
https://arxiv.org/abs/2408.12463
Video editing models have advanced significantly, but evaluating their performance remains challenging. Traditional metrics, such as CLIP text and image scores, often fall short: text scores are limited by inadequate training data and hierarchical dependencies, while image scores fail to assess temporal consistency. We present SST-EM (Semantic, Spatial, and Temporal Evaluation Metric), a novel evaluation framework that leverages modern Vision-Language Models (VLMs), Object Detection, and Temporal Consistency checks. SST-EM comprises four components: (1) semantic extraction from frames using a VLM, (2) primary object tracking with Object Detection, (3) focused object refinement via an LLM agent, and (4) temporal consistency assessment using a Vision Transformer (ViT). These components are integrated into a unified metric with weights derived from human evaluations and regression analysis. The name SST-EM reflects its focus on Semantic, Spatial, and Temporal aspects of video evaluation. SST-EM provides a comprehensive evaluation of semantic fidelity and temporal smoothness in video editing. The source code is available in the \textbf{\href{this https URL}{GitHub Repository}}.
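The temporal component of SST-EM relies on the classic LCS dynamic program; a minimal sketch, where the event names and the normalisation by reference length are our illustrative assumptions:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def temporal_consistency(reference_events, generated_events):
    """Score in [0, 1]: how much of the reference event order the
    generated caption preserves (normalising by reference length is
    our choice for this sketch)."""
    if not reference_events:
        return 1.0
    return lcs_length(reference_events, generated_events) / len(reference_events)

# Hypothetical event sequences extracted from the reference and edited videos.
ref = ["cut", "zoom", "pan", "fade"]
gen = ["cut", "pan", "fade", "zoom"]
print(temporal_consistency(ref, gen))  # 3/4 = 0.75
```

Because LCS rewards preserved ordering rather than exact alignment, a caption that describes the right events slightly out of place still scores partial credit.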
https://arxiv.org/abs/2501.07554
Agent-based program repair offers to automatically resolve complex bugs end-to-end by combining the planning, tool use, and code generation abilities of modern LLMs. Recent work has explored the use of agent-based repair approaches on the popular open-source SWE-Bench, a collection of bugs from highly-rated GitHub Python projects. In addition, various agentic approaches such as SWE-Agent have been proposed to solve bugs in this benchmark. This paper explores the viability of using an agentic approach to address bugs in an enterprise context. To investigate this, we curate an evaluation set of 178 bugs drawn from Google's issue tracking system. This dataset spans both human-reported (78) and machine-reported bugs (100). To establish a repair performance baseline on this benchmark, we implement Passerine, an agent similar in spirit to SWE-Agent that can work within Google's development environment. We show that with 20 trajectory samples and Gemini 1.5 Pro, Passerine can produce a patch that passes bug tests (i.e., plausible) for 73% of machine-reported and 25.6% of human-reported bugs in our evaluation set. After manual examination, we found that 43% of machine-reported bugs and 17.9% of human-reported bugs have at least one patch that is semantically equivalent to the ground-truth patch. These results establish a baseline on an industrially relevant benchmark, which as we show, contains bugs drawn from a different distribution -- in terms of language diversity, size, and spread of changes, etc. -- compared to those in the popular SWE-Bench dataset.
https://arxiv.org/abs/2501.07531
Timber represents an increasingly valuable and versatile resource. However, forestry operations such as harvesting, handling, and measuring logs still require substantial human labor in remote environments, posing significant safety risks. Progressively automating these tasks has the potential to increase both their efficiency and their safety, but requires accurate detection of individual logs as well as live trees and their context. Although initial approaches have been proposed for this challenging application domain, specialized data and algorithms are still too scarce to develop robust solutions. To mitigate this gap, we introduce the TimberVision dataset, consisting of more than 2k annotated RGB images containing a total of 51k trunk components, including cut and lateral surfaces, thereby surpassing any existing dataset in this domain in terms of both quantity and detail by a large margin. Based on this data, we conduct a series of ablation experiments for oriented object detection and instance segmentation and evaluate the influence of multiple scene parameters on model performance. We introduce a generic framework to fuse the components detected by our models for both tasks into unified trunk representations. Furthermore, we automatically derive geometric properties and apply multi-object tracking to further enhance robustness. Our detection and tracking approach provides highly descriptive and accurate trunk representations solely from RGB image data, even under challenging environmental conditions. Our solution is suitable for a wide range of application scenarios and can readily be combined with other sensor modalities.
https://arxiv.org/abs/2501.07360
Roadside billboards and other forms of outdoor advertising play a crucial role in marketing initiatives; however, they can also distract drivers, potentially contributing to accidents. This study delves into the significance of roadside advertising in images captured from a driver's perspective. Firstly, it evaluates the effectiveness of neural networks in detecting advertising along roads, focusing on the YOLOv5 and Faster R-CNN models. Secondly, the study addresses the determination of billboard significance using methods for saliency extraction. The UniSal and SpectralResidual methods were employed to create saliency maps for each image. The study establishes a database of eye tracking sessions captured during city highway driving to assess the saliency models.
https://arxiv.org/abs/2501.07342
We present GazeGrasp, a gaze-based manipulation system enabling individuals with motor impairments to control collaborative robots using eye-gaze. The system employs an ESP32 CAM for eye tracking, MediaPipe for gaze detection, and YOLOv8 for object localization, integrated with a Universal Robot UR10 for manipulation tasks. After user-specific calibration, the system allows intuitive object selection with a magnetic snapping effect and robot control via eye gestures. Experimental evaluation involving 13 participants demonstrated that the magnetic snapping effect significantly reduced gaze alignment time, improving task efficiency by 31%. GazeGrasp provides a robust, hands-free interface for assistive robotics, enhancing accessibility and autonomy for users.
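The described magnetic snapping effect can be sketched as pulling a noisy gaze point onto the nearest detected object centre within some radius; the coordinates and radius below are hypothetical, not the system's calibrated values:

```python
import math

def snap_gaze(gaze, object_centers, radius):
    """'Magnetic snapping': if the (noisy) gaze point falls within
    `radius` of a detected object's centre, snap it to the nearest
    such centre; otherwise return the raw gaze point. Units and the
    radius value are illustrative."""
    best, best_d = None, radius
    for cx, cy in object_centers:
        d = math.hypot(gaze[0] - cx, gaze[1] - cy)
        if d <= best_d:
            best, best_d = (cx, cy), d
    return best if best is not None else gaze

objects = [(100, 120), (300, 90)]          # e.g. YOLOv8 detection centres (px)
print(snap_gaze((110, 128), objects, 30))  # snaps to (100, 120)
print(snap_gaze((200, 200), objects, 30))  # too far from any object: raw gaze kept
```

Snapping of this kind absorbs eye-tracker jitter near targets, which is consistent with the reported reduction in gaze alignment time.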
https://arxiv.org/abs/2501.07255
This paper is devoted to the detection of objects on a road, performed with a combination of two methods based on both the use of depth information and video analysis of data from a stereo camera. Since neither the time at which an object appears on the road, nor its size and shape, is known in advance, ML/DL-based approaches are not applicable. The task is further complicated by variations in artificial illumination, inhomogeneous road surface texture, and the unknown character and features of the object. To solve this problem, we developed a depth and image fusion method that complements a search for small contrast objects by an RGB-based method, and obstacle detection by a stereo image-based approach with SLIC superpixel segmentation. We conducted experiments with static and low-speed obstacles in an underground parking lot and demonstrated that the developed technique successfully detects and even tracks small objects, such as parking infrastructure objects, things left on the road, wheels, and dropped boxes.
https://arxiv.org/abs/2501.07245
3D single object tracking (3DSOT) in LiDAR point clouds is a critical task for outdoor perception, enabling real-time perception of object location, orientation, and motion. Despite the impressive performance of current 3DSOT methods, evaluating them on clean datasets inadequately reflects their comprehensive performance, as the adverse weather conditions of real-world surroundings have not been considered. One of the main obstacles is the lack of adverse weather benchmarks for the evaluation of 3DSOT. To this end, this work proposes a challenging benchmark for LiDAR-based 3DSOT in adverse weather, which comprises two synthetic datasets (KITTI-A and nuScenes-A) and one real-world dataset (CADC-SOT) spanning three weather types: rain, fog, and snow. On this benchmark, we conducted a robustness evaluation of five representative 3D trackers from different tracking frameworks, observing significant performance degradation. This prompts the question: what factors cause current advanced methods to fail on such adverse weather samples? Consequently, we explore the impacts of adverse weather and answer the above question from three perspectives: 1) target distance; 2) template shape corruption; and 3) target shape corruption. Finally, based on domain randomization and contrastive learning, we designed a dual-branch tracking framework for adverse weather, named DRCT, which achieves excellent performance on the benchmarks.
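Domain randomization for adverse weather can be illustrated by randomly corrupting clean LiDAR points during training; the dropout and jitter parameters below are illustrative only and are not those used by DRCT:

```python
import random

def randomize_weather(points, drop_prob=0.2, sigma=0.02, seed=None):
    """Toy domain randomization for LiDAR point clouds: randomly drop
    points (occlusion/scattering by rain, fog, or snow) and jitter the
    survivors (range noise). Probabilities and noise scale are
    illustrative, not the values used by DRCT."""
    rng = random.Random(seed)
    out = []
    for x, y, z in points:
        if rng.random() < drop_prob:
            continue  # point lost to weather-induced scattering
        out.append((x + rng.gauss(0, sigma),
                    y + rng.gauss(0, sigma),
                    z + rng.gauss(0, sigma)))
    return out

# A hypothetical clean cloud of 1000 points along a line.
cloud = [(float(i), 0.0, 0.0) for i in range(1000)]
aug = randomize_weather(cloud, drop_prob=0.2, seed=42)
print(len(aug))  # roughly 800 points survive
```

Training on such randomized variants (here purely synthetic corruptions) encourages the tracker to learn weather-invariant features, which contrastive learning between clean and corrupted views then reinforces.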
https://arxiv.org/abs/2501.07133
This paper explores the integration of provenance tracking systems within the context of Semantic Web technologies to enhance data integrity in diverse operational environments. SURROUND Australia Pty Ltd demonstrates innovative applications of the PROV Data Model (PROV-DM) and its Semantic Web variant, PROV-O, to systematically record and manage provenance information across multiple data processing domains. By employing RDF and Knowledge Graphs, SURROUND addresses the critical challenges of shared entity identification and provenance granularity. The paper highlights the company's architecture for capturing comprehensive provenance data, enabling robust validation, traceability, and knowledge inference. Through the examination of two projects, we illustrate how provenance mechanisms not only improve data reliability but also facilitate seamless integration across heterogeneous systems. Our findings underscore the importance of sophisticated provenance solutions in maintaining data integrity, serving as a reference for industry peers and academics engaged in provenance research and implementation.
https://arxiv.org/abs/2501.09029