The rapid growth of social media has led to the widespread sharing of individual portrait images, which pose serious privacy risks due to the capabilities of automatic face recognition (AFR) systems for mass surveillance. Hence, protecting facial privacy against unauthorized AFR systems is essential. Inspired by the generation capability of the emerging diffusion models, recent methods employ diffusion models to generate adversarial face images for privacy protection. However, they suffer from the diffusion purification effect, leading to a low protection success rate (PSR). In this paper, we first propose learning unconditional embeddings to increase the learning capacity for adversarial modifications and then use them to guide the modification of the adversarial latent code to weaken the diffusion purification effect. Moreover, we integrate an identity-preserving structure to maintain structural consistency between the original and generated images, allowing human observers to recognize the generated image as having the same identity as the original. Extensive experiments conducted on two public datasets, i.e., CelebA-HQ and LADN, demonstrate the superiority of our approach. The protected faces generated by our method outperform those produced by existing facial privacy protection approaches in terms of transferability and natural appearance.
https://arxiv.org/abs/2503.10350
Foodborne gastrointestinal (GI) illness is a common cause of ill health in the UK. However, many cases do not interact with the healthcare system, posing significant challenges for traditional surveillance methods. The growth of publicly available online restaurant reviews and advancements in large language models (LLMs) present potential opportunities to extend disease surveillance by identifying public reports of GI illness. In this study, we introduce a novel annotation schema, developed with experts in GI illness, applied to the Yelp Open Dataset of reviews. Our annotations extend beyond binary disease detection, to include detailed extraction of information on symptoms and foods. We evaluate the performance of open-weight LLMs across these three tasks: GI illness detection, symptom extraction, and food extraction. We compare this performance to RoBERTa-based classification models fine-tuned specifically for these tasks. Our results show that using prompt-based approaches, LLMs achieve micro-F1 scores of over 90% for all three of our tasks. Using prompting alone, we achieve micro-F1 scores that exceed those of smaller fine-tuned models. We further demonstrate the robustness of LLMs in GI illness detection across three bias-focused experiments. Our results suggest that publicly available review text and LLMs offer substantial potential for public health surveillance of GI illness by enabling highly effective extraction of key information. While LLMs appear to exhibit minimal bias in processing, the inherent limitations of restaurant review data highlight the need for cautious interpretation of results.
https://arxiv.org/abs/2503.09743
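Micro-F1, the headline metric above, pools true positives, false positives, and false negatives across all examples before computing precision and recall, so frequent labels dominate the score. A minimal sketch of the computation; the `(gold, predicted)` label-set format is an illustrative assumption, not the paper's evaluation code:

```python
def micro_f1(pairs):
    """Micro-averaged F1 over (gold_labels, predicted_labels) set pairs.

    Counts are pooled across all examples before precision/recall,
    unlike macro-F1, which averages per-label scores.
    """
    tp = sum(len(gold & pred) for gold, pred in pairs)
    fp = sum(len(pred - gold) for gold, pred in pairs)
    fn = sum(len(gold - pred) for gold, pred in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

For example, one missed symptom and one spurious one against two exact matches pools to precision = recall = 2/3.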
Low-light enhancement has wide applications in autonomous driving, 3D reconstruction, remote sensing, surveillance, and so on, and can significantly improve information utilization. However, most existing methods lack generalization and are limited to specific tasks such as image recovery. To address these issues, we propose \textbf{Gated-Mechanism Mixture-of-Experts (GM-MoE)}, the first framework to introduce a mixture-of-experts network for low-light image enhancement. GM-MoE comprises a dynamic gated weight conditioning network and three sub-expert networks, each specializing in a distinct enhancement task. A self-designed gating mechanism dynamically adjusts the weights of the sub-expert networks for different data domains. Additionally, we integrate local and global feature fusion within the sub-expert networks to enhance image quality by capturing multi-scale features. Experimental results demonstrate that GM-MoE achieves superior generalization over 25 compared approaches, reaching state-of-the-art PSNR on 5 benchmarks and state-of-the-art SSIM on 4 benchmarks.
https://arxiv.org/abs/2503.07417
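The core mixture-of-experts computation can be sketched in a few lines: a gating network scores each expert for the current input, the scores are softmax-normalized, and the expert outputs are blended by those weights. This is a generic MoE sketch with toy `experts`/`gate` callables as assumptions, not GM-MoE's actual sub-networks or gating architecture:

```python
import math

def gated_moe(x, experts, gate):
    """Blend expert outputs by softmax-normalized gate scores.

    experts: list of callables, each mapping x to an output vector (list).
    gate: callable mapping x to one raw score (logit) per expert.
    """
    logits = gate(x)
    m = max(logits)                              # subtract max for stability
    exp_scores = [math.exp(l - m) for l in logits]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]    # softmax over experts
    outputs = [f(x) for f in experts]
    return [sum(weights[i] * out[j] for i, out in enumerate(outputs))
            for j in range(len(outputs[0]))]
```

With a strongly peaked gate, the blend approaches the favored expert's output; with a flat gate, it is a plain average, which is the sense in which the gate adapts the network per data domain.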
Despite the widespread adoption of vision sensors in edge applications, such as surveillance, the transmission of video data consumes substantial spectrum resources. Semantic communication (SC) offers a solution by extracting and compressing information at the semantic level, preserving the accuracy and relevance of transmitted data while significantly reducing the volume of transmitted information. However, traditional SC methods face inefficiencies due to the repeated transmission of static frames in edge videos, exacerbated by the absence of sensing capabilities, which results in spectrum inefficiency. To address this challenge, we propose an SC with computer vision sensing (SCCVS) framework for edge video transmission. The framework first introduces a compression ratio (CR) adaptive SC (CRSC) model, capable of adjusting the CR based on whether frames are static or dynamic, effectively conserving spectrum resources. Additionally, we implement an object detection and semantic segmentation model-enabled sensing (OSMS) scheme, which intelligently senses changes in the scene and assesses the significance of each frame through in-context analysis. Hence, the OSMS scheme provides CR prompts to the CRSC model based on real-time sensing results. Moreover, both CRSC and OSMS are designed as lightweight models, ensuring compatibility with the resource-constrained sensors commonly used in practical edge applications. Experimental simulations validate the effectiveness of the proposed SCCVS framework, demonstrating its ability to enhance transmission efficiency without sacrificing critical semantic information.
https://arxiv.org/abs/2503.07252
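The static-versus-dynamic decision that drives CR adaptation can be illustrated with a simple frame-difference test: if only a small fraction of pixels changed since the previous frame, the frame is treated as static and compressed aggressively. The thresholds and CR values below are illustrative placeholders, not parameters of the CRSC model:

```python
def select_compression_ratio(prev_frame, curr_frame,
                             pixel_thresh=10, change_thresh=0.02,
                             cr_static=0.05, cr_dynamic=0.5):
    """Pick a compression ratio from a crude static/dynamic frame test.

    Frames are flat lists of grayscale pixel values. A frame counts as
    dynamic when the fraction of pixels that moved by more than
    pixel_thresh exceeds change_thresh; static frames get the smaller
    (more aggressive) ratio, conserving spectrum.
    """
    changed = sum(1 for a, b in zip(prev_frame, curr_frame)
                  if abs(a - b) > pixel_thresh)
    is_dynamic = changed / len(curr_frame) > change_thresh
    return cr_dynamic if is_dynamic else cr_static
```

In the actual framework this decision is informed by the OSMS scheme's detection and segmentation outputs rather than raw pixel differences; the sketch only shows where the adaptive ratio plugs in.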
As the security of public spaces remains a critical issue in today's world, Digital Twin technologies have emerged in recent years as a promising solution for detecting and predicting potential future threats. The applied methodology leverages a Digital Twin of a metro station in Athens, Greece, using the FlexSim simulation software. The model encompasses points of interest and passenger flows, and sets their corresponding parameters. These elements influence and allow the model to provide reasonable predictions on the security management of the station under various scenarios. Experimental tests are conducted with different configurations of surveillance cameras and optimizations of camera angles to evaluate the effectiveness of the space surveillance setup. The results show that the strategic positioning of surveillance cameras and the adjustment of their angles significantly improve the detection of suspicious behaviors, and that the Digital Twin makes it possible to evaluate different scenarios and find the optimal camera setup for each case. In summary, this study highlights the value of Digital Twins in real-time simulation and data-driven security management. The proposed approach contributes to the ongoing development of smart security solutions for public spaces and provides an innovative framework for threat detection and prevention.
https://arxiv.org/abs/2503.06996
Dense visual prediction tasks, such as detection and segmentation, are crucial for time-critical applications (e.g., autonomous driving and video surveillance). While deep models achieve strong performance, their efficiency remains a challenge. Knowledge distillation (KD) is an effective model compression technique, but existing feature-based KD methods rely on static, teacher-driven feature selection, failing to adapt to the student's evolving learning state or leverage dynamic student-teacher interactions. To address these limitations, we propose Adaptive student-teacher Cooperative Attention Masking for Knowledge Distillation (ACAM-KD), which introduces two key components: (1) Student-Teacher Cross-Attention Feature Fusion (STCA-FF), which adaptively integrates features from both models for a more interactive distillation process, and (2) Adaptive Spatial-Channel Masking (ASCM), which dynamically generates importance masks to enhance both spatial and channel-wise feature selection. Unlike conventional KD methods, ACAM-KD adapts to the student's evolving needs throughout the entire distillation process. Extensive experiments on multiple benchmarks validate its effectiveness. For instance, on COCO2017, ACAM-KD improves object detection performance by up to 1.4 mAP over the state-of-the-art when distilling a ResNet-50 student from a ResNet-101 teacher. For semantic segmentation on Cityscapes, it boosts mIoU by 3.09 over the baseline with DeepLabV3-MobileNetV2 as the student model.
https://arxiv.org/abs/2503.06307
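A hand-rolled stand-in for spatial and channel importance masks: score each channel by its mean absolute activation, score each spatial location by its cross-channel mean, and softmax-normalize both. ACAM-KD's actual masks are generated adaptively from student-teacher cross-attention; this fixed activation-magnitude version only illustrates the shape of the idea:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def importance_masks(feat):
    """Compute (channel_mask, spatial_mask) from a C x H x W feature map.

    feat is nested lists; channels with larger mean |activation| get
    larger channel weights, and likewise for spatial locations. Both
    masks are softmax-normalized, so each sums to 1.
    """
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    ch_scores = [sum(abs(v) for row in ch for v in row) / (H * W) for ch in feat]
    sp_scores = [[sum(abs(feat[c][i][j]) for c in range(C)) / C
                  for j in range(W)] for i in range(H)]
    flat = softmax([v for row in sp_scores for v in row])
    spatial = [flat[i * W:(i + 1) * W] for i in range(H)]
    return softmax(ch_scores), spatial
```

In feature-based KD, masks like these would weight the per-location and per-channel distillation loss so the student focuses on the most informative regions of the teacher's features.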
Although advances in deep learning and aerial surveillance technology are improving wildlife conservation efforts, complex and erratic environmental conditions still pose a problem, requiring innovative solutions for cost-effective small animal detection. This work introduces DEAL-YOLO, a novel approach that improves small object detection in Unmanned Aerial Vehicle (UAV) images by using multi-objective loss functions like Wise IoU (WIoU) and Normalized Wasserstein Distance (NWD), which prioritize pixels near the centre of the bounding box, ensuring smoother localization and reducing abrupt deviations. Additionally, the model is optimized through efficient feature extraction with Linear Deformable (LD) convolutions, enhancing accuracy while maintaining computational efficiency. The Scaled Sequence Feature Fusion (SSFF) module enhances object detection by effectively capturing inter-scale relationships, improving feature representation, and boosting metrics through optimized multiscale fusion. Comparison with baseline models reveals high efficacy with up to 69.5\% fewer parameters compared to vanilla Yolov8-N, highlighting the robustness of the proposed modifications. Through this approach, our paper aims to facilitate the detection of endangered species, animal population analysis, habitat monitoring, biodiversity research, and various other applications that enrich wildlife conservation efforts. DEAL-YOLO employs a two-stage inference paradigm for object detection, refining selected regions to improve localization and confidence. This approach enhances performance, especially for small instances with low objectness scores.
https://arxiv.org/abs/2503.04698
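The NWD term above models each bounding box as a 2-D Gaussian and maps the Wasserstein-2 distance between the Gaussians to a similarity in (0, 1]; unlike IoU, it stays informative for tiny boxes that barely overlap. A minimal sketch of the standard formulation; the normalizing constant `c` is dataset-dependent in the original work, and the value used here is illustrative:

```python
import math

def wasserstein2(box_a, box_b):
    """W2 distance between Gaussians fitted to (cx, cy, w, h) boxes.

    Each box N([cx, cy], diag(w^2/4, h^2/4)) reduces the W2 distance to
    the Euclidean distance between (cx, cy, w/2, h/2) vectors.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    return math.sqrt((cxa - cxb) ** 2 + (cya - cyb) ** 2
                     + ((wa - wb) / 2) ** 2 + ((ha - hb) / 2) ** 2)

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance: exp(-W2/c), in (0, 1]."""
    return math.exp(-wasserstein2(box_a, box_b) / c)
```

Identical boxes score exactly 1, and the score decays smoothly with center offset and size mismatch instead of collapsing to 0 the moment overlap vanishes, which is what makes it a useful loss signal for small UAV targets.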
Parks play a crucial role in enhancing the quality of life by providing recreational spaces and environmental benefits. Understanding the patterns of park usage, including the number of visitors and their activities, is essential for effective security measures, infrastructure maintenance, and resource allocation. Traditional methods rely on single-entry sensors that count total visits but fail to distinguish unique users, limiting their effectiveness due to manpower and cost constraints. With advancements in affordable video surveillance and networked processing, more comprehensive park usage analysis is now feasible. This study proposes a multi-agent system leveraging low-cost cameras in a distributed network to track and analyze unique users. As a case study, we deployed this system at the Jack A. Markell (JAM) Trail in Wilmington, Delaware, and Hall Trail in Newark, Delaware. The system captures video data, autonomously processes it using existing algorithms, and extracts user attributes such as speed, direction, activity type, clothing color, and gender. These attributes are shared across cameras to construct movement trails and accurately count unique visitors. Our approach was validated through comparison with manual human counts and simulated scenarios under various conditions. The results demonstrate a 72% success rate in identifying unique users, setting a benchmark in automated park activity monitoring. Despite challenges such as camera placement and environmental factors, our findings suggest that this system offers a scalable, cost-effective solution for real-time park usage analysis and visitor behavior tracking.
https://arxiv.org/abs/2503.07651
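Cross-camera matching on shared attributes can be sketched as a weighted agreement score between two attribute profiles: sightings whose score clears a threshold are treated as the same visitor, which is how movement trails get stitched together. The attribute names, weights, and threshold below are illustrative assumptions, not the deployed system's parameters:

```python
def attribute_match(profile_a, profile_b, weights=None, threshold=0.7):
    """Score agreement between two attribute dicts and decide a match.

    Profiles map attribute names (e.g. 'gender', 'clothing_color',
    'activity') to categorical values. The score is the weighted
    fraction of shared attributes on which the sightings agree.
    Returns (score, is_match).
    """
    keys = set(profile_a) & set(profile_b)
    if not keys:
        return 0.0, False
    weights = weights or {k: 1.0 for k in keys}
    total = sum(weights.get(k, 1.0) for k in keys)
    agree = sum(weights.get(k, 1.0) for k in keys
                if profile_a[k] == profile_b[k])
    score = agree / total
    return score, score >= threshold
```

A real deployment would also weight by time-of-observation and camera adjacency (a visitor cannot teleport between trail segments), but the thresholded agreement score captures the core of attribute-based re-identification.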
Video anomaly detection (VAD) is crucial for video analysis and surveillance in computer vision. However, existing VAD models rely on learned normal patterns, which makes them difficult to apply to diverse environments. Consequently, users must retrain models or develop separate AI models for new environments, which requires expertise in machine learning, high-performance hardware, and extensive data collection, limiting the practical usability of VAD. To address these challenges, this study proposes a customizable video anomaly detection (C-VAD) technique and the AnyAnomaly model. C-VAD treats user-defined text as an abnormal event and detects frames containing the specified event in a video. We effectively implemented AnyAnomaly using context-aware visual question answering without fine-tuning the large vision language model. To validate the effectiveness of the proposed model, we constructed C-VAD datasets and demonstrated the superiority of AnyAnomaly. Furthermore, our approach showed competitive performance on VAD benchmark datasets, achieving state-of-the-art results on the UBnormal dataset and outperforming other methods in generalization across all datasets. Our code is available online at this http URL.
https://arxiv.org/abs/2503.04504
In intelligent transportation systems (ITSs), incorporating pedestrians and vehicles in-the-loop is crucial for developing realistic and safe traffic management solutions. However, existing work falls short of simulating complex real-world ITS scenarios, primarily due to the lack of a digital twin implementation framework for characterizing interactions between pedestrians and vehicles at different locations in different traffic environments. In this article, we propose a surveillance video assisted federated digital twin (SV-FDT) framework to empower ITSs with pedestrians and vehicles in-the-loop. Specifically, SV-FDT builds comprehensive pedestrian-vehicle interaction models by leveraging multi-source traffic surveillance videos. Its architecture consists of three layers: (i) the end layer, which collects traffic surveillance videos from multiple sources; (ii) the edge layer, responsible for semantic segmentation-based visual understanding, twin agent-based interaction modeling, and local digital twin system (LDTS) creation in local regions; and (iii) the cloud layer, which integrates LDTSs across different regions to construct a global DT model in real time. We analyze key design requirements and challenges and present core guidelines for SV-FDT's system implementation. A testbed evaluation demonstrates its effectiveness in optimizing traffic management. Comparisons with traditional terminal-server frameworks highlight SV-FDT's advantages in mirroring delays, recognition accuracy, and subjective evaluation. Finally, we identify some open challenges and discuss future research directions.
https://arxiv.org/abs/2503.04170
This paper presents the Spinning Blimp, a novel lighter-than-air (LTA) aerial vehicle designed for low-energy stable flight. Utilizing an oblate spheroid helium balloon for buoyancy, the vehicle achieves minimal energy consumption while maintaining prolonged airborne states. The unique and low-cost design employs a passively arranged wing coupled with a propeller to induce a spinning behavior, providing inherent pendulum-like stabilization. We propose a control strategy that takes advantage of the continuous revolving nature of the spinning blimp to control translational motion. The cost-effectiveness of the vehicle makes it highly suitable for a variety of applications, such as patrolling, localization, air and turbulence monitoring, and domestic surveillance. Experimental evaluations affirm the design's efficacy and underscore its potential as a versatile and economically viable solution for aerial applications.
https://arxiv.org/abs/2503.04112
Multi-Agent Reinforcement Learning (MARL) has shown promise in solving complex problems involving cooperation and competition among agents, such as an Unmanned Surface Vehicle (USV) swarm used in search and rescue, surveillance, and vessel protection. However, aligning system behavior with user preferences is challenging due to the difficulty of encoding expert intuition into reward functions. To address the issue, we propose a Reinforcement Learning with Human Feedback (RLHF) approach for MARL that resolves credit-assignment challenges through an Agent-Level Feedback system categorizing feedback into intra-agent, inter-agent, and intra-team types. To overcome the challenges of direct human feedback, we employ a Large Language Model (LLM) evaluator to validate our approach using feedback scenarios such as region constraints, collision avoidance, and task allocation. Our method effectively refines USV swarm policies, addressing key challenges in multi-agent systems while maintaining fairness and performance consistency.
https://arxiv.org/abs/2503.03796
Testing aerial robots in tasks such as pickup-and-delivery and surveillance significantly benefits from high energy efficiency and scalability of the deployed robotic system. This paper presents MochiSwarm, an open-source testbed of light-weight robotic blimps, ready for multi-robot operation without external localization. We introduce the system design in hardware, software, and perception, which capitalizes on modularity, low cost, and light weight. The hardware allows for rapid modification, which enables the integration of additional sensors to enhance autonomy for different scenarios. The software framework supports different actuation models and communication between the base station and multiple blimps. The detachable perception module allows independent blimps to perform tasks that involve detection and autonomous actuation. We showcase a differential-drive module as an example, whose autonomy is enabled by visual servoing using the perception module. A case study of pickup-and-delivery tasks with up to 12 blimps highlights the autonomy of the MochiSwarm without external infrastructure.
https://arxiv.org/abs/2503.03077
Rapid urbanization and increasing vehicular congestion have posed significant challenges to traffic management and safety. This study explores the transformative potential of artificial intelligence (AI) and machine vision technologies in revolutionizing traffic systems. By leveraging advanced surveillance cameras and deep learning algorithms, this research proposes a system for real-time detection of vehicles, traffic anomalies, and driver behaviors. The system integrates geospatial and weather data to adapt dynamically to environmental conditions, ensuring robust performance in diverse scenarios. Using YOLOv8 and YOLOv11 models, the study achieves high accuracy in vehicle detection and anomaly recognition, optimizing traffic flow and enhancing road safety. These findings contribute to the development of intelligent traffic management solutions and align with the vision of creating smart cities with sustainable and efficient urban infrastructure.
https://arxiv.org/abs/2503.02967
Textual data from social platforms captures various aspects of mental health through discussions around and across issues, while users reach out for help and others sympathize and offer support. We propose a comprehensive framework that leverages Natural Language Processing (NLP) and Generative AI techniques to identify and assess mental health disorders, detect their severity, and create recommendations for behavior change and therapeutic interventions based on users' posts on Reddit. To classify the disorders, we use rule-based labeling methods as well as advanced pre-trained NLP models to extract nuanced semantic features from the data. We fine-tune domain-adapted and generic pre-trained NLP models based on predictions from specialized Large Language Models (LLMs) to improve classification accuracy. Our hybrid approach combines the generalization capabilities of pre-trained models with the domain-specific insights captured by LLMs, providing an improved understanding of mental health discourse. Our findings highlight the strengths and limitations of each model, offering valuable insights into their practical applicability. This research can facilitate early detection and personalized care, aid practitioners in delivering timely interventions, and improve overall well-being, thereby contributing to the broader field of mental health surveillance and digital health analytics.
https://arxiv.org/abs/2503.01442
Anticipating future events is crucial for various application domains such as healthcare, smart home technology, and surveillance. Narrative event descriptions provide context-rich information, enhancing a system's future planning and decision-making capabilities. We propose a novel task: $\textit{long-term future narration generation}$, which extends beyond traditional action anticipation by generating detailed narrations of future daily activities. We introduce a visual-language model, ViNa, specifically designed to address this challenging task. ViNa integrates long-term videos and corresponding narrations to generate a sequence of future narrations that predict subsequent events and actions over extended time horizons. ViNa extends existing multimodal models that perform only short-term predictions or describe observed videos by generating long-term future narrations for a broader range of daily activities. We also present a novel downstream application that leverages the generated narrations called future video retrieval to help users improve planning for a task by visualizing the future. We evaluate future narration generation on the largest egocentric dataset Ego4D.
https://arxiv.org/abs/2503.01416
Image restoration under adverse weather conditions refers to the process of removing degradation caused by weather particles while improving visual quality. Most existing deweathering methods rely on increasing the network scale and data volume to achieve better performance, which requires more expensive computing power, and many lack generalization to specific applications. In traffic surveillance scenarios, the main challenges are snow removal and veil effect elimination. In this paper, we propose a wavelet-enhanced snow removal method that uses a Dual-Tree Complex Wavelet Transform feature enhancement module and a dynamic convolution acceleration module to address snow degradation in surveillance images. We also use a residual learning restoration module to remove veil effects caused by rain, snow, and fog. The proposed architecture extracts and analyzes information from snow-covered regions, significantly improving snow removal performance, while the residual learning restoration module removes veiling effects in images, enhancing clarity and detail. Experiments show that our method performs better than several popular desnowing methods and demonstrates effectiveness and accuracy when applied to real traffic surveillance images.
https://arxiv.org/abs/2503.01339
In the current information era, customer analytics play a key role in the success of any business. Since customer demographics primarily dictate their preferences, identifying and utilizing the age and gender information of customers in sales forecasting may maximize retail sales. In this work, we propose a computer vision based approach to age and gender prediction in surveillance video. The proposed approach leverages the effectiveness of Wide Residual Networks and Xception deep learning models to predict the age and gender demographics of consumers. It is designed to work with raw video captured by a typical CCTV video surveillance system. Its effectiveness is evaluated on real-life garment store surveillance video captured by a low-resolution camera under non-uniform illumination, with occlusions due to crowding and environmental noise. In addition to demographics, the system can also detect customer facial expressions during purchase, which can be utilized to devise effective marketing strategies for the customer base and maximize sales.
https://arxiv.org/abs/2503.00453
The analysis of sales information is a vital step in designing an effective marketing strategy. This work proposes a novel approach to analyse the shopping behaviour of customers to identify their purchase patterns. An extended version of the Multi-Cluster Overlapping k-Means Extension (MCOKE) algorithm, combined with a weighted k-Means algorithm, is utilized to map customers to the garments of interest. The age and gender traits of the customer, the time spent, and the expressions exhibited while selecting garments for purchase are utilized to associate a customer or a group of customers with the garments they are interested in. Such a study of the customer base of a retail business may help in inferring the products of interest to their consumers and enable them to develop effective business strategies, thus ensuring customer satisfaction, loyalty, and increased sales and profits.
https://arxiv.org/abs/2503.00452
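Weighted k-Means differs from the vanilla algorithm only in the centroid update: each point pulls its centroid in proportion to its weight (here, plausibly the time a customer spent at a garment). A minimal 2-D sketch of that algorithm; the random initialization and toy data are illustrative, and this does not reproduce MCOKE's overlapping-cluster extension:

```python
import random

def weighted_kmeans(points, weights, k, iters=50, seed=0):
    """Lloyd's algorithm with weighted centroid updates on 2-D points.

    points: list of (x, y) tuples; weights: one positive weight per point.
    Returns the list of k centroids.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # initialize from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p, w in zip(points, weights):    # assign to nearest centroid
            j = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2
                                            + (p[1] - centers[i][1]) ** 2)
            clusters[j].append((p, w))
        for i, members in enumerate(clusters):
            if members:                      # weighted mean, not plain mean
                tw = sum(w for _, w in members)
                centers[i] = (sum(p[0] * w for p, w in members) / tw,
                              sum(p[1] * w for p, w in members) / tw)
    return centers
```

With uniform weights this reduces exactly to standard k-Means; skewing a weight drags the centroid toward that point, which is how dwell-time can emphasize the garments a customer lingered over.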
Clothing is one of the basic human needs, and this approach aims to identify the garments selected by a customer during shopping from surveillance video. Existing approaches to garment detection were developed on western wear using datasets of western clothing; they do not address Indian garments, which are considerably more complex. In this work, we propose a computer vision based framework to address this problem through video surveillance. The proposed framework uses the Mixture of Gaussians background subtraction algorithm to identify the foreground present in a video frame. The visual information present in this foreground is analysed using computer vision techniques such as image segmentation to detect the various garments the customer is interested in. The framework was tested on a dataset that comprises CCTV videos from a garments store. When presented with raw surveillance footage, the proposed framework demonstrated its effectiveness in detecting the garments customers are interested in, achieving high precision and recall.
https://arxiv.org/abs/2503.00442
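The flavor of background subtraction: maintain a per-pixel Gaussian (mean, variance) of the background and flag pixels that deviate by more than k standard deviations as foreground, updating the model only on background pixels. The framework above uses a full Mixture of Gaussians (several Gaussians per pixel, as in OpenCV's MOG2); this single-Gaussian version with illustrative parameters is a simplified sketch:

```python
def update_background(bg_mean, bg_var, frame, alpha=0.05, k=2.5):
    """One step of a running per-pixel Gaussian background model.

    bg_mean/bg_var: per-pixel background statistics (mutated in place).
    frame: flat list of grayscale pixel values.
    Returns a foreground mask (1 = foreground, 0 = background).
    """
    fg = []
    for i, x in enumerate(frame):
        mean, var = bg_mean[i], bg_var[i]
        std = var ** 0.5
        if abs(x - mean) > k * std:
            fg.append(1)   # deviates from the background model: foreground
        else:
            fg.append(0)   # background: fold the pixel into the model
            bg_mean[i] = (1 - alpha) * mean + alpha * x
            bg_var[i] = (1 - alpha) * var + alpha * (x - mean) ** 2
    return fg
```

Foreground pixels are deliberately excluded from the update so a customer standing in front of a garment rack does not get absorbed into the background; the MoG variant additionally keeps multiple modes per pixel to handle flicker and repetitive motion.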