A key stumbling block in effective supply chain risk management for companies and policymakers is a lack of visibility into interdependent supply network relationships. Relationship prediction, also called link prediction, is an emerging area of supply chain surveillance research that aims to increase the visibility of supply chains using data-driven techniques. Existing methods have been successful at predicting relationships but struggle to extract the context in which these relationships are embedded, such as the products being supplied or the locations they are supplied from. This lack of context prevents practitioners from distinguishing transactional relations from established supply chain relations, hindering accurate estimations of risk. In this work, we develop a new Generative Artificial Intelligence (GenAI)-enhanced machine learning framework that leverages pre-trained language models as embedding models, combined with machine learning models, to predict supply chain relationships within knowledge graphs. By integrating GenAI techniques, our approach captures the nuanced semantic relationships between entities, thereby improving supply chain visibility and facilitating more precise risk management. Using data from a real case study, we show that GenAI-enhanced link prediction surpasses all benchmarks, and we demonstrate how GenAI models can be explored and used effectively in supply chain risk management.
https://arxiv.org/abs/2412.03390
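To make the embedding-plus-classifier idea concrete, here is a minimal sketch of link prediction over textual entity descriptions. The encoder checkpoint, the toy supplier/buyer strings, and the pairwise feature construction are illustrative assumptions, not the paper's actual pipeline.

```python
# Hedged sketch: LLM embeddings + a simple classifier for supply-link prediction.
# All data and the model choice below are hypothetical placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained embedding model

# Hypothetical entity descriptions carrying context (product, location).
suppliers = ["Acme Ltd: steel fasteners, Sheffield", "VoltWorks: wire harnesses, Gdansk"]
buyers = ["Bolt Co: automotive assemblies, Munich", "AeroFab: airframe panels, Toulouse"]
labels = [1, 0]  # 1 = known supply relationship between pair i

s_emb = encoder.encode(suppliers)             # (n, d) arrays
b_emb = encoder.encode(buyers)
X = np.hstack([s_emb, b_emb, s_emb * b_emb])  # simple pairwise edge features

clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X)[:, 1])             # P(link) for each candidate pair
```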
Mobile target tracking is crucial in various applications such as surveillance and autonomous navigation. This study presents a decentralized tracking framework utilizing a Consensus-Based Estimation Filter (CBEF) integrated with the Nearly-Constant-Velocity (NCV) model to predict a moving target's state. The framework enables agents in a network to collaboratively estimate the target's position by sharing local observations and reaching consensus despite communication constraints and measurement noise. A saturation-based filtering technique is employed to enhance robustness by mitigating the impact of noisy sensor data. Simulation results demonstrate that the proposed method effectively reduces the Mean Squared Estimation Error (MSEE) over time, indicating improved estimation accuracy and reliability. The findings underscore the effectiveness of the CBEF in decentralized environments, highlighting its scalability and resilience in the presence of uncertainties.
https://arxiv.org/abs/2412.03095
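A minimal sketch of one consensus-plus-saturation update with an NCV motion model follows; the gains, saturation bound, and fully connected network are assumptions for illustration, not the paper's exact CBEF.

```python
# Illustrative consensus filter step with an NCV model and innovation saturation.
import numpy as np

dt = 0.1
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]])  # NCV dynamics
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]])  # each agent measures position only

n_agents, K_gain, c_gain, sat = 4, 0.3, 0.5, 2.0        # assumed tuning
x_hat = [np.zeros(4) for _ in range(n_agents)]          # local state estimates
A = np.ones((n_agents, n_agents)) - np.eye(n_agents)    # assumed: fully connected

def step(z):  # z[i]: noisy position measurement at agent i
    global x_hat
    pred = [F @ x for x in x_hat]                        # NCV prediction
    new = []
    for i in range(n_agents):
        innov = np.clip(z[i] - H @ pred[i], -sat, sat)   # saturation vs. outlier data
        consensus = sum(A[i, j] * (pred[j] - pred[i]) for j in range(n_agents))
        new.append(pred[i] + K_gain * (H.T @ innov) + c_gain * consensus / n_agents)
    x_hat = new
```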
Understanding video content is pivotal for advancing real-world applications like activity recognition, autonomous systems, and human-computer interaction. While scene graphs are adept at capturing spatial relationships between objects in individual frames, extending these representations to capture dynamic interactions across video sequences remains a significant challenge. To address this, we present TCDSG, Temporally Consistent Dynamic Scene Graphs, an innovative end-to-end framework that detects, tracks, and links subject-object relationships across time, generating action tracklets: temporally consistent sequences of entities and their interactions. Our approach leverages a novel bipartite matching mechanism, enhanced by adaptive decoder queries and feedback loops, ensuring temporal coherence and robust tracking over extended sequences. This method not only establishes a new benchmark by achieving over 60% improvement in temporal recall@k on the Action Genome, OpenPVSG, and MEVA datasets but also pioneers the augmentation of MEVA with persistent object ID annotations for comprehensive tracklet generation. By seamlessly integrating spatial and temporal dynamics, our work sets a new standard in multi-frame video analysis, opening new avenues for high-impact applications in surveillance, autonomous navigation, and beyond.
https://arxiv.org/abs/2412.02808
3D Gaussian Splatting has advanced radiance field reconstruction, enabling high-quality view synthesis and fast rendering in 3D modeling. While adversarial attacks on object detection models are well-studied for 2D images, their impact on 3D models remains underexplored. This work introduces the Masked Iterative Fast Gradient Sign Method (M-IFGSM), designed to generate adversarial noise targeting the CLIP vision-language model. M-IFGSM specifically alters the object of interest by focusing perturbations on masked regions, degrading the performance of CLIP's zero-shot object detection capability when applied to 3D models. Using eight objects from the Common Objects 3D (CO3D) dataset, we demonstrate that our method effectively reduces the accuracy and confidence of the model, with adversarial noise that is nearly imperceptible to human observers. The model's top-1 accuracy on renders drops from 95.4% to 12.5% for train images and from 91.2% to 35.4% for test images, with confidence levels reflecting this shift from true classification to misclassification, underscoring the risks of adversarial attacks on 3D models in applications such as autonomous driving, robotics, and surveillance. The significance of this research lies in its potential to expose vulnerabilities in modern 3D vision models, including radiance fields, prompting the development of more robust defenses and security measures in critical real-world applications.
https://arxiv.org/abs/2412.02803
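The masked iterative step can be sketched as follows against a public CLIP checkpoint; the label prompts, step sizes (expressed in the preprocessed pixel space), and loss choice are assumptions standing in for the authors' implementation.

```python
# Hedged sketch of a masked iterative FGSM attack on CLIP zero-shot classification.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
labels = ["a photo of a teddy bear", "a photo of a toy truck"]  # hypothetical classes
text_inputs = proc(text=labels, return_tensors="pt", padding=True)

def m_ifgsm(pixel_values, mask, true_idx, eps=8 / 255, alpha=1 / 255, steps=10):
    """pixel_values: (1,3,H,W) preprocessed render; mask: (1,1,H,W) object region."""
    adv = pixel_values.clone()
    for _ in range(steps):
        adv.requires_grad_(True)
        logits = model(pixel_values=adv, **text_inputs).logits_per_image
        loss = logits[0, true_idx]                       # confidence in the true class
        grad, = torch.autograd.grad(loss, adv)
        adv = adv.detach() - alpha * grad.sign() * mask  # perturb only the masked object
        adv = pixel_values + (adv - pixel_values).clamp(-eps, eps)  # epsilon projection
    return adv.detach()
```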
Understanding and anticipating human movement has become more critical and challenging in diverse applications such as autonomous driving and surveillance. The complex interactions arising from different relations between agents are a crucial reason this task is challenging. Researchers have put much effort into designing rule-based and data-driven models to extract and validate the patterns between pedestrian trajectories and these interactions, yet the problem has not been adequately addressed. Inspired by how humans perceive social interactions at different levels of relation to themselves, this work proposes the GrouP ConCeption (GPCC) model, composed of the Group method, which categorizes nearby agents into either group members or non-group members based on a long-term distance kernel function, and the Conception module, which perceives both visual and acoustic information surrounding the target agent. Evaluated across multiple datasets, the GPCC model demonstrates significant improvements in trajectory prediction accuracy, validating its effectiveness in modeling both social and individual dynamics. The qualitative analysis also indicates that the GPCC framework leverages grouping and perception cues in a human-like, intuitive manner, supporting the proposed model's explainability in pedestrian trajectory forecasting.
https://arxiv.org/abs/2412.02395
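The abstract does not give the kernel's exact form, so the sketch below uses a Gaussian distance kernel averaged over a trajectory window as a hedged stand-in for the long-term grouping rule; `sigma` and `thresh` are assumed parameters.

```python
# Illustrative grouping rule via a long-term distance kernel (the paper's exact
# kernel may differ); traj arrays are (T, 2) position histories.
import numpy as np

def is_group_member(traj_target, traj_other, sigma=2.0, thresh=0.5):
    d = np.linalg.norm(traj_target - traj_other, axis=1)  # distance at each step
    w = np.exp(-d ** 2 / (2 * sigma ** 2))                # Gaussian distance kernel
    return w.mean() > thresh   # consistently close over time -> same group
```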
Since the emergence of Large Language Models (LLMs), the challenge of effectively leveraging their potential in healthcare has taken center stage. A critical barrier to using LLMs for extracting insights from unstructured clinical notes lies in the prompt engineering process. Despite its pivotal role in determining task performance, a clear framework for prompt optimization remains absent. Current methods to address this gap either take a manual prompt refinement approach, where domain experts collaborate with prompt engineers to create an optimal prompt, which is time-intensive and difficult to scale, or employ automatic prompt optimization approaches, where the value of domain experts' input is not fully realized. To address this, we propose StructEase, a novel framework that bridges the gap between automation and human expertise in prompt engineering. A core innovation of the framework is SamplEase, an iterative sampling algorithm that identifies high-value cases where expert feedback drives significant performance improvements. This targeted approach minimizes expert intervention, reduces labeling redundancy, mitigates human error, and enhances classification outcomes. We evaluated the performance of StructEase using a dataset of de-identified clinical narratives from the US National Electronic Injury Surveillance System (NEISS), demonstrating significant gains in classification performance compared to current methods. Our findings underscore the value of expert integration in LLM workflows, achieving notable improvements in F1 score while maintaining minimal expert effort. By combining transparency, flexibility, and scalability, StructEase sets the foundation for a framework to integrate expert input into LLM workflows in healthcare and beyond.
https://arxiv.org/abs/2412.02173
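The abstract does not specify SamplEase's selection criterion, so the sketch below uses prediction entropy as a hedged stand-in for identifying "high-value cases" to route to an expert; the budget parameter is assumed.

```python
# Hedged sketch: pick the cases an expert should review, by model uncertainty.
import numpy as np

def select_for_expert(probs, budget=10):
    """probs: (n_cases, n_classes) LLM classification probabilities.
    Returns the indices of the `budget` most uncertain cases."""
    p = np.clip(probs, 1e-9, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)  # high entropy = model is unsure
    return np.argsort(-entropy)[:budget]    # cases worth an expert's time
```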
Cloth-changing person re-identification (CC-ReID) aims to match individuals across multiple surveillance cameras despite variations in clothing. Existing methods typically focus on mitigating the effects of clothing changes or enhancing ID-relevant features but often struggle to capture complex semantic information. In this paper, we propose a novel prompt learning framework, Semantic Contextual Integration (SCI), for CC-ReID, which leverages the visual-text representation capabilities of CLIP to minimize the impact of clothing changes and enhance ID-relevant features. Specifically, we introduce a Semantic Separation Enhancement (SSE) module, which uses dual learnable text tokens to separately capture confounding and clothing-related semantic information, effectively isolating ID-relevant features from distracting clothing semantics. Additionally, we develop a Semantic-Guided Interaction Module (SIM) that uses orthogonalized text features to guide visual representations, sharpening the model's focus on distinctive ID characteristics. This integration enhances the model's discriminative power and enriches the visual context with high-dimensional semantic insights. Extensive experiments on three CC-ReID datasets demonstrate that our method outperforms state-of-the-art techniques. The code will be released on GitHub.
https://arxiv.org/abs/2412.01345
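The orthogonalization step can be illustrated with a single Gram-Schmidt projection: remove the clothing direction from the ID text feature, then score visual features against the result. The feature shapes, the guidance step, and the temperature are assumptions, not the SIM architecture itself.

```python
# Minimal sketch of orthogonalized text guidance (Gram-Schmidt projection).
import torch
import torch.nn.functional as F

def orthogonalize(id_feat, cloth_feat):
    """Remove the clothing-semantics direction from the ID text feature."""
    cloth = F.normalize(cloth_feat, dim=-1)
    proj = (id_feat * cloth).sum(-1, keepdim=True) * cloth
    return F.normalize(id_feat - proj, dim=-1)

def guide(visual_feat, id_text_feat, cloth_text_feat, tau=0.07):
    ortho = orthogonalize(id_text_feat, cloth_text_feat)
    return (F.normalize(visual_feat, dim=-1) @ ortho.T) / tau  # ID-focused logits
```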
Traffic Surveillance Systems (TSS) have become increasingly crucial in modern intelligent transportation systems, with vision-based technologies playing a central role in scene perception and understanding. While existing surveys typically focus on isolated aspects of TSS, a comprehensive analysis bridging low-level and high-level perception tasks, particularly considering emerging technologies, remains lacking. This paper presents a systematic review of vision-based technologies in TSS, examining both low-level perception tasks (object detection, classification, and tracking) and high-level perception applications (parameter estimation, anomaly detection, and behavior understanding). Specifically, we first provide a detailed methodological categorization and comprehensive performance evaluation for each task. Our investigation reveals five fundamental limitations in current TSS: perceptual data degradation in complex scenarios, data-driven learning constraints, semantic understanding gaps, sensing coverage limitations, and computational resource demands. To address these challenges, we systematically analyze five categories of potential solutions: advanced perception enhancement, efficient learning paradigms, knowledge-enhanced understanding, cooperative sensing frameworks, and efficient computing frameworks. Furthermore, we evaluate the transformative potential of foundation models in TSS, demonstrating their unique capabilities in zero-shot learning, semantic understanding, and scene generation. This review provides a unified framework bridging low-level and high-level perception tasks, systematically analyzes current limitations and solutions, and presents a structured roadmap for integrating emerging technologies, particularly foundation models, to enhance TSS capabilities.
https://arxiv.org/abs/2412.00348
This paper addresses the challenge of automated violence detection in video frames captured by surveillance cameras, specifically focusing on classifying scenes as "fight" or "non-fight." This task is critical for enhancing unmanned security systems, online content filtering, and related applications. We propose an approach using a 3D Convolutional Neural Network (3D CNN)-based model named X3D to tackle this problem. Our approach incorporates pre-processing steps such as tube extraction, volume cropping, and frame aggregation, combined with clustering techniques, to accurately localize and classify fight scenes. Extensive experimentation demonstrates the effectiveness of our method in distinguishing violent from non-violent events, providing valuable insights for advancing practical violence detection systems.
https://arxiv.org/abs/2412.02127
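A rough sketch of clip-level fight/non-fight inference with an X3D backbone follows; the binary head replacement, window length, stride, and mean aggregation are assumptions for illustration, not the authors' pre-processing pipeline (tube extraction and clustering are omitted).

```python
# Hedged sketch: sliding-window X3D inference aggregated into one fight score.
import torch

model = torch.hub.load("facebookresearch/pytorchvideo", "x3d_m", pretrained=True)
head = model.blocks[-1].proj                                  # Kinetics classifier head
model.blocks[-1].proj = torch.nn.Linear(head.in_features, 2)  # fight / non-fight
model.eval()

def classify_video(frames, win=16, stride=8):
    """frames: (3, T, H, W) normalized float tensor; aggregates clip scores."""
    votes = []
    for t in range(0, frames.shape[1] - win + 1, stride):
        clip = frames[:, t:t + win].unsqueeze(0)              # (1, 3, win, H, W)
        with torch.no_grad():
            votes.append(model(clip).softmax(-1)[0, 1])       # P(fight) for this clip
    return torch.stack(votes).mean()                          # frame-aggregated score
```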
Accurate detection and tracking of small objects such as pedestrians, cyclists, and motorbikes are critical for traffic surveillance systems, which play a central role in improving road safety and decision-making in intelligent transportation systems. However, traditional methods struggle with challenges such as occlusion, low resolution, and dynamic traffic conditions, necessitating innovative approaches to address these limitations. This paper introduces DGNN-YOLO, a novel framework integrating dynamic graph neural networks (DGNN) with YOLO11 to enhance small object detection and tracking in traffic surveillance systems. The framework leverages YOLO11's advanced spatial feature extraction capabilities for precise object detection and incorporates a DGNN to dynamically model spatial-temporal relationships for robust real-time tracking. By constructing and updating graph structures, DGNN-YOLO effectively represents objects as nodes and their interactions as edges, ensuring adaptive and accurate tracking in complex and dynamic environments. Extensive experiments demonstrate that DGNN-YOLO consistently outperforms state-of-the-art methods in detecting and tracking small objects under diverse traffic conditions, achieving the highest precision (0.8382), recall (0.6875), and mAP@0.5:0.95 (0.6476), showcasing its robustness and scalability, particularly in challenging scenarios involving small and occluded objects. This work provides a scalable, real-time traffic surveillance and analysis solution, significantly contributing to intelligent transportation systems.
https://arxiv.org/abs/2411.17251
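The nodes-and-edges construction can be illustrated in a few lines: detections become nodes and spatial proximity across consecutive frames becomes candidate edges. The radius threshold is an assumed parameter; the real DGNN-YOLO graph update is more involved.

```python
# Illustrative detection-graph construction (nodes = boxes, edges = proximity).
import numpy as np

def build_graph(centers_t, centers_t1, radius=50.0):
    """centers_t: (n, 2) box centers in frame t; centers_t1: (m, 2) in frame t+1."""
    d = np.linalg.norm(centers_t[:, None] - centers_t1[None, :], axis=-1)
    edges = np.argwhere(d < radius)  # candidate temporal associations
    return edges                     # rows of (node_in_frame_t, node_in_frame_t+1)
```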
The increasing capabilities of deep neural networks for re-identification, combined with the rise in public surveillance in recent years, pose a substantial threat to individual privacy. Event cameras were initially considered a promising solution since their output is sparse and therefore difficult for humans to interpret. However, recent advances in deep learning prove that neural networks are able to reconstruct high-quality grayscale images and re-identify individuals using data from event cameras. In our paper, we contribute a crucial ethical discussion on data privacy and present the first event anonymization pipeline to prevent re-identification not only by humans but also by neural networks. Our method effectively introduces learnable data-dependent noise to cover personally identifiable information in raw event data, reducing attackers' re-identification capabilities by up to 60%, while maintaining substantial information for performing downstream tasks. Moreover, our anonymization generalizes well to unseen data and is robust against image reconstruction and inversion attacks. Code: this https URL
https://arxiv.org/abs/2411.16440
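The core idea, learnable input-conditioned noise on raw events, can be sketched as below; the network shape, the event tuple layout, and the additive perturbation are assumptions, not the authors' pipeline, and the adversarial training against a re-ID network is only noted in a comment.

```python
# Hedged sketch of learnable, data-dependent noise for event anonymization.
import torch
import torch.nn as nn

class EventAnonymizer(nn.Module):
    def __init__(self, d=4):  # assumed event layout: (x, y, t, polarity)
        super().__init__()
        self.noise_net = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d))

    def forward(self, events):
        """events: (N, 4) raw event tensor; returns anonymized events."""
        noise = torch.tanh(self.noise_net(events))  # bounded, input-conditioned noise
        return events + noise  # would be trained adversarially against a re-ID net
```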
Detecting mixed-critical events through computer vision is challenging due to the need for contextual understanding to assess event criticality accurately. Mixed-critical events, such as fires of varying severity or traffic incidents, demand adaptable systems that can interpret context to trigger appropriate responses. This paper addresses these challenges by proposing a versatile detection system for smart city applications, offering a solution tested across traffic and fire detection scenarios. Our contributions include an analysis of detection requirements and the development of a system adaptable to diverse applications, advancing automated surveillance for smart cities.
https://arxiv.org/abs/2411.15773
Real-time object localization on edge devices is fundamental for numerous applications, ranging from surveillance to industrial automation. Traditional frameworks, such as object detection, segmentation, and keypoint detection, struggle in resource-constrained environments, often resulting in substantial target omissions. To address these challenges, we introduce OCDet, a lightweight Object Center Detection framework optimized for edge devices with NPUs. OCDet predicts heatmaps representing object center probabilities and extracts center points through peak identification. Unlike prior methods that use a fixed Gaussian distribution, we introduce Generalized Centerness (GC) to generate ground truth heatmaps from bounding box annotations, providing finer spatial details without additional manual labeling. Built on an NPU-friendly Semantic FPN with MobileNetV4 backbones, OCDet models are trained with our Balanced Continuous Focal Loss (BCFL), which alleviates data imbalance and focuses training on hard negative examples for probability regression tasks. Leveraging the novel Center Alignment Score (CAS) with Hungarian matching, we demonstrate that OCDet consistently outperforms YOLO11 in object center detection, achieving up to 23% higher CAS while requiring 42% fewer parameters, 34% less computation, and 64% lower NPU latency. When compared to keypoint detection frameworks, OCDet achieves substantial CAS improvements of up to 186% using identical models. By integrating GC, BCFL, and CAS, OCDet establishes a new paradigm for efficient and robust object center detection on edge devices with NPUs. The code is released at this https URL.
https://arxiv.org/abs/2411.15653
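The heatmap-from-box idea can be sketched with a centerness-style target in the spirit of GC; the exponents `alpha` and `beta` are assumptions (the paper defines its own generalized form), and `alpha = beta = 0.5` recovers the familiar FCOS centerness.

```python
# Hedged sketch: a generalized-centerness target heatmap from one bounding box.
import numpy as np

def gc_heatmap(h, w, box, alpha=0.5, beta=0.5):
    """box = (x1, y1, x2, y2); returns an (h, w) heatmap peaking at the box center."""
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    l, r = xs - box[0], box[2] - xs          # horizontal distances to box edges
    t, b = ys - box[1], box[3] - ys          # vertical distances to box edges
    inside = (l > 0) & (r > 0) & (t > 0) & (b > 0)
    rx = np.clip(np.minimum(l, r) / (np.maximum(l, r) + 1e-6), 0, 1)
    ry = np.clip(np.minimum(t, b) / (np.maximum(t, b) + 1e-6), 0, 1)
    return inside * rx ** alpha * ry ** beta  # 1 at the center, decaying to the edges
```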
Removing adverse weather conditions such as rain, raindrops, and snow from images is critical for various real-world applications, including autonomous driving, surveillance, and remote sensing. However, existing multi-task approaches typically rely on augmenting the model with additional parameters to handle multiple scenarios. While this enables the model to address diverse tasks, the introduction of extra parameters significantly complicates its practical deployment. In this paper, we propose a novel Gradient-Guided Parameter Mask for Multi-Scenario Image Restoration under adverse weather, designed to effectively handle image degradation under diverse weather conditions without additional parameters. Our method segments model parameters into common and specific components by evaluating the gradient variation intensity during training for each specific weather condition. This enables the model to precisely and adaptively learn relevant features for each weather scenario, improving both efficiency and effectiveness. The masks, constructed from gradient fluctuations, isolate parameters influenced by other tasks, ensuring strong performance across all scenarios without adding extra parameters. We demonstrate the state-of-the-art performance of our framework through extensive experiments on multiple benchmark datasets. Specifically, our method achieves PSNR scores of 29.22 on the Raindrop dataset, 30.76 on the Rain dataset, and 29.56 on the Snow100K dataset. Code is available at: this https URL.
https://arxiv.org/abs/2411.16739
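A toy sketch of deriving per-task parameter masks from gradient activity follows; the accumulated-magnitude statistic and quantile threshold are assumptions standing in for the paper's gradient-variation-intensity criterion.

```python
# Hedged sketch: flag task-specific parameters by where one task's gradients concentrate.
import torch

def task_specific_mask(model, task_losses, quantile=0.8):
    """task_losses: iterable of scalar losses for one weather task (e.g. per batch).
    Returns one 0/1 mask per parameter tensor."""
    params = list(model.parameters())
    stats = [torch.zeros_like(p) for p in params]
    for loss in task_losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True)
        for s, g in zip(stats, grads):
            s.add_(g.abs())  # accumulate per-parameter gradient activity
    return [(s > torch.quantile(s.flatten(), quantile)).float() for s in stats]
```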
The increasing complexity of regulatory updates from global authorities presents significant challenges for medical device manufacturers, necessitating agile strategies to sustain compliance and maintain market access. Concurrently, regulatory bodies must effectively monitor manufacturers' responses and develop strategic surveillance plans. This study employs a multi-agent modeling approach, enhanced with Large Language Models (LLMs), to simulate regulatory dynamics and examine the adaptive behaviors of key actors, including regulatory bodies, manufacturers, and competitors. These agents operate within a simulated environment governed by regulatory flow theory, capturing the impacts of regulatory changes on compliance decisions, market adaptation, and innovation strategies. Our findings illuminate the influence of regulatory shifts on industry behaviour and identify strategic opportunities for improving regulatory practices, optimizing compliance, and fostering innovation. By leveraging the integration of multi-agent systems and LLMs, this research provides a novel perspective and offers actionable insights for stakeholders navigating the evolving regulatory landscape of the medical device industry.
https://arxiv.org/abs/2411.15356
Video Anomaly Detection (VAD) aims to automatically analyze spatiotemporal patterns in surveillance videos collected from open spaces to detect anomalous events that may cause harm without physical contact. However, vision-based surveillance systems such as closed-circuit television often capture personally identifiable information. The lack of transparency and interpretability in video transmission and usage raises public concerns about privacy and ethics, limiting the real-world application of VAD. Recently, researchers have focused on privacy concerns in VAD by conducting systematic studies from various perspectives including data, features, and systems, making Privacy-Preserving Video Anomaly Detection (P2VAD) a hotspot in the AI community. However, current research in P2VAD is fragmented, and prior reviews have mostly focused on methods using RGB sequences, overlooking privacy leakage and appearance bias considerations. To address this gap, this article systematically reviews the progress of P2VAD for the first time, defining its scope and providing an intuitive taxonomy. We outline the basic assumptions, learning frameworks, and optimization objectives of various approaches, analyzing their strengths, weaknesses, and potential correlations. Additionally, we provide open access to research resources such as benchmark datasets and available code. Finally, we discuss key challenges and future opportunities from the perspectives of AI development and P2VAD deployment, aiming to guide future work in the field.
https://arxiv.org/abs/2411.14565
Medical vision-and-language models (MVLMs) have attracted substantial interest due to their capability to offer a natural language interface for interpreting complex medical data. Their applications are versatile and have the potential to improve diagnostic accuracy and decision-making for individual patients while also contributing to enhanced public health monitoring, disease surveillance, and policy-making through more efficient analysis of large data sets. MVLMs integrate natural language processing with medical images to enable a more comprehensive and contextual understanding of medical images alongside their corresponding textual information. Unlike general vision-and-language models trained on diverse, non-specialized datasets, MVLMs are purpose-built for the medical domain, automatically extracting and interpreting critical information from medical images and textual reports to support clinical decision-making. Popular clinical applications of MVLMs include automated medical report generation, medical visual question answering, medical multimodal segmentation, diagnosis and prognosis, and medical image-text retrieval. Here, we provide a comprehensive overview of MVLMs and the various medical tasks to which they have been applied. We conduct a detailed analysis of various vision-and-language model architectures, focusing on their distinct strategies for cross-modal integration/exploitation of medical visual and textual features. We also examine the datasets used for these tasks and compare the performance of different models based on standardized evaluation metrics. Furthermore, we highlight potential challenges and summarize future research trends and directions. The full collection of papers and codes is available at: this https URL.
https://arxiv.org/abs/2411.12195
This work investigates the self-organization of multi-agent systems into closed trajectories, a common requirement in unmanned aerial vehicle (UAV) surveillance tasks. In such scenarios, smooth, unbiased control signals save energy and mitigate mechanical strain. We propose a decentralized control system architecture that produces a globally stable emergent structure from local observations only; there is no requirement for agents to share a global plan or follow prescribed trajectories. Central to our approach is the formulation of an injective virtual embedding induced by rotations of the actual agent positions. This embedding serves as a structure-preserving map around which all agents stabilize their relative positions, and it permits the use of well-established linear control techniques. We construct the embedding such that it is topologically equivalent to the desired trajectory (i.e., a homeomorphism), thereby preserving the stability characteristics. We demonstrate the versatility of this approach through implementation on a swarm of Quanser QDrone quadcopters. Results demonstrate that the quadcopters self-organize into the desired trajectory while maintaining even separation.
https://arxiv.org/abs/2411.11142
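A toy illustration of the rotation-induced embedding for a circular trajectory: each agent's position is rotated into a shared virtual frame, a linear consensus controller acts there, and the control is mapped back. Single-integrator dynamics, the gain, and the centralized average (standing in for the paper's local observations) are simplifying assumptions.

```python
# Hedged sketch: even spacing on a circle via a rotation-induced virtual embedding.
import numpy as np

N, k, dt = 6, 1.0, 0.05
theta = 2 * np.pi / N

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

pos = np.random.randn(N, 2)  # random initial agent positions
for _ in range(2000):
    v = np.array([rot(-i * theta) @ pos[i] for i in range(N)])  # virtual embedding
    target = v.mean(axis=0)                 # consensus point in the embedding
    for i in range(N):
        u_virtual = k * (target - v[i])     # linear control in embedding space
        pos[i] += dt * (rot(i * theta) @ u_virtual)  # map control back to real frame
# once v_i agree on a point c, pos[i] = rot(i*theta) @ c: evenly spaced on a circle
```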
Anomaly detection in video surveillance has recently gained interest from the research community. The temporal duration of anomalies varies within video streams, complicating the learning of the temporal dynamics of specific events. This paper presents a temporal-granularity method for an anomaly detection model (TeG) in real-world surveillance, combining spatio-temporal features at different time-scales. The TeG model employs multi-head cross-attention blocks and multi-head self-attention blocks for this purpose. Additionally, we extend the UCF-Crime dataset with new anomaly types relevant to a Smart City research project. The TeG model is deployed and validated in a city surveillance system, achieving successful real-time results in industrial settings.
https://arxiv.org/abs/2411.11003
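Fusing two temporal granularities with multi-head cross-attention can be sketched as below; the feature dimension, head count, and the fine/coarse sequence shapes are illustrative assumptions, not the TeG architecture itself.

```python
# Hedged sketch: enrich per-frame features with coarser, longer-range context.
import torch
import torch.nn as nn

d = 256
cross = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

def fuse(fine_feats, coarse_feats):
    """fine_feats: (B, T, d) per-frame; coarse_feats: (B, T // 8, d) per-segment."""
    fused, _ = cross(query=fine_feats, key=coarse_feats, value=coarse_feats)
    return fused + fine_feats  # fine features enriched with long-range context
```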
Computer vision is increasingly used in areas such as unmanned vehicles, surveillance systems, and remote sensing. However, in foggy scenarios, image degradation leads to loss of target detail, which seriously affects the accuracy and effectiveness of these vision tasks. Because its electromagnetic waves vibrate in a specific direction, polarized light resists scattering and refraction in complex media more effectively than unpolarized light. As a result, polarized light better maintains its polarization characteristics in complex transmission media and under long-distance imaging conditions. This property makes polarized imaging especially suitable for complex scenes such as outdoor and underwater environments, and particularly foggy ones, where higher-quality images can be obtained. Building on this advantage, we propose an innovative semi-physical polarization dehazing method that does not rely on an external light source. The method simulates the diffusion process of fog and designs a diffusion kernel corresponding to the image blur this diffusion causes. By employing spatiotemporal Fourier transforms and deconvolution operations, the method recovers the state of fog droplets prior to diffusion and the light inversion distribution of objects, effectively achieving dehazing and detail enhancement of the scene.
https://arxiv.org/abs/2411.09924
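The recover-before-diffusion idea can be illustrated with a single-frame Wiener-style Fourier deconvolution under an assumed isotropic Gaussian diffusion kernel; the kernel width and noise-to-signal ratio are assumptions, and the authors' spatiotemporal, polarization-aware method is considerably richer.

```python
# Hedged sketch: Fourier deconvolution of an assumed Gaussian fog-diffusion blur.
import numpy as np

def dehaze_deconv(img, sigma=3.0, snr=0.01):
    """img: (H, W) float array; returns a deblurred estimate via Wiener deconvolution."""
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    # Transfer function of a spatial Gaussian with std sigma (its Fourier transform).
    H = np.exp(-2 * (np.pi ** 2) * (sigma ** 2) * (fx ** 2 + fy ** 2))
    G = np.fft.fft2(img)
    F = np.conj(H) * G / (np.abs(H) ** 2 + snr)  # Wiener deconvolution
    return np.real(np.fft.ifft2(F))
```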