Existing video anomaly detection datasets are inadequate for representing complex anomalies that occur due to the interactions between objects. The absence of complex anomalies in previous video anomaly detection datasets affects research by shifting the focus onto simple anomalies. To address this problem, we introduce a new large-scale dataset: ComplexVAD. In addition, we propose a novel method to detect complex anomalies via modeling the interactions between objects using a scene graph with spatio-temporal attributes. With our proposed method and two other state-of-the-art video anomaly detection methods, we obtain baseline scores on ComplexVAD and demonstrate that our new method outperforms existing works.
https://arxiv.org/abs/2501.09733
With the increased use of the internet and social networks for online discussions, the spread of toxic and inappropriate content on social networking sites has also increased. Several studies have been conducted in different languages. However, less work has been done on identifying inappropriate content in South Asian languages using deep learning techniques. In Urdu, spellings are not standardized: people write the same word with several common spellings, and mixing in other languages such as English makes the text even more challenging to process. Limited research is available on processing such language with the best available algorithms. Using an attention layer with a deep learning model can help handle long-term dependencies and increase its efficiency. To explore the effects of the attention layer, this study proposes an attention-based Bidirectional GRU hybrid model for identifying inappropriate content in Urdu Unicode text. Four baseline deep learning models (LSTM, Bi-LSTM, GRU, and TCN) are used to compare the performance of the proposed model. The results of these models were compared based on evaluation metrics, dataset size, and the impact of the word embedding layer. Pre-trained Urdu word2Vec embeddings were used in our experiments. Our proposed model, BiGRU-A, outperformed all baseline models, yielding 84% accuracy without using the pre-trained word2Vec layer. From our experiments, we have established that the attention layer improves the model's efficiency and that pre-trained word2Vec embeddings do not work well with an inappropriate content dataset.
https://arxiv.org/abs/2501.09722
Multimodal language models (MLMs) based on the generative pre-trained Transformer are considered powerful candidates for unifying various domains and tasks. MLMs developed for remote sensing (RS) have demonstrated outstanding performance in multiple tasks, such as visual question answering and visual grounding. In addition to visual grounding, which detects specific objects corresponding to a given instruction, aerial detection, which detects all objects of multiple categories, is also a valuable and challenging task for RS foundation models. However, aerial detection has not been explored by existing RS MLMs because the autoregressive prediction mechanism of MLMs differs significantly from the detection outputs. In this paper, we present a simple baseline for applying MLMs to aerial detection for the first time, named LMMRotate. Specifically, we first introduce a normalization method to transform detection outputs into textual outputs to be compatible with the MLM framework. Then, we propose an evaluation method, which ensures a fair comparison between MLMs and conventional object detection models. We construct the baseline by fine-tuning open-source general-purpose MLMs and achieve impressive detection performance comparable to that of conventional detectors. We hope that this baseline will serve as a reference for future MLM development, enabling more comprehensive capabilities for understanding RS images. Code is available at this https URL.
https://arxiv.org/abs/2501.09720
For privacy and security concerns, the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios, erasure requests originate at any time from both users and model owners, and these requests usually form a sequence. Therefore, under such a setting, selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify three key challenges. (i) For unwanted knowledge, efficient and effective deleting is crucial. (ii) For remaining knowledge, the impact brought by the forgetting procedure should be minimal. (iii) In real-world scenarios, the training samples may be scarce or partially missing during the process of forgetting. To address them, we first propose Group Sparse LoRA (GS-LoRA). Specifically, towards (i), we introduce LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently, and towards (ii), a simple group sparse regularization is adopted, enabling automatic selection of specific LoRA groups and zeroing out the others. To further extend GS-LoRA to more practical scenarios, we incorporate prototype information as additional supervision and introduce a more practical approach, GS-LoRA++. For each forgotten class, we move the logits away from its original prototype. For the remaining classes, we pull the logits closer to their respective prototypes. We conduct extensive experiments on face recognition, object detection and image classification and demonstrate that our method manages to forget specific classes with minimal impact on other classes. Codes have been released on this https URL.
https://arxiv.org/abs/2501.09705
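The group sparse regularization mentioned in the GS-LoRA abstract above can be illustrated with a small, dependency-free sketch. It assumes a group-lasso style penalty (the sum of per-group L2 norms) and uses group soft-thresholding to show how whole groups end up exactly zero. The function names, the flat-list representation of a LoRA group, and the threshold value are hypothetical and chosen only for illustration; the abstract does not specify the optimizer or the exact loss used in the paper.

```python
import math

def group_norm(group):
    """L2 norm over all weights in one LoRA group (a flat list of floats here)."""
    return math.sqrt(sum(w * w for w in group))

def group_sparse_penalty(groups):
    """Group-lasso style penalty: the sum of per-group L2 norms."""
    return sum(group_norm(g) for g in groups)

def shrink_group(group, threshold):
    """Group soft-thresholding: scale the whole group toward zero and set it
    exactly to zero when its norm is at or below the threshold."""
    norm = group_norm(group)
    if norm <= threshold:
        return [0.0 for _ in group]
    scale = 1.0 - threshold / norm
    return [scale * w for w in group]

# Two hypothetical LoRA groups: one with large updates, one with tiny updates.
groups = [[3.0, 3.0, 3.0, 3.0], [0.1, 0.1, 0.1, 0.1]]
pruned = [shrink_group(g, threshold=0.5) for g in groups]
```

Because the penalty acts on each group's norm rather than on individual weights, the group with tiny updates is removed as a unit while the group with large updates is only shrunk slightly, which matches the "select some LoRA groups, zero out the others" behavior the abstract describes.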
Autonomous docking remains one of the most challenging maneuvers in marine robotics, requiring precise control and robust perception in confined spaces. This paper presents a novel approach integrating Model Predictive Path Integral (MPPI) control with real-time LiDAR-based dock detection for autonomous surface vessel docking. Our framework uniquely combines probabilistic trajectory optimization with a multiobjective cost function that simultaneously considers docking precision, safety constraints, and motion efficiency. The MPPI controller generates optimal trajectories by intelligently sampling control sequences and evaluating their costs based on dynamic clearance requirements, orientation alignment, and target position objectives. We introduce an adaptive dock detection pipeline that processes LiDAR point clouds to extract critical geometric features, enabling real-time updates of docking parameters. The proposed method is extensively validated in a physics-based simulation environment that incorporates realistic sensor noise, vessel dynamics, and environmental constraints. Results demonstrate successful docking from various initial positions while maintaining safe clearances and smooth motion characteristics.
https://arxiv.org/abs/2501.09668
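The sampling-and-weighting loop at the heart of MPPI, as described in the docking abstract above, can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not the paper's controller: the vessel is replaced by a 2D point whose velocity is commanded directly, the cost contains only distance to target and control effort (clearance and orientation terms are omitted), and all names and parameter values (`rollout_cost`, `mppi_step`, `run_docking`, `sigma`, `lam`, the target position) are hypothetical.

```python
import math
import random

def rollout_cost(state, controls, target, dt=0.1, effort_weight=0.01):
    """Simulate a 2D single integrator and accumulate distance-to-target plus effort."""
    x, y = state
    cost = 0.0
    for vx, vy in controls:
        x += vx * dt
        y += vy * dt
        cost += math.hypot(x - target[0], y - target[1])
        cost += effort_weight * (vx * vx + vy * vy)
    return cost

def mppi_step(state, nominal, target, rng, samples=300, sigma=0.3, lam=1.0):
    """One MPPI update: perturb the nominal controls with Gaussian noise,
    weight each sample by exp(-cost / lam), and add the weighted mean noise."""
    noises, costs = [], []
    for _ in range(samples):
        eps = [(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)) for _ in nominal]
        perturbed = [(u[0] + e[0], u[1] + e[1]) for u, e in zip(nominal, eps)]
        noises.append(eps)
        costs.append(rollout_cost(state, perturbed, target))
    best = min(costs)  # subtracting the minimum keeps the exponentials stable
    weights = [math.exp(-(c - best) / lam) for c in costs]
    total = sum(weights)
    updated = []
    for t, u in enumerate(nominal):
        dx = sum(w * n[t][0] for w, n in zip(weights, noises)) / total
        dy = sum(w * n[t][1] for w, n in zip(weights, noises)) / total
        updated.append((u[0] + dx, u[1] + dy))
    return updated

def run_docking(start=(0.0, 0.0), target=(2.0, 1.0), horizon=15, steps=60, dt=0.1, seed=0):
    """Receding-horizon loop: replan, apply the first control, shift the plan."""
    rng = random.Random(seed)
    state = start
    nominal = [(0.0, 0.0)] * horizon
    for _ in range(steps):
        nominal = mppi_step(state, nominal, target, rng)
        vx, vy = nominal[0]
        state = (state[0] + vx * dt, state[1] + vy * dt)
        nominal = nominal[1:] + [nominal[-1]]
    return state
```

Calling `run_docking()` returns the final position of the toy vehicle, which should end up close to the target. In a fuller version, the clearance, orientation, and vessel-dynamics terms mentioned in the abstract would enter through `rollout_cost`, leaving the sampling and weighting logic unchanged.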
Face recognition technology has dramatically transformed the landscape of security, surveillance, and authentication systems, offering a user-friendly and non-invasive biometric solution. However, despite its significant advantages, face recognition systems face increasing threats from physical and digital spoofing attacks. Current research typically treats face recognition and attack detection as distinct classification challenges. This approach necessitates the implementation of separate models for each task, leading to considerable computational complexity, particularly on devices with limited resources. Such inefficiencies can stifle scalability and hinder performance. In response to these challenges, this paper introduces an innovative unified model designed for face recognition and detection of physical and digital attacks. By leveraging the advanced Swin Transformer backbone and incorporating HiLo attention in a convolutional neural network framework, we address unified face recognition and spoof attack detection more effectively. Moreover, we introduce augmentation techniques that replicate the traits of physical and digital spoofing cues, significantly enhancing our model's robustness. Through comprehensive experimental evaluation across various datasets, we showcase the effectiveness of our model in unified face recognition and spoof detection. Additionally, we confirm its resilience against unseen physical and digital spoofing attacks, underscoring its potential for real-world applications.
https://arxiv.org/abs/2501.09635
With the rapid advancement of deepfake generation technologies, the demand for robust and accurate face forgery detection algorithms has become increasingly critical. Recent studies have demonstrated that wavelet analysis can uncover subtle forgery artifacts that remain imperceptible in the spatial domain. Wavelets effectively capture important facial contours, which are often slender, fine-grained, and global in nature. However, existing wavelet-based approaches fail to fully leverage these unique characteristics, resulting in sub-optimal feature extraction and limited generalizability. To address this challenge, we introduce WMamba, a novel wavelet-based feature extractor built upon the Mamba architecture. WMamba maximizes the utility of wavelet information through two key innovations. First, we propose Dynamic Contour Convolution (DCConv), which employs specially crafted deformable kernels to adaptively model slender facial contours. Second, by leveraging the Mamba architecture, our method captures long-range spatial relationships with linear computational complexity. This efficiency allows for the extraction of fine-grained, global forgery artifacts from small image patches. Extensive experimental results show that WMamba achieves state-of-the-art (SOTA) performance, highlighting its effectiveness and superiority in face forgery detection.
https://arxiv.org/abs/2501.09617
The rapid spread of fake news presents a significant global challenge, particularly in low-resource languages like Bangla, which lack adequate datasets and detection tools. Although manual fact-checking is accurate, it is too expensive and slow to prevent the dissemination of fake news. Addressing this gap, we introduce BanFakeNews-2.0, a robust dataset to enhance Bangla fake news detection. This version includes 11,700 additional, meticulously curated fake news articles validated from credible sources, creating a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories. In addition, we created a manually curated independent test set of 460 fake and 540 authentic news items for rigorous evaluation. We invested effort in collecting fake news from credible sources and verifying it manually while preserving its linguistic richness. We develop a benchmark system utilizing transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers variants (F1: 87%) and Large Language Models with Quantized Low-Rank Approximation (F1: 89%), that significantly outperforms traditional methods. BanFakeNews-2.0 offers a valuable resource to advance research and application in fake news detection for low-resource languages. We publicly release our dataset and model on GitHub to foster research in this direction.
https://arxiv.org/abs/2501.09604
The appearance of surface impurities (e.g., water stains, fingerprints, stickers) is an often-mentioned issue that degrades automated visual inspection systems. At the same time, synthetic data generation techniques for visual surface inspection have focused primarily on generating perfect examples and defects, disregarding impurities. This study highlights the importance of considering impurities when generating synthetic data. We introduce a procedural method to include photorealistic water stains in synthetic data. The synthetic datasets are generated to correspond to real datasets and are further used to train an anomaly detection model and investigate the influence of water stains. The high-resolution images used for surface inspection lead to memory bottlenecks during anomaly detection training. To address this, we introduce Sequential PatchCore, a method that builds coresets sequentially and makes training on large images tractable on consumer-grade hardware. This allows us to perform transfer learning using coresets pre-trained on different dataset versions. Our results show the benefits of using synthetic data for pre-training an explicit coreset anomaly model and the additional performance benefits of fine-tuning the coreset using real data. We observed how impurities and labelling ambiguity lower model performance, and we additionally report defect-wise recall to provide an industrially relevant perspective on model performance.
https://arxiv.org/abs/2501.09579
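One plausible reading of "building coresets sequentially" in the Sequential PatchCore abstract above is sketched below. It is an assumption rather than the paper's algorithm: patch features arrive in batches, each batch is merged with the current coreset, and the merged set is reduced back to a fixed size with greedy farthest-point (k-center) selection, the kind of subsampling PatchCore is known for. The names `greedy_coreset` and `sequential_coreset` and the toy 2D features are hypothetical.

```python
import math

def greedy_coreset(points, size):
    """Greedy k-center (farthest-point) selection of at most `size` points."""
    if len(points) <= size:
        return list(points)
    selected = [points[0]]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, selected[0]) for p in points]
    while len(selected) < size:
        idx = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(points[idx])
        min_dist = [min(d, math.dist(p, points[idx])) for d, p in zip(min_dist, points)]
    return selected

def sequential_coreset(batches, size):
    """Process feature batches one at a time so that memory holds at most
    the current coreset plus a single batch."""
    coreset = []
    for batch in batches:
        coreset = greedy_coreset(coreset + list(batch), size)
    return coreset

# Toy 2D "patch features" arriving in two batches.
batches = [[(0, 0), (0.1, 0), (5, 5)], [(10, 0), (10.1, 0)]]
coreset = sequential_coreset(batches, size=3)
```

The design point is the memory bound: the full feature bank is never materialized, which is the kind of property that would make large images tractable on consumer-grade hardware. A previously built coreset can also be passed in as the first batch, which mirrors the transfer-learning use described in the abstract.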
De-identification of medical images is a critical step to ensure privacy during data sharing in research and clinical settings. The initial step in this process involves detecting Protected Health Information (PHI), which can be found in image metadata or imprinted within image pixels. Despite the importance of such systems, there has been limited evaluation of existing AI-based solutions, creating barriers to the development of reliable and robust tools. In this study, we present an AI-based pipeline for PHI detection, comprising three key components: text detection, text extraction, and analysis of PHI content in medical images. By experimenting with exchanging roles of vision and language models within the pipeline, we evaluate the performance and recommend the best setup for the PHI detection task.
https://arxiv.org/abs/2501.09552
Text-to-SQL enables users to interact with databases through natural language, simplifying the retrieval and synthesis of information. Despite the success of large language models (LLMs) in converting natural language questions into SQL queries, their broader adoption is limited by two main challenges: achieving robust generalization across diverse queries and ensuring interpretative confidence in their predictions. To tackle these issues, our research investigates the integration of selective classifiers into Text-to-SQL systems. We analyse the trade-off between coverage and risk using entropy-based confidence estimation with selective classifiers and assess its impact on the overall performance of Text-to-SQL models. Additionally, we explore the models' initial calibration and improve it with calibration techniques for better alignment between confidence and accuracy. Our experimental results show that encoder-decoder T5 is better calibrated than in-context-learning GPT 4 and decoder-only Llama 3, and thus the designated external entropy-based selective classifier performs better with it. The study also reveals that, in terms of error detection, the selective classifier is more likely to detect errors associated with irrelevant questions than errors from incorrect query generation.
https://arxiv.org/abs/2501.09527
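The coverage-versus-risk trade-off in the Text-to-SQL abstract above can be made concrete with a short sketch. It assumes that confidence is the mean Shannon entropy of the per-token output distributions and that the system abstains whenever this entropy exceeds a threshold. The abstract does not say how the paper aggregates entropy, so that choice, the function names, and the example values are all illustrative assumptions.

```python
import math

def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_distributions
    ]
    return sum(entropies) / len(entropies)

def coverage_and_risk(entropies, is_correct, threshold):
    """Answer only when entropy <= threshold.
    Coverage is the fraction of questions answered; risk is the error rate
    among the answered questions."""
    answered = [ok for h, ok in zip(entropies, is_correct) if h <= threshold]
    coverage = len(answered) / len(entropies)
    risk = (1 - sum(answered) / len(answered)) if answered else 0.0
    return coverage, risk

# Hypothetical entropies for four generated queries and whether each was correct.
entropies = [0.1, 0.2, 0.9, 1.2]
is_correct = [True, True, False, True]
strict = coverage_and_risk(entropies, is_correct, threshold=0.5)
lenient = coverage_and_risk(entropies, is_correct, threshold=1.0)
```

With the strict threshold the system answers half of the questions and makes no errors on them; with the lenient threshold it answers three quarters but one in three answers is wrong. Sweeping the threshold traces out the risk-coverage curve, and better calibration means that entropy separates correct from incorrect queries more cleanly along that curve.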
This study presents a comprehensive review of the potential of multimodal deep learning (DL) in medical diagnosis, using COVID-19 as a case example. Motivated by the success of artificial intelligence applications during the COVID-19 pandemic, this research aims to uncover the capabilities of DL in disease screening, prediction, and classification, and to derive insights that enhance the resilience, sustainability, and inclusiveness of science, technology, and innovation systems. Adopting a systematic approach, we investigate the fundamental methodologies, data sources, preprocessing steps, and challenges encountered in various studies and implementations. We explore the architecture of deep learning models, emphasising their data-specific structures and underlying algorithms. Subsequently, we compare different deep learning strategies utilised in COVID-19 analysis, evaluating them based on methodology, data, performance, and prerequisites for future research. By examining diverse data types and diagnostic modalities, this research contributes to scientific understanding and knowledge of the multimodal application of DL and its effectiveness in diagnosis. We have implemented and analysed 11 deep learning models using COVID-19 image, text, and speech (i.e., cough) data. Our analysis revealed that the MobileNet model achieved the highest accuracy: 99.97% for COVID-19 image data and 93.73% for speech data (i.e., cough). However, the BiGRU model demonstrated superior performance in COVID-19 text classification, with an accuracy of 99.89%. The broader implications of this research suggest potential benefits for other domains and disciplines that could leverage deep learning techniques for image, text, and speech analysis.
https://arxiv.org/abs/2501.09506
LiDAR is a crucial sensor in autonomous driving, commonly used alongside cameras. By exploiting this camera-LiDAR setup and recent advances in image representation learning, prior studies have shown the promising potential of image-to-LiDAR distillation. These prior works focus on designing their own losses to effectively distill pre-trained 2D image representations into a 3D model. However, the other parts of the design have been surprisingly unexplored. We find that fundamental design elements that have been overlooked in prior works, e.g., the LiDAR coordinate system, quantization according to the existing input interface, and data utilization, are more critical than developing loss functions. In this work, we show that simple fixes to these designs notably outperform existing methods in downstream task performance, by 16% in 3D semantic segmentation on the nuScenes dataset and 13% in 3D object detection on the KITTI dataset. We focus on overlooked design choices along the spatial and temporal axes. Spatially, prior work has used cylindrical coordinates and voxel sizes without considering their side effects when combined with the commonly deployed sparse convolution layer input interface, leading to spatial quantization errors in 3D models. Temporally, existing work has avoided cumbersome data curation by discarding unsynced data, limiting use to the small portion of data that is temporally synced across sensors. We analyze these effects and propose simple solutions for each overlooked aspect.
https://arxiv.org/abs/2501.09485
Detecting the three-dimensional position and orientation of objects using a single RGB camera is a foundational task in computer vision with many important applications. Traditionally, 3D object detection methods are trained in a fully-supervised setup, requiring vast amounts of human annotations, which are laborious, costly, and do not scale well with the ever-increasing amounts of data being captured. In this paper, we present the first method to train 3D object detectors for monocular RGB cameras without domain-specific human annotations, thus making orders of magnitude more data available for training. Thanks to the newly proposed Canonical Object Space, the method can not only exploit data across a variety of datasets and camera setups to train a single 3D detector, but, unlike previous work, it also works out of the box in previously unseen camera setups. All this is crucial for practical applications, where the data and cameras are extremely heterogeneous. The method is evaluated on two standard autonomous driving datasets, where it outperforms previous works, which, unlike our method, still rely on 2D human annotations.
https://arxiv.org/abs/2501.09481
Object detection plays a crucial role in smart video analysis, with applications ranging from autonomous driving and security to smart cities. However, achieving real-time object detection on edge devices presents significant challenges due to their limited computational resources and the high demands of deep neural network (DNN)-based detection models, particularly when processing high-resolution video. Conventional strategies, such as input down-sampling and network up-scaling, often compromise detection accuracy for faster performance or lead to higher inference latency. To address these issues, this paper introduces RE-POSE, a Reinforcement Learning (RL)-Driven Partitioning and Edge Offloading framework designed to optimize the accuracy-latency trade-off in resource-constrained edge environments. Our approach features an RL-Based Dynamic Clustering Algorithm (RL-DCA) that partitions video frames into non-uniform blocks based on object distribution and the computational characteristics of DNNs. Furthermore, a parallel edge offloading scheme is implemented to distribute these blocks across multiple edge servers for concurrent processing. Experimental evaluations show that RE-POSE significantly enhances detection accuracy and reduces inference latency, surpassing existing methods.
https://arxiv.org/abs/2501.09465
We explore the impact of aperture size and shape on automotive camera systems for deep-learning-based tasks like traffic sign recognition and light state detection. A method is proposed to simulate optical effects using the point spread function (PSF), enhancing realism and reducing the domain gap between synthetic and real-world images. Computer-generated scenes are refined with this technique to model optical distortions and improve simulation accuracy.
https://arxiv.org/abs/2501.09456
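Simulating an optical effect with a point spread function, as in the aperture abstract above, amounts to convolving the rendered image with the PSF. The sketch below is a deliberately simple stand-in: it assumes a normalized disk kernel, which approximates the defocus blur of a circular aperture under geometric optics. The paper's actual PSF model is not given in the abstract, and the names `disk_psf` and `apply_psf` are hypothetical.

```python
def disk_psf(radius):
    """Normalized disk-shaped kernel: a rough defocus PSF for a circular aperture."""
    size = 2 * radius + 1
    kernel = [
        [1.0 if (r - radius) ** 2 + (c - radius) ** 2 <= radius ** 2 else 0.0
         for c in range(size)]
        for r in range(size)
    ]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]

def apply_psf(image, psf):
    """Convolve a grayscale image (list of rows) with the PSF, replicating edge pixels."""
    h, w = len(image), len(image[0])
    k = len(psf) // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i, row in enumerate(psf):
                for j, weight in enumerate(row):
                    rr = min(max(r + k - i, 0), h - 1)
                    cc = min(max(c + k - j, 0), w - 1)
                    acc += weight * image[rr][cc]
            out[r][c] = acc
    return out

# A single bright pixel (a point light source) on a dark 7x7 background.
image = [[0.0] * 7 for _ in range(7)]
image[3][3] = 1.0
blurred = apply_psf(image, disk_psf(1))
```

Blurring a point source reproduces the PSF itself around that point, which is a convenient sanity check. Changing the kernel's size or outline is how aperture size and shape would enter such a simulation, for instance when comparing how a traffic light renders under different apertures.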
Video synthetic aperture radar (ViSAR) has attracted substantial attention in the moving target detection (MTD) field due to its ability to continuously monitor changes in the target area. In ViSAR, the shadows of moving targets are neither offset nor defocused, so they are widely used as a feature for MTD. However, the shadows are difficult to distinguish from low-scattering regions in the background, which causes more missed detections and false alarms. Therefore, it is worth investigating how to enhance the distinction between the shadows and the background. In this study, we propose the Shadow Enhancement and Background Suppression for ViSAR (SE-BSFV) algorithm. The SE-BSFV algorithm is based on low-rank representation (LRR) theory and adopts an online subspace learning technique to enhance shadows and suppress background in ViSAR images. First, we use a registration algorithm to register the ViSAR images and utilize a Gaussian mixture distribution (GMD) to model the ViSAR data. Second, the knowledge learned from the previous frames is leveraged to estimate the GMD parameters of the current frame, and the expectation-maximization (EM) algorithm is used to estimate the subspace parameters. Then, the foreground matrix of the current frame can be obtained. Finally, the alternating direction method of multipliers (ADMM) is used to eliminate strong scattering objects in the foreground matrix to obtain the final results. The experimental results indicate that, compared with several other advanced pre-processing algorithms, the SE-BSFV algorithm significantly enhances the shadows' saliency and greatly improves detection performance while remaining efficient.
https://arxiv.org/abs/2501.09341
Large language models (LLMs) have been adopted to perform text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into structured query language (SQL). However, such a technique faces correctness problems and requires efficient repairing solutions. In this paper, we conduct the first comprehensive study of text-to-SQL errors. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 29 error types in 7 categories. We also find that existing repairing attempts yield limited correctness improvement at the cost of high computational overhead and many mis-repairs. Based on the findings, we propose MapleRepair, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleRepair outperforms existing solutions by repairing 13.8% more queries with negligible mis-repairs and 67.4% less overhead.
https://arxiv.org/abs/2501.09310
Retrieval-Augmented Generation equips large language models with the capability to retrieve external knowledge, thereby mitigating hallucinations by incorporating information beyond the model's intrinsic abilities. However, most prior works have focused on invoking retrieval deterministically, which makes it unsuitable for tasks such as long-form question answering. Instead, dynamically performing retrieval by invoking it only when the underlying LLM lacks the required knowledge can be more efficient. In this context, we delve deeper into the question, "To Retrieve or Not to Retrieve?" by exploring multiple uncertainty detection methods. We evaluate these methods for the task of long-form question answering, employing dynamic retrieval, and present our comparisons. Our findings suggest that uncertainty detection metrics, such as Degree Matrix Jaccard and Eccentricity, can reduce the number of retrieval calls by almost half, with only a slight reduction in question-answering accuracy.
https://arxiv.org/abs/2501.09292
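The "retrieve only when uncertain" idea in the abstract above can be sketched with one of the metrics it names. The code assumes a common formulation of Degree Matrix Jaccard uncertainty: sample m responses, compute pairwise Jaccard similarities between their word sets, form the degree matrix D of that similarity graph, and score uncertainty as trace(mI - D) / m^2. The retrieval threshold and the function names are hypothetical, and the abstract does not specify how the paper wires this score into its pipeline.

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def degree_uncertainty(responses):
    """Degree-matrix uncertainty: trace(m*I - D) / m^2, where D holds each
    response's summed similarity to all sampled responses (itself included)."""
    m = len(responses)
    degrees = [sum(jaccard(r, other) for other in responses) for r in responses]
    return sum(m - d for d in degrees) / (m * m)

def should_retrieve(responses, threshold=0.5):
    """Invoke retrieval only when the sampled responses disagree enough."""
    return degree_uncertainty(responses) > threshold

# Hypothetical samples: consistent answers versus mutually unrelated answers.
consistent = ["Paris is the capital"] * 3
scattered = ["a b", "c d", "e f"]
```

When the sampled answers agree, the score is zero and the retrieval call is skipped; when they share no words, the score reaches (m - 1) / m and retrieval is triggered. Skipping calls in the first case is the mechanism by which such a gate could reduce retrieval calls with only a small loss in accuracy.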
In soccer video analysis, player detection is essential for identifying key events and reconstructing tactical positions. The presence of numerous players and frequent occlusions, combined with copyright restrictions, severely restricts the availability of datasets, leaving limited options such as SoccerNet-Tracking and SportsMOT. These datasets suffer from a lack of diversity, which hinders algorithms from adapting effectively to varied soccer video contexts. To address these challenges, we developed SoccerSynth-Detection, the first synthetic dataset designed for the detection of synthetic soccer players. It includes a broad range of random lighting and textures, as well as simulated camera motion blur. We validated its efficacy using an object detection model (YOLOv8n) against real-world datasets (SoccerNet-Tracking and SportsMOT). In transfer tests, it matched the performance of real datasets and significantly outperformed them on images with motion blur; in pre-training tests, it demonstrated its efficacy as a pre-training dataset, significantly enhancing the algorithm's overall performance. Our work demonstrates the potential of synthetic datasets to replace real datasets for algorithm training in the field of soccer video analysis.
https://arxiv.org/abs/2501.09281