Large language models (LLMs) often produce errors, including factual inaccuracies, biases, and reasoning failures, collectively referred to as "hallucinations". Recent studies have demonstrated that LLMs' internal states encode information regarding the truthfulness of their outputs, and that this information can be utilized to detect errors. In this work, we show that the internal representations of LLMs encode much more information about truthfulness than previously recognized. We first discover that the truthfulness information is concentrated in specific tokens, and leveraging this property significantly enhances error detection performance. Yet, we show that such error detectors fail to generalize across datasets, implying that -- contrary to prior claims -- truthfulness encoding is not universal but rather multifaceted. Next, we show that internal representations can also be used for predicting the types of errors the model is likely to make, facilitating the development of tailored mitigation strategies. Lastly, we reveal a discrepancy between LLMs' internal encoding and external behavior: they may encode the correct answer, yet consistently generate an incorrect one. Taken together, these insights deepen our understanding of LLM errors from the model's internal perspective, which can guide future research on enhancing error analysis and mitigation.
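As a rough illustration of the probing setup the abstract describes (not the authors' exact pipeline), a linear probe can be fit on hidden states taken at a chosen answer token; the feature extraction step and the placeholder data below are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed inputs: `hidden` holds one hidden-state vector per generation,
# taken at a chosen answer token; `correct` is a 0/1 label from a grader.
rng = np.random.default_rng(0)
n, d = 2000, 256                              # stand-in sizes
hidden = rng.normal(size=(n, d))              # placeholder hidden states
correct = (hidden[:, 0] > 0).astype(int)      # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(hidden, correct, test_size=0.2,
                                          random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"held-out error-detection accuracy: {probe.score(X_te, y_te):.3f}")
```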
https://arxiv.org/abs/2410.02707
Accurate 3D object detection in real-world environments requires a large amount of high-quality annotated data. Acquiring such data is tedious and expensive, and the effort often must be repeated when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario for constructing 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share their predictions with the ego agent (e.g., car). Naively using the received predictions as ground truths to train the ego car's detector, however, leads to inferior performance. We systematically study the problem and identify viewpoint mismatches and mislocalization (due to synchronization and GPS errors) as the main causes, which unavoidably result in false positives, false negatives, and inaccurate pseudo labels. We propose a distance-based curriculum, first learning from closer units with similar viewpoints and subsequently improving the quality of other units' predictions via self-training. We further demonstrate that an effective pseudo-label refinement module can be trained with a handful of annotated data, largely reducing the data quantity necessary to train an object detector. We validate our approach on the recently released real-world collaborative driving dataset, using reference cars' predictions as pseudo labels for the ego car. Extensive experiments covering several scenarios (e.g., different sensors, detectors, and domains) demonstrate the effectiveness of our approach toward label-efficient learning of 3D perception from other units' predictions.
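The distance-based curriculum might be organized along the following lines; this is a hedged sketch in which `Unit`, the schedule format, and the halving heuristic for "far" units are all illustrative stand-ins, not the paper's API.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Unit:
    distance_m: float      # distance from the ego car
    pseudo_boxes: list     # 3D boxes predicted by this unit's detector

def curriculum(units: List[Unit], n_rounds: int = 2) -> List[Tuple[str, list]]:
    """Build a training schedule: nearby units (similar viewpoints) first,
    then repeated self-training passes over the farther half, whose
    predictions are assumed to be noisier until re-labeled."""
    near_first = sorted(units, key=lambda u: u.distance_m)
    schedule = [("neighbor-supervised", u.pseudo_boxes) for u in near_first]
    far_half = near_first[len(near_first) // 2:]
    for r in range(n_rounds):
        # Each round, the partially trained ego detector would re-label the
        # far units' predictions before training on them again.
        schedule += [(f"self-training-round-{r}", u.pseudo_boxes) for u in far_half]
    return schedule

units = [Unit(5.0, []), Unit(40.0, []), Unit(12.0, [])]
print([stage for stage, _ in curriculum(units)])
```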
https://arxiv.org/abs/2410.02646
This paper has been accepted to the NeurIPS 2024 D&B Track. Harmful memes have proliferated on the Chinese Internet, yet research on detecting Chinese harmful memes lags significantly behind due to the absence of reliable datasets and effective detectors. To this end, we focus on the comprehensive detection of Chinese harmful memes. We construct ToxiCN MM, the first Chinese harmful meme dataset, which consists of 12,000 samples with fine-grained annotations for various meme types. Additionally, we propose a baseline detector, Multimodal Knowledge Enhancement (MKE), which incorporates LLM-generated contextual information about meme content to enhance the understanding of Chinese memes. During the evaluation phase, we conduct extensive quantitative experiments and qualitative analyses on multiple baselines, including LLMs and our MKE. The experimental results indicate that detecting Chinese harmful memes is challenging for existing models, while demonstrating the effectiveness of MKE. The resources for this paper are available at this https URL.
https://arxiv.org/abs/2410.02378
Detecting 3D keypoints with semantic consistency is widely useful in many scenarios such as pose estimation, shape registration, and robotics. Currently, most unsupervised 3D keypoint detection methods focus on rigid-body objects. However, when faced with deformable objects, the keypoints they identify do not preserve semantic consistency well. In this paper, we introduce Key-Grid, an innovative unsupervised keypoint detector for both rigid-body and deformable objects, built as an autoencoder framework. The encoder predicts keypoints, and the decoder utilizes the generated keypoints to reconstruct the objects. Unlike previous work, we leverage the identified keypoint information to form a 3D grid feature heatmap, called the grid heatmap, which is used in the decoder. The grid heatmap is a novel concept that represents latent variables for grid points sampled uniformly in the 3D cubic space, where each variable is the shortest distance between the grid point and the skeleton connecting keypoint pairs. Meanwhile, we incorporate information from each layer of the encoder into the decoder. We conduct an extensive evaluation of Key-Grid on a list of benchmark datasets. Key-Grid achieves state-of-the-art performance on the semantic consistency and position accuracy of keypoints. Moreover, we demonstrate the robustness of Key-Grid to noise and downsampling. In addition, we achieve SE(3) invariance of keypoints by generalizing Key-Grid to an SE(3)-invariant backbone.
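A small sketch of how such a grid heatmap could be computed: the value at each uniformly sampled grid point is its shortest distance to the skeleton segments connecting keypoint pairs. Connecting consecutive keypoints is an assumption made here for illustration.

```python
import numpy as np

def point_segment_dist(p, a, b):
    """Distance from points p (N, 3) to the segment [a, b]."""
    ab = b - a
    t = np.clip(((p - a) @ ab) / (ab @ ab + 1e-12), 0.0, 1.0)
    proj = a + t[:, None] * ab            # closest point on the segment
    return np.linalg.norm(p - proj, axis=1)

def grid_heatmap(keypoints, res=16):
    """Shortest distance from each grid point in the unit cube to the
    skeleton formed by consecutive keypoint pairs (assumed pairing)."""
    lin = np.linspace(0.0, 1.0, res)
    grid = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), -1).reshape(-1, 3)
    dists = [point_segment_dist(grid, keypoints[i], keypoints[i + 1])
             for i in range(len(keypoints) - 1)]
    return np.min(dists, axis=0).reshape(res, res, res)

kp = np.random.rand(8, 3)          # placeholder keypoints
print(grid_heatmap(kp).shape)      # (16, 16, 16)
```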
https://arxiv.org/abs/2410.02237
Being able to accurately monitor the screen exposure of young children is important for research on phenomena linked to screen use such as childhood obesity, physical activity, and social interaction. Most existing studies rely upon self-report or manual measures from bulky wearable sensors, thus lacking efficiency and accuracy in capturing quantitative screen exposure data. In this work, we developed a novel sensor informatics framework that utilizes egocentric images from a wearable sensor, termed the screen time tracker (STT), and a vision language model (VLM). In particular, we devised a multi-view VLM that takes multiple views from egocentric image sequences and interprets screen exposure dynamically. We validated our approach by using a dataset of children's free-living activities, demonstrating significant improvement over existing methods in plain vision language models and object detection models. Results supported the promise of this monitoring approach, which could optimize behavioral research on screen exposure in children's naturalistic settings.
https://arxiv.org/abs/2410.01966
Neural Radiance Fields (NeRF) are widely used for novel-view synthesis and have been adapted for 3D Object Detection (3DOD), offering a promising approach to 3DOD through view-synthesis representation. However, NeRF faces inherent limitations: (i) limited representational capacity for 3DOD due to its implicit nature, and (ii) slow rendering speeds. Recently, 3D Gaussian Splatting (3DGS) has emerged as an explicit 3D representation that addresses these limitations. Inspired by these advantages, this paper introduces 3DGS into 3DOD for the first time, identifying two main challenges: (i) Ambiguous spatial distribution of Gaussian blobs: 3DGS primarily relies on 2D pixel-level supervision, resulting in unclear 3D spatial distribution of Gaussian blobs and poor differentiation between objects and background, which hinders 3DOD; (ii) Excessive background blobs: 2D images often include numerous background pixels, leading to densely reconstructed 3DGS with many noisy Gaussian blobs representing the background, negatively affecting detection. To tackle challenge (i), we leverage the fact that 3DGS reconstruction is derived from 2D images, and propose an elegant and efficient solution by incorporating 2D Boundary Guidance to significantly enhance the spatial distribution of Gaussian blobs, resulting in clearer differentiation between objects and their background. To address challenge (ii), we propose a Box-Focused Sampling strategy that uses 2D boxes to generate object probability distributions in 3D space, allowing effective probabilistic sampling in 3D to retain more object blobs and reduce noisy background blobs. Benefiting from our designs, our 3DGS-DET significantly outperforms the SOTA NeRF-based method, NeRF-Det, achieving improvements of +6.6 on mAP@0.25 and +8.1 on mAP@0.5 on the ScanNet dataset, and an impressive +31.5 on mAP@0.25 on the ARKitScenes dataset.
https://arxiv.org/abs/2410.01647
Text-to-image generation requires a large amount of training data to synthesize high-quality images. To augment training data, previous methods rely on data interpolations such as cropping, flipping, and mixup, which fail to introduce new information and yield only marginal improvements. In this paper, we propose a new data augmentation method for text-to-image generation using linear extrapolation. Specifically, we apply linear extrapolation only to text features, and new image data are retrieved from the internet by search engines. For the reliability of new text-image pairs, we design two outlier detectors to purify the retrieved images. Based on extrapolation, we construct training samples dozens of times larger than the original dataset, resulting in a significant improvement in text-to-image performance. Moreover, we propose a NULL-guidance to refine score estimation, and apply recurrent affine transformation to fuse text information. Our model achieves FID scores of 7.91, 9.52 and 5.00 on the CUB, Oxford and COCO datasets, respectively. The code and data will be available on GitHub (this https URL).
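The core extrapolation step admits a compact sketch: each text feature is pushed past itself, away from a randomly paired neighbor. The weight `lam` and the neighbor choice are assumptions, not the paper's settings.

```python
import numpy as np

def extrapolate_text_features(feats, lam=0.5, seed=0):
    """feats: (N, d) text embeddings. For each anchor f_i, pick another
    sample f_j and move beyond the anchor, away from the neighbor:
        f_new = f_i + lam * (f_i - f_j)   (lam > 0 => extrapolation)"""
    rng = np.random.default_rng(seed)
    j = rng.permutation(len(feats))              # random pairing
    return feats + lam * (feats - feats[j])

text_feats = np.random.randn(100, 512).astype(np.float32)
new_feats = extrapolate_text_features(text_feats)
print(new_feats.shape)   # (100, 512); each row would seed an image search
```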
https://arxiv.org/abs/2410.01638
While generative AI (GenAI) offers countless possibilities for creative and productive tasks, artificially generated media can be misused for fraud, manipulation, scams, misinformation campaigns, and more. To mitigate the risks associated with maliciously generated media, forensic classifiers are employed to identify AI-generated content. However, current forensic classifiers are often not evaluated in practically relevant scenarios, such as the presence of an attacker or when real-world artifacts like social media degradations affect images. In this paper, we evaluate state-of-the-art AI-generated image (AIGI) detectors under different attack scenarios. We demonstrate that forensic classifiers can be effectively attacked in realistic settings, even when the attacker does not have access to the target model and post-processing occurs after the adversarial examples are created, which is standard on social media platforms. These attacks can significantly reduce detection accuracy to the extent that the risks of relying on detectors outweigh their benefits. Finally, we propose a simple defense mechanism to make CLIP-based detectors, which are currently the best-performing detectors, robust against these attacks.
https://arxiv.org/abs/2410.01574
Skins wrap around our bodies, leather covers the sofa, and sheet metal coats the car: this suggests that objects are enclosed by a series of continuous surfaces, which provides an informative geometry prior for objectness deduction. In this paper, we propose Gaussian-Det, which leverages Gaussian Splatting as a surface representation for multi-view 3D object detection. Unlike existing monocular or NeRF-based methods that depict objects via discrete positional data, Gaussian-Det models objects in a continuous manner by formulating the input Gaussians as feature descriptors on a mass of partial surfaces. Furthermore, to address the numerous outliers inherently introduced by Gaussian Splatting, we devise a Closure Inferring Module (CIM) for comprehensive surface-based objectness deduction. CIM first estimates probabilistic feature residuals for partial surfaces, given the underdetermined nature of Gaussian Splatting, which are then coalesced into a holistic representation of the overall surface closure of the object proposal. In this way, the surface information exploited by Gaussian-Det serves as a prior on the quality and reliability of objectness and as the information basis for proposal refinement. Experiments on both synthetic and real-world datasets demonstrate that Gaussian-Det outperforms various existing approaches in terms of both average precision and recall.
https://arxiv.org/abs/2410.01404
Generative models can now produce photorealistic synthetic data that is virtually indistinguishable from the real data used to train them. This is a significant evolution over previous models, which could produce reasonable facsimiles of the training data but ones that humans could visually distinguish from it. Recent work on OOD detection has raised doubts that generative model likelihoods are optimal OOD detectors, due to issues involving likelihood misestimation, entropy in the generative process, and typicality. We speculate that generative OOD detectors also failed because their models focused on the pixels rather than the semantic content of the data, leading to failures in near-OOD cases where the pixels may be similar but the information content is significantly different. We hypothesize that estimating typical sets using self-supervised learners leads to better OOD detectors. We introduce a novel approach that leverages representation learning and informative summary statistics based on manifold estimation to address all of the aforementioned issues. Our method outperforms other unsupervised approaches and achieves state-of-the-art performance on well-established, challenging benchmarks and on new synthetic data detection tasks.
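One hedged way to realize "typicality in a learned representation space" is a k-NN distance score over self-supervised embeddings, as sketched below; the paper's actual manifold-based summary statistics may differ.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_ood_score(train_emb, test_emb, k=10):
    """Mean distance to the k nearest training embeddings; a simple
    manifold-based statistic (larger => less typical => more OOD)."""
    nn = NearestNeighbors(n_neighbors=k).fit(train_emb)
    dist, _ = nn.kneighbors(test_emb)
    return dist.mean(axis=1)

in_dist = np.random.randn(1000, 128)      # stand-in for SSL embeddings
ood = np.random.randn(50, 128) + 4.0      # shifted cluster as OOD stand-in
print(knn_ood_score(in_dist, in_dist[:50]).mean(),   # smaller
      knn_ood_score(in_dist, ood).mean())            # larger
```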
https://arxiv.org/abs/2410.01322
LiDAR-based 3D object detectors have been largely utilized in various applications, including autonomous vehicles or mobile robots. However, LiDAR-based detectors often fail to adapt well to target domains with different sensor configurations (e.g., types of sensors, spatial resolution, or FOVs) and location shifts. Collecting and annotating datasets in a new setup is commonly required to reduce such gaps, but it is often expensive and time-consuming. Recent studies suggest that pre-trained backbones can be learned in a self-supervised manner with large-scale unlabeled LiDAR frames. However, despite their expressive representations, they remain challenging to generalize well without substantial amounts of data from the target domain. Thus, we propose a novel method, called Domain Adaptive Distill-Tuning (DADT), to adapt a pre-trained model with limited target data (approximately 100 LiDAR frames), retaining its representation power and preventing it from overfitting. Specifically, we use regularizers to align object-level and context-level representations between the pre-trained and finetuned models in a teacher-student architecture. Our experiments with driving benchmarks, i.e., Waymo Open dataset and KITTI, confirm that our method effectively finetunes a pre-trained model, achieving significant gains in accuracy.
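The teacher-student alignment regularizers might look roughly like the following, assuming both models expose context-level feature maps and object-level ROI features; the loss form (MSE) and the weights are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def dadt_regularizer(student_ctx, teacher_ctx, student_obj, teacher_obj,
                     w_ctx=1.0, w_obj=1.0):
    """student_ctx/teacher_ctx: (B, C, H, W) context-level feature maps;
    student_obj/teacher_obj: (N, C) object-level (ROI) features.
    The teacher (pre-trained model) is frozen, hence the detach()."""
    loss_ctx = F.mse_loss(student_ctx, teacher_ctx.detach())
    loss_obj = F.mse_loss(student_obj, teacher_obj.detach())
    return w_ctx * loss_ctx + w_obj * loss_obj

s_ctx, t_ctx = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
s_obj, t_obj = torch.randn(10, 64), torch.randn(10, 64)
print(dadt_regularizer(s_ctx, t_ctx, s_obj, t_obj).item())
```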
https://arxiv.org/abs/2410.01319
3D object detection with omnidirectional views enables safety-critical applications such as mobile robot navigation. Such applications increasingly operate on resource-constrained edge devices, facilitating reliable processing without privacy concerns or network delays. To enable cost-effective deployment, cameras have been widely adopted as a low-cost alternative to LiDAR sensors. However, the compute-intensive workload needed to achieve high performance with camera-based solutions remains challenging due to the computational limitations of edge devices. In this paper, we present Panopticus, a carefully designed system for omnidirectional, camera-based 3D detection on edge devices. Panopticus employs an adaptive multi-branch detection scheme that accounts for spatial complexities. To optimize accuracy within latency limits, Panopticus dynamically adjusts the model's architecture and operations based on available edge resources and spatial characteristics. We implemented Panopticus on three edge devices and conducted experiments across real-world environments based on a public self-driving dataset and our mobile 360° camera dataset. Experimental results showed that Panopticus improves accuracy by 62% on average given the strict latency objective of 33 ms. Also, Panopticus achieves a 2.1× latency reduction on average compared to baselines.
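A hypothetical sketch of latency-aware branch selection: per camera view, pick the most capable branch that still fits the remaining budget, letting harder views choose first. The greedy policy and all numbers are illustrative, not Panopticus's actual scheduler.

```python
def select_branches(views, branches, budget_ms=33.0):
    """views: spatial-complexity scores (higher = harder); branches:
    (latency_ms, capacity) tuples, cheapest first. Returns one
    (complexity, branch) choice per view."""
    plan, remaining = [], budget_ms
    for complexity in sorted(views, reverse=True):        # hardest first
        per_view = remaining / max(len(views) - len(plan), 1)
        affordable = [b for b in branches if b[0] <= per_view]
        choice = max(affordable, key=lambda b: b[1]) if affordable else branches[0]
        plan.append((complexity, choice))
        remaining -= choice[0]
    return plan

print(select_branches([0.9, 0.2, 0.5], [(3.0, 1), (6.0, 2), (12.0, 3)]))
```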
https://arxiv.org/abs/2410.01270
Designing roadside sensing for intelligent transportation applications requires balancing cost and performance, especially when choosing between high- and low-resolution sensors. The tradeoff is challenging due to sensor heterogeneity, where different sensors produce unique data modalities due to varying physical principles. High-resolution LiDAR offers detailed point clouds, while 4D millimeter-wave radar, despite providing sparser data, delivers velocity information useful for distinguishing objects based on movement patterns. To assess whether reductions in spatial resolution can be compensated by the informational richness of sensors, particularly in recognizing both vehicles and vulnerable road users (VRUs), we propose Residual Fusion Net (ResFusionNet) to fuse multimodal data for 3D object detection. This enables a quantifiable tradeoff between spatial resolution and information richness across different modalities. Furthermore, we introduce a sensor placement algorithm utilizing probabilistic modeling to manage uncertainties in sensor visibility influenced by environmental or human-related factors. Through simulation-assisted ex-ante evaluation on a real-world testbed, our findings show marked gains in detecting VRUs (an average of 16.7% for pedestrians and 11% for cyclists) when merging velocity-encoded radar with LiDAR, compared to LiDAR-only configurations. Additionally, experimental results from 300 runs reveal a maximum loss of 11.5% and an average loss of 5.25% in sensor coverage due to uncertainty factors. These findings underscore the potential of using low-spatial-resolution but information-rich sensors to enhance detection capabilities for vulnerable road users, while highlighting the necessity of thoroughly evaluating sensor modality heterogeneity, traffic participant diversity, and operational uncertainties when making sensor tradeoffs in practical applications.
https://arxiv.org/abs/2410.01250
This study proposes a novel deep learning framework inspired by atmospheric scattering and human visual cortex mechanisms to enhance object detection under poor visibility scenarios such as fog, smoke, and haze. These conditions pose significant challenges for object recognition, impacting various sectors, including autonomous driving, aviation management, and security systems. The objective is to enhance the precision and reliability of detection systems under adverse environmental conditions. The research investigates the integration of human-like visual cues, particularly focusing on selective attention and environmental adaptability, to ascertain their impact on object detection's computational efficiency and accuracy. This paper proposes a multi-tiered strategy that integrates an initial quick detection process, followed by targeted region-specific dehazing, and concludes with an in-depth detection phase. The approach is validated using the Foggy Cityscapes, RESIDE-beta (OTS and RTTS) datasets and is anticipated to set new performance standards in detection accuracy while significantly optimizing computational efficiency. The findings offer a viable solution for enhancing object detection in poor visibility and contribute to the broader understanding of integrating human visual principles into deep learning algorithms for intricate visual recognition challenges.
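The three tiers can be sketched as plain control flow; `quick_detect`, `dehaze_region`, and `detailed_detect` below are hypothetical stubs standing in for the paper's components, not its implementation.

```python
import numpy as np

def quick_detect(image):             # tier 1: cheap pass, coarse candidates
    return [(10, 10, 60, 60)]        # (x1, y1, x2, y2) stub output

def dehaze_region(image, box):       # tier 2: region-specific dehazing (stub)
    x1, y1, x2, y2 = box
    image[y1:y2, x1:x2] = np.clip(image[y1:y2, x1:x2] * 1.3, 0, 255)
    return image

def detailed_detect(image, box):     # tier 3: in-depth pass on restored region
    return {"box": box, "score": 0.9}

def tiered_pipeline(image):
    detections = []
    for box in quick_detect(image):           # screen the whole frame fast
        image = dehaze_region(image, box)     # dehaze only where it matters
        detections.append(detailed_detect(image, box))
    return detections

print(tiered_pipeline(np.full((100, 100, 3), 80.0)))
```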
https://arxiv.org/abs/2410.01225
Accurate and efficient characterization of nanoparticles (NPs), particularly regarding particle size distribution, is essential for advancing our understanding of their structure-property relationships and facilitating their design for various applications. In this study, we introduce a novel two-stage artificial intelligence (AI)-driven workflow for NP analysis that leverages prompt engineering techniques from state-of-the-art single-stage object detection and large-scale vision transformer (ViT) architectures. This methodology was applied to transmission electron microscopy (TEM) and scanning TEM (STEM) images of heterogeneous catalysts, enabling high-resolution, high-throughput analysis of particle size distributions for supported metal catalysts. The model's performance in detecting and segmenting NPs was validated across diverse heterogeneous catalyst systems, including various metals (Cu, Ru, Pt, and PtCo), supports (silica ($\text{SiO}_2$), $\gamma$-alumina ($\gamma$-$\text{Al}_2\text{O}_3$), and carbon black), and particle diameter distributions with means and standard deviations of 2.9 $\pm$ 1.1 nm, 1.6 $\pm$ 0.2 nm, 9.7 $\pm$ 4.6 nm, and 4 $\pm$ 1.0 nm. Additionally, the proposed machine learning (ML) approach successfully detects and segments overlapping NPs anchored on non-uniform catalytic support materials, providing critical insights into their spatial arrangements and interactions. Our AI-assisted NP analysis workflow demonstrates robust generalization across diverse datasets and can be readily applied to similar NP segmentation tasks without requiring costly model retraining.
https://arxiv.org/abs/2410.01213
Fuzzy object detection is a challenging field of research in computer vision (CV). Distinguishing between fuzzy and non-fuzzy object detection in CV is important. Fuzzy objects such as fire, smoke, mist, and steam present significantly greater complexities in terms of visual features, blurred edges, varying shapes, opacity, and volume compared to non-fuzzy objects such as trees and cars. Collecting a balanced and diverse dataset with accurate annotation is crucial to achieving better ML models for fuzzy objects; however, collection and annotation remain highly manual tasks. In this research, we propose and leverage an alternative method of generating and automatically annotating fully synthetic fire images based on 3D models for training an object detection model. Moreover, the performance and efficiency of ML models trained on synthetic images are compared with ML models trained on real imagery and mixed imagery. Our findings prove the effectiveness of the synthetic data for fire detection, and performance improves as the test dataset covers a broader spectrum of real fires. Our findings illustrate that when synthetic and real imagery are utilized in a mixed training set, the resulting ML model outperforms models trained on real imagery as well as models trained on synthetic imagery for detection of a broad spectrum of fires. The proposed method for automating the annotation of synthetic fuzzy-object imagery carries substantial implications for reducing both the time and cost of creating computer vision models specifically tailored to detecting fuzzy objects.
https://arxiv.org/abs/2410.01124
As the uses of augmented reality (AR) become more complex and widely available, AR applications will increasingly incorporate intelligent features that require developers to understand the user's behavior and surrounding environment (e.g. an intelligent assistant). Such applications rely on video captured by an AR headset, which often contains disjointed camera movement with a limited field of view that cannot capture the full scope of what the user sees at any given time. Moreover, standard methods of visualizing object detection model outputs are limited to capturing objects within a single frame and timestep, and therefore fail to capture the temporal and spatial context that is often necessary for various domain applications. We propose ARPOV, an interactive visual analytics tool for analyzing object detection model outputs tailored to video captured by an AR headset that maximizes user understanding of model performance. The proposed tool leverages panorama stitching to expand the view of the environment while automatically filtering undesirable frames, and includes interactive features that facilitate object detection model debugging. ARPOV was designed as part of a collaboration between visualization researchers and machine learning and AR experts; we validate our design choices through interviews with 5 domain experts.
https://arxiv.org/abs/2410.01055
Children often suffer wrist trauma in daily life, and they usually need radiologists to analyze and interpret X-ray images before surgical treatment. The development of deep learning has enabled neural networks to serve as computer-assisted diagnosis (CAD) tools that help doctors and experts in medical image diagnostics. Since the You Only Look Once Version-8 (YOLOv8) model has achieved satisfactory success in object detection tasks, it has been applied to various fracture detection tasks. This work introduces four variants of the Feature Contexts Excitation-YOLOv8 (FCE-YOLOv8) model, each incorporating a different FCE module (i.e., Squeeze-and-Excitation (SE), Global Context (GC), Gather-Excite (GE), or Gaussian Context Transformer (GCT)) to enhance model performance. Experimental results on the GRAZPEDWRI-DX dataset demonstrate that our proposed YOLOv8+GC-M3 model improves the mAP@50 value from 65.78% to 66.32%, outperforming the state-of-the-art (SOTA) model while reducing inference time. Furthermore, our proposed YOLOv8+SE-M3 model achieves the highest mAP@50 value of 67.07%, exceeding SOTA performance. The implementation of this work is available at this https URL.
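For reference, a standard Squeeze-and-Excitation (SE) block, one of the four FCE modules named above; where the "M3" variant inserts it into YOLOv8 follows the paper and is not reproduced here.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                      # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                 # squeeze: global average pool
        w = self.fc(w)                         # excite: per-channel gates
        return x * w[:, :, None, None]         # reweight feature channels

print(SEBlock(64)(torch.randn(2, 64, 8, 8)).shape)  # torch.Size([2, 64, 8, 8])
```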
https://arxiv.org/abs/2410.01031
Despite their success in various vision tasks, deep neural network architectures often underperform in out-of-distribution scenarios due to the difference between training and target domain style. To address this limitation, we introduce One-Shot Style Adaptation (OSSA), a novel unsupervised domain adaptation method for object detection that utilizes a single, unlabeled target image to approximate the target domain style. Specifically, OSSA generates diverse target styles by perturbing the style statistics derived from a single target image and then applies these styles to a labeled source dataset at the feature level using Adaptive Instance Normalization (AdaIN). Extensive experiments show that OSSA establishes a new state-of-the-art among one-shot domain adaptation methods by a significant margin, and in some cases, even outperforms strong baselines that use thousands of unlabeled target images. By applying OSSA in various scenarios, including weather, simulated-to-real (sim2real), and visual-to-thermal adaptations, our study explores the overarching significance of the style gap in these contexts. OSSA's simplicity and efficiency allow easy integration into existing frameworks, providing a potentially viable solution for practical applications with limited data availability. Code is available at this https URL
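The core OSSA operation admits a short sketch: perturb the channel-wise style statistics of the single target image and impose them on source features via AdaIN. The Gaussian perturbation scale below is an assumption.

```python
import torch

def adain_with_perturbed_style(src_feat, tgt_feat, noise_std=0.1):
    """src_feat: (B, C, H, W) labeled-source features;
    tgt_feat: (1, C, H, W) features of the single target image."""
    mu_t = tgt_feat.mean(dim=(2, 3), keepdim=True)
    sigma_t = tgt_feat.std(dim=(2, 3), keepdim=True)
    # Perturb the target statistics to generate diverse target styles.
    mu_t = mu_t + noise_std * torch.randn_like(mu_t)
    sigma_t = (sigma_t + noise_std * torch.randn_like(sigma_t)).clamp(min=1e-5)
    # AdaIN: normalize source features, then re-style with target stats.
    mu_s = src_feat.mean(dim=(2, 3), keepdim=True)
    sigma_s = src_feat.std(dim=(2, 3), keepdim=True)
    return sigma_t * (src_feat - mu_s) / (sigma_s + 1e-5) + mu_t

out = adain_with_perturbed_style(torch.randn(4, 32, 16, 16),
                                 torch.randn(1, 32, 16, 16))
print(out.shape)  # torch.Size([4, 32, 16, 16])
```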
https://arxiv.org/abs/2410.00900
Efficient point cloud (PC) compression is crucial for streaming applications such as augmented reality and cooperative perception. Classic PC compression techniques encode all the points in a frame. Tailoring compression toward perception tasks at the receiver side, we ask the question, "Can we remove the ground points during transmission without sacrificing the detection performance?" Our study reveals that state-of-the-art (SOTA) 3D object detection models depend strongly on the ground, especially on those points below and around the object. In this work, we propose a lightweight obstacle-aware Pillar-based Ground Removal (PGR) algorithm. PGR filters out ground points that do not provide context for object recognition, significantly improving the compression ratio without sacrificing receiver-side perception performance. Because it uses no heavy object detection or semantic segmentation models, PGR is lightweight, highly parallelizable, and effective. Our evaluations on KITTI and the Waymo Open Dataset show that SOTA detection models work equally well with PGR removing 20-30% of the points, while PGR itself runs at 86 FPS.
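A simplified, obstacle-agnostic sketch of pillar-based ground removal: bin points into vertical pillars, estimate each pillar's ground height from its lowest point, and drop points within a margin of it. PGR's obstacle-aware exceptions (keeping ground context near objects) are omitted here for brevity.

```python
import numpy as np

def pillar_ground_removal(points, pillar=0.5, margin=0.2):
    """points: (N, 3) xyz. Returns the non-ground subset."""
    ij = np.floor(points[:, :2] / pillar).astype(np.int64)
    keys = ij[:, 0] * 100003 + ij[:, 1]          # hash 2D pillar indices
    keep = np.ones(len(points), dtype=bool)
    for k in np.unique(keys):
        mask = keys == k
        ground_z = points[mask, 2].min()          # pillar's ground estimate
        keep[mask] = points[mask, 2] > ground_z + margin
    return points[keep]

pts = np.random.rand(10000, 3) * [50, 50, 3]      # synthetic scene
print(len(pillar_ground_removal(pts)) / len(pts))  # fraction retained
```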
https://arxiv.org/abs/2410.00582