The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. Astronomers are turning to deep learning techniques to address this, but these methods are limited by their specific training sets, leading to considerable duplicated workloads. To show how this issue can be overcome, we built a framework for the general analysis of galaxy images based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratio of galaxy images and the imbalanced distribution of galaxy categories, we incorporated a human-in-the-loop (HITL) module into our large vision model, which leverages human knowledge to enhance the reliability and interpretability of galaxy-image processing interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability to all the abovementioned tasks on galaxy images in the DESI Legacy Imaging Surveys. Specifically, for object detection trained on 1,000 data points, our DST on top of the LVM achieves an accuracy of 96.7%, while ResNet50 plus Mask R-CNN reaches 93.1%; for morphology classification, to obtain an AUC of ~0.9, the LVM plus DST and HITL requires only 1/50 of the training set needed by ResNet18. We expect multimodal data to be integrated in a similar manner, which opens up possibilities for joint analyses of datasets spanning diverse domains in the era of multi-messenger astronomy.
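As an illustration of the LVM-plus-DST pattern described above, the sketch below trains a small task head on frozen backbone embeddings, which is what enables few-shot behavior; the backbone, embedding size, and data loader are hypothetical placeholders, not the authors' released code.

```python
import torch
import torch.nn as nn

# Hypothetical lightweight downstream head; the frozen pretrained LVM
# supplies embeddings, and only this head is trained on the few labels.
class MorphologyHead(nn.Module):
    def __init__(self, embed_dim=1024, n_classes=10):
        super().__init__()
        self.head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, n_classes),
        )

    def forward(self, emb):
        return self.head(emb)

def train_head(backbone, head, loader, epochs=10, lr=1e-3):
    backbone.eval()                        # freeze the large vision model
    for p in backbone.parameters():
        p.requires_grad_(False)
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                emb = backbone(images)     # (B, embed_dim) galaxy embeddings
            loss = loss_fn(head(emb), labels)
            opt.zero_grad(); loss.backward(); opt.step()
```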
https://arxiv.org/abs/2405.10890
Objectives: This work aims to explore the impact of multicenter data heterogeneity on deep learning brain metastases (BM) autosegmentation performance, and to assess the efficacy of an incremental transfer learning technique, namely learning without forgetting (LWF), for improving model generalizability without sharing raw data. Materials and methods: A total of six BM datasets from University Hospital Erlangen (UKER), University Hospital Zurich (USZ), Stanford, UCSF, NYU, and the BraTS Challenge 2023 on BM segmentation were used for this evaluation. First, the multicenter performance of a convolutional neural network (DeepMedic) for BM autosegmentation was established for exclusive single-center training and for training on pooled data, respectively. Subsequently, bilateral collaboration was evaluated, in which a UKER-pretrained model is shared with another center for further training using transfer learning (TL), either with or without LWF. Results: For single-center training, average F1 scores of BM detection range from 0.625 (NYU) to 0.876 (UKER) on the respective single-center test data. Mixed multicenter training notably improves F1 scores at Stanford and NYU, with negligible improvement at the other centers. When the UKER-pretrained model is applied to USZ, LWF achieves a higher average F1 score (0.839) than naive TL (0.570) and single-center training (0.688) on combined UKER and USZ test data. Naive TL improves sensitivity and contouring accuracy, but compromises precision. Conversely, LWF demonstrates commendable sensitivity, precision, and contouring accuracy. When applied to Stanford, similar performance is observed. Conclusion: Data heterogeneity results in varying performance in BM autosegmentation, posing challenges to model generalizability. LWF is a promising approach to peer-to-peer privacy-preserving model training.
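A minimal sketch of the LWF idea as used here: when fine-tuning the shared model at the second center, the outputs of a frozen copy of the pretrained model act as soft targets so earlier knowledge is preserved. The segmentation network, temperature, and loss weighting below are assumptions, not the paper's exact DeepMedic configuration.

```python
import copy
import torch
import torch.nn.functional as F

def lwf_step(model, old_model, images, labels, optimizer, lam=1.0, T=2.0):
    """One training step of learning without forgetting (LWF):
    new-center supervision plus distillation toward the frozen
    pretrained model's predictions on the same inputs."""
    old_model.eval()
    logits = model(images)                        # (B, C, ...) voxel logits
    with torch.no_grad():
        old_logits = old_model(images)
    task_loss = F.cross_entropy(logits, labels)   # supervised loss, new center
    distill = F.kl_div(                           # preserve old behavior
        F.log_softmax(logits / T, dim=1),
        F.softmax(old_logits / T, dim=1),
        reduction="batchmean",
    ) * T * T
    loss = task_loss + lam * distill
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# Usage: freeze a copy of the UKER-pretrained model before local training:
# old_model = copy.deepcopy(pretrained_model)
```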
https://arxiv.org/abs/2405.10870
This paper presents a novel approach to the digital signing of electronic documents through the use of a camera-based interaction system, single-finger tracking for sign recognition, and hand gestures for executing multiple commands. The proposed solution, referred to as "Air Signature," involves writing the signature in front of the camera, rather than relying on traditional methods such as mouse drawing or physically signing on paper and showing it to a web camera. The goal is to develop a state-of-the-art method for detecting and tracking gestures and objects in real time. The proposed methods include applying existing gesture recognition and object tracking systems, improving accuracy through smoothing and line drawing, and maintaining continuity during fast finger movements. An evaluation of the fingertip detection, sketching, and overall signing process is performed to assess the effectiveness of the proposed solution. The secondary objective of this research is to develop a model that can effectively recognize a user's unique signature. This type of signature can be verified by neural cores that analyze the movement, speed, and stroke pixels of the signing in real time. The neural cores use machine learning algorithms to match air signatures to the individual's stored signatures, providing a secure and efficient method of verification. Our proposed system does not require sensors or any hardware other than the camera.
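A sketch of the fingertip-tracking and smoothing stage, using MediaPipe Hands and OpenCV as stand-in components (the paper's own tracker may differ); exponential smoothing keeps the drawn stroke continuous during fast finger movements.

```python
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(max_num_hands=1, min_detection_confidence=0.7)
TIP = mp.solutions.hands.HandLandmark.INDEX_FINGER_TIP

cap = cv2.VideoCapture(0)
trail, smoothed, alpha = [], None, 0.4   # alpha: smoothing strength

while True:
    ok, frame = cap.read()
    if not ok:
        break
    res = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if res.multi_hand_landmarks:
        lm = res.multi_hand_landmarks[0].landmark[TIP]
        h, w = frame.shape[:2]
        pt = (int(lm.x * w), int(lm.y * h))
        # exponential smoothing reduces jitter during fast movements
        smoothed = pt if smoothed is None else (
            int(alpha * pt[0] + (1 - alpha) * smoothed[0]),
            int(alpha * pt[1] + (1 - alpha) * smoothed[1]),
        )
        trail.append(smoothed)
    for p, q in zip(trail, trail[1:]):   # draw the signature stroke
        cv2.line(frame, p, q, (0, 255, 0), 2)
    cv2.imshow("air signature", frame)
    if cv2.waitKey(1) & 0xFF == 27:      # Esc to quit
        break
cap.release()
```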
https://arxiv.org/abs/2405.10868
Spatio-temporal action detection (STAD) is an important fine-grained video understanding task. Current methods require box and label supervision for all action classes in advance. However, in real-world applications, new action classes not seen in training are very likely to be encountered, because the action category space is large and hard to enumerate. Moreover, the cost of data annotation and model training for new classes is extremely high for traditional methods, as detailed box annotations must be performed and the whole network re-trained from scratch. In this paper, we propose a new and challenging setting, open-vocabulary STAD, to better mimic the situation of action detection in an open world. Open-vocabulary spatio-temporal action detection (OV-STAD) requires training a model on a limited set of base classes with box and label supervision, which is expected to yield good generalization performance on novel action classes. For OV-STAD, we build two benchmarks based on existing STAD datasets and propose a simple but effective method based on pretrained video-language models (VLM). To better adapt the holistic VLM to the fine-grained action detection task, we carefully fine-tune it on localized video region-text pairs. This customized fine-tuning endows the VLM with better motion understanding, thus contributing to more accurate alignment between video regions and texts. Fusing local region features with global video features before alignment further improves action detection performance by providing global context. Our method achieves promising performance on novel classes.
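A sketch of the region-text alignment step: pooled region features and text embeddings are projected to a shared space and matched by cosine similarity, as in common VLM fine-tuning recipes; the dimensions and encoders are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionTextAligner(nn.Module):
    def __init__(self, region_dim=768, text_dim=512, joint_dim=256):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)
        self.logit_scale = nn.Parameter(torch.tensor(1 / 0.07).log())

    def forward(self, region_feats, text_feats):
        # region_feats: (N, region_dim) pooled actor-region features,
        # optionally fused with the global video feature beforehand.
        # text_feats: (C, text_dim) action-class text embeddings.
        r = F.normalize(self.region_proj(region_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return self.logit_scale.exp() * r @ t.T   # (N, C) similarity logits

# Training: cross-entropy over the (N, C) logits with ground-truth action
# labels aligns regions with texts; at test time, novel classes are scored
# simply by encoding their names as new text embeddings.
```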
https://arxiv.org/abs/2405.10832
Earlier diagnosis of leukemia can save thousands of lives annually. Leukemia prognosis is challenging without the morphological information of White Blood Cells (WBC) and relies on access to expensive microscopes and on the availability of hematologists to analyze Peripheral Blood Samples (PBS). Deep learning based methods can be employed to assist hematologists. However, these algorithms require a large amount of labeled data, which is not readily available. To overcome this limitation, we have acquired a realistic, generalized, and large dataset. To collect this comprehensive dataset for real-world applications, two microscopes from two different cost spectrums (high-cost HCM and low-cost LCM) were used for dataset capture at three magnifications (100x, 40x, 10x) through different sensors (a high-end camera for HCM, a middle-level camera for LCM, and a mobile-phone camera for both). The high-end camera is 47 times more expensive than the middle-level camera, and HCM is 17 times more expensive than LCM. In this collection, using HCM at high resolution (100x), experienced hematologists annotated 10.3k WBCs across 14 types plus artifacts, yielding 55k morphological labels (Cell Size, Nuclear Chromatin, Nuclear Shape, etc.) from 2.4k images of several PBS leukemia patients. These annotations were then transferred to the other two magnifications of HCM, to the three magnifications of LCM, and to the images captured by each camera. Along with the LeukemiaAttri dataset, we provide baselines over multiple object detectors and Unsupervised Domain Adaptation (UDA) strategies, along with morphological-information-based attribute prediction. The dataset will be made publicly available after publication to facilitate research in this direction.
https://arxiv.org/abs/2405.10803
Place recognition is a fundamental task for robotic applications, allowing robots to perform loop closure detection within simultaneous localization and mapping (SLAM) and to achieve relocalization on prior maps. Current range image-based networks use single-column convolution to keep features invariant to the shifts in image columns caused by LiDAR viewpoint changes. However, this raises issues such as "restricted receptive fields" and "excessive focus on local regions", degrading network performance. To address these issues, we propose a lightweight circular convolutional Transformer network, denoted CCTNet, which boosts performance by capturing structural information in point clouds and facilitating cross-dimensional interaction of spatial and channel information. First, a Circular Convolution Module (CCM) is introduced, expanding the network's receptive field while maintaining feature consistency across varying LiDAR perspectives. Then, a Range Transformer Module (RTM) is proposed, which enhances place recognition accuracy in scenarios with movable objects by employing a combination of channel and spatial attention mechanisms. Furthermore, we propose an overlap-based loss function, transforming the place recognition task from binary loop closure classification into a regression problem linked to the overlap between LiDAR frames. Through extensive experiments on the KITTI and Ford Campus datasets, CCTNet surpasses comparable methods, achieving Recall@1 of 0.924 and 0.965, and Recall@1% of 0.990 and 0.993 on the respective test sets, showcasing superior performance. Results on a self-collected dataset further demonstrate the proposed method's potential for practical deployment in complex scenarios with movable objects, showing improved generalization across various datasets.
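A minimal sketch of the circular-convolution idea behind the CCM: the range image is padded circularly along the azimuth (width) axis, so column shifts caused by LiDAR viewpoint changes wrap around instead of being truncated at the image border; layer sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CircularConv(nn.Module):
    """Convolution over a LiDAR range image with circular padding along
    the azimuth (width) axis, keeping features consistent under sensor
    rotation; zero padding is used along elevation (height)."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.k = k
        self.conv = nn.Conv2d(c_in, c_out, k, padding=(k // 2, 0))

    def forward(self, x):                       # x: (B, C, H, W), W = azimuth
        p = self.k // 2
        x = F.pad(x, (p, p, 0, 0), mode="circular")  # wrap around azimuth
        return self.conv(x)

# Usage: CircularConv(1, 32)(torch.randn(2, 1, 64, 900)) keeps the
# 64 x 900 spatial shape while the receptive field wraps at the seams.
```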
https://arxiv.org/abs/2405.10793
Coreference resolution, critical for identifying textual mentions that refer to the same entity, faces challenges in pronoun resolution, particularly in identifying pronoun antecedents. Existing methods often treat pronoun resolution as a task separate from mention detection, potentially missing valuable information. This study proposes the first end-to-end neural network system for Persian pronoun resolution, leveraging pre-trained Transformer models such as ParsBERT. Our system jointly optimizes both mention detection and antecedent linking, achieving a 3.37-point F1 improvement over the previous state-of-the-art system (which relied on rule-based and statistical methods) on the Mehr corpus. This significant improvement demonstrates the effectiveness of combining neural networks with linguistic models, potentially marking a significant advancement in Persian pronoun resolution and paving the way for further research in this under-explored area.
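A compact sketch of joint mention detection and antecedent linking in the end-to-end style the abstract describes: spans are scored for mention-ness, pairs are scored for coreference, and both scores are optimized together so the subtasks share gradients; the ParsBERT encoder and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class JointCorefScorer(nn.Module):
    """Jointly scores (i) whether a span is a mention and (ii) how likely
    a candidate antecedent is, so both subtasks are trained together."""
    def __init__(self, hidden=768):
        super().__init__()
        self.mention_ff = nn.Sequential(nn.Linear(hidden, 256), nn.ReLU(),
                                        nn.Linear(256, 1))
        self.pair_ff = nn.Sequential(nn.Linear(hidden * 2, 256), nn.ReLU(),
                                     nn.Linear(256, 1))

    def forward(self, span_reprs):
        # span_reprs: (S, hidden) candidate-span embeddings, e.g. from ParsBERT
        m = self.mention_ff(span_reprs).squeeze(-1)   # (S,) mention scores
        S = span_reprs.size(0)
        pairs = torch.cat([
            span_reprs.unsqueeze(1).expand(S, S, -1),
            span_reprs.unsqueeze(0).expand(S, S, -1),
        ], dim=-1)
        a = self.pair_ff(pairs).squeeze(-1)           # (S, S) antecedent scores
        # final coreference score combines mention and pairwise evidence
        return m.unsqueeze(1) + m.unsqueeze(0) + a
```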
https://arxiv.org/abs/2405.10714
The Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection shared task in the SemEval-2024 competition aims to tackle the problem of misusing collaborative human-AI writing. Although there are many existing detectors of AI-generated content, they are often designed to give a binary answer and thus may not be suitable for the more nuanced problem of finding the boundaries between human-written and machine-generated text, even as hybrid human-AI writing becomes more and more popular. In this paper, we address the boundary detection problem. In particular, we present a pipeline for augmenting data for supervised fine-tuning of DeBERTaV3. With this pipeline, we achieve a new best MAE score on the competition leaderboard.
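A sketch of boundary detection framed as token classification with DeBERTaV3 (each token labeled human vs. machine, with the switch point read off the label sequence); the data-augmentation pipeline itself is the paper's contribution and is not reproduced here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
model = AutoModelForTokenClassification.from_pretrained(
    "microsoft/deberta-v3-base", num_labels=2)   # 0 = human, 1 = machine

text = "A human-written opening sentence. Then machine text continues..."
enc = tok(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    labels = model(**enc).logits.argmax(-1)[0]   # (seq_len,) per-token labels

# the predicted boundary is the first token whose label flips to "machine"
flip = (labels == 1).nonzero()
boundary_token = flip[0].item() if len(flip) else None
```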
https://arxiv.org/abs/2405.10629
Deep learning methods for time series have already reached excellent performances in both prediction and classification tasks, including anomaly detection. However, the complexity inherent in Cyber Physical Systems (CPS) creates a challenge when it comes to explainability methods. To overcome this inherent lack of interpretability, we propose ECATS, a concept-based neuro-symbolic architecture where concepts are represented as Signal Temporal Logic (STL) formulae. Leveraging kernel-based methods for STL, concept embeddings are learnt in an unsupervised manner through a cross-attention mechanism. The network makes class predictions through these concept embeddings, allowing for a meaningful explanation to be naturally extracted for each input. Our preliminary experiments with a simple CPS-based dataset show that our model is able to achieve great classification performance while ensuring local interpretability.
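A sketch of the concept-bottleneck step: trajectory features cross-attend to a bank of concept embeddings, and the class prediction is made only from the resulting concept activations, which is what makes per-input explanations possible. Sizes, head counts, and the STL kernel features are assumptions.

```python
import torch
import torch.nn as nn

class ConceptBottleneck(nn.Module):
    def __init__(self, feat_dim=64, n_concepts=16, n_classes=2):
        super().__init__()
        # learnable embeddings, one per STL-formula concept
        self.concepts = nn.Parameter(torch.randn(n_concepts, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                          batch_first=True)
        self.classifier = nn.Linear(n_concepts, n_classes)

    def forward(self, signal_feats):
        # signal_feats: (B, T, feat_dim) features of a CPS trajectory
        B = signal_feats.size(0)
        q = self.concepts.unsqueeze(0).expand(B, -1, -1)  # concepts query signal
        _, attn_w = self.attn(q, signal_feats, signal_feats)
        act = attn_w.mean(-1)              # (B, n_concepts) concept activations
        return self.classifier(act), act   # logits + explanation vector
```

Because the classifier sees only the concept activations, each prediction can be explained by reporting which STL concepts fired most strongly for that input.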
https://arxiv.org/abs/2405.10608
In this work, we explore idiomatic language processing with Large Language Models (LLMs). We introduce the Idiomatic language Test Suite IdioTS, a new dataset of difficult examples specifically designed by language experts to assess the capabilities of LLMs to process figurative language at the sentence level. We propose a comprehensive evaluation methodology based on an idiom detection task, where LLMs are prompted to detect an idiomatic expression in a given English sentence. We present a thorough automatic and manual evaluation of the results and an extensive error analysis.
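A sketch of the kind of idiom-detection prompt this evaluation methodology implies; the exact wording used in the paper is not reproduced, and `call_llm` is a placeholder for whichever chat-model API is used.

```python
PROMPT = """You are a linguistic expert. Does the following English sentence
contain an idiomatic (figurative) expression? If yes, quote the expression;
if no, answer "none".

Sentence: "{sentence}"
Answer:"""

def detect_idiom(sentence: str, call_llm) -> str:
    """call_llm: placeholder for an LLM completion function,
    e.g. a thin wrapper around a chat-completions endpoint."""
    return call_llm(PROMPT.format(sentence=sentence)).strip()

# Example: detect_idiom("He finally kicked the bucket.", call_llm)
# would ideally return: kicked the bucket
```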
https://arxiv.org/abs/2405.10579
Recent advances in multi-view camera-only 3D object detection either rely on an accurate reconstruction of bird's-eye-view (BEV) 3D features or on traditional 2D perspective view (PV) image features. While both have their own pros and cons, few have found a way to stitch them together in order to benefit from "the best of both worlds". To this end, we explore a duo space (i.e., BEV and PV) 3D perception framework, in conjunction with some useful duo space fusion strategies that allow effective aggregation of the two feature representations. To the best of our knowledge, our proposed method, DuoSpaceNet, is the first to leverage two distinct feature spaces and achieves the state-of-the-art 3D object detection and BEV map segmentation results on nuScenes dataset.
https://arxiv.org/abs/2405.10577
Text classification is a fundamental task in Natural Language Processing (NLP), and the advent of Large Language Models (LLMs) has revolutionized the field. This paper introduces the Smart Expert System, a novel approach that leverages LLMs as text classifiers. The system simplifies the traditional text classification workflow, eliminating the need for extensive preprocessing and domain expertise. The performance of several LLMs, machine learning (ML) algorithms, and neural network (NN) based structures is evaluated on four datasets. Results demonstrate that certain LLMs surpass traditional methods in sentiment analysis, spam SMS detection and multi-label classification. Furthermore, it is shown that the system's performance can be further enhanced through few-shot or fine-tuning strategies, making the fine-tuned model the top performer across all datasets. Source code and datasets are available in this GitHub repository: this https URL.
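As a sketch of the few-shot strategy mentioned above, a classification prompt can embed a handful of labeled examples before the query; the labels, examples, and the `call_llm` hook are illustrative placeholders rather than the system's actual prompts.

```python
FEW_SHOT = """Classify the message as "spam" or "ham".

Message: "WIN a free cruise! Reply NOW" -> spam
Message: "Are we still on for lunch tomorrow?" -> ham
Message: "{message}" ->"""

def classify_sms(message: str, call_llm) -> str:
    # call_llm: placeholder for the underlying LLM completion call
    out = call_llm(FEW_SHOT.format(message=message)).strip().lower()
    return "spam" if "spam" in out else "ham"
```

Note how this workflow needs no tokenization, feature engineering, or task-specific preprocessing, which is the simplification the abstract highlights.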
https://arxiv.org/abs/2405.10523
The Unmanned Aerial Vehicles (UAV) market has been growing significantly, and considering the availability of drones at low-cost prices, the possibility of their misuse for illegal purposes such as drug trafficking, spying, and terrorist attacks, posing high risks to national security, is rising. Therefore, detecting and tracking unauthorized drones to prevent future attacks that threaten lives, facilities, and security has become a necessity. Drone detection can be performed using different sensors; image-based detection is one of them owing to the development of artificial intelligence techniques. However, identifying the types of unauthorized drones is a challenge due to the lack of drone-type datasets. To that end, in this paper we provide a dataset of various drones as well as a comparison of recognized object detection models on the proposed dataset, including different versions of the YOLO algorithms (v3, v4, and v5) along with Detectron2. Experimental results for the different models are provided along with a description of each method. The collected dataset can be found in this https URL
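A sketch of running one of the compared detectors (YOLOv5, loaded via torch.hub) on a drone image; fine-tuning on the proposed drone-type dataset would follow the standard Ultralytics training recipe, and the image path below is a placeholder.

```python
import torch

# pretrained COCO weights; for drone-type recognition the model would be
# fine-tuned on the dataset proposed in the paper
model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

results = model("drone_image.jpg")   # placeholder path
results.print()                      # class, confidence, box per detection
df = results.pandas().xyxy[0]        # detections as a pandas DataFrame
print(df[["name", "confidence", "xmin", "ymin", "xmax", "ymax"]])
```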
https://arxiv.org/abs/2405.10398
Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates. To facilitate the use of referent tokens in subsequent language modeling, we have curated large-scale grounded language datasets that offer finer scene-text correspondence at the phrase level by bootstrapping existing object labels. We then introduce Contrastive LAnguage-Scene Pre-training (CLASP) to effectively leverage this data, thereby integrating 3D vision with language models. Our comprehensive evaluation covers open-ended tasks such as dense captioning and 3D QA, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and broad applicability of Grounded 3D-LLM. Code and datasets will be released on the project page: this https URL.
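A minimal sketch of a contrastive language-scene objective in the CLIP style that CLASP's name suggests: matched phrase and scene embeddings are pulled together with a symmetric InfoNCE loss. The encoders, batch construction, and temperature are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def clasp_loss(scene_emb, text_emb, temperature=0.07):
    """scene_emb, text_emb: (B, D) embeddings of matched
    scene-region / phrase pairs (row i matches row i)."""
    s = F.normalize(scene_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = s @ t.T / temperature            # (B, B) similarity matrix
    targets = torch.arange(len(s), device=s.device)
    # symmetric InfoNCE: scene -> text and text -> scene
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```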
https://arxiv.org/abs/2405.10370
Integrating an RGB camera into a ToF imaging system has become a significant technique for perceiving the real world. The RGB-guided ToF imaging system is crucial to several applications, including face anti-spoofing, saliency detection, and trajectory prediction. Depending on the working range, the implementation schemes of RGB-guided ToF imaging systems differ. Specifically, ToF sensors with a uniform field of illumination, which can output dense depth but have low resolution, are typically used for close-range measurements. In contrast, LiDARs, which emit laser pulses and can only capture sparse depth, are usually employed for long-range detection. In the two cases, depth quality improvement for RGB-guided ToF imaging corresponds to two sub-tasks: guided depth super-resolution and guided depth completion. In light of the significant boost that deep learning has recently provided to the field, this paper comprehensively reviews works related to RGB-guided ToF imaging, including network structures, learning strategies, evaluation metrics, benchmark datasets, and objective functions. In addition, we present quantitative comparisons of state-of-the-art methods on widely used benchmark datasets. Finally, we discuss future trends and the challenges in real applications for further research.
https://arxiv.org/abs/2405.10357
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for the faster speeds demanded by the many applications requiring edge deployment. The Grounding DINO 1.5 Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision backbone, and expanding the training dataset to over 20 million images with grounding annotations, thereby achieving a richer semantic understanding. The Grounding DINO 1.5 Edge model, while designed for efficiency with reduced feature scales, maintains robust detection capabilities by being trained on the same comprehensive dataset. Empirical results demonstrate the effectiveness of Grounding DINO 1.5, with the Grounding DINO 1.5 Pro model attaining a 54.3 AP on the COCO detection benchmark and a 55.7 AP on the LVIS-minival zero-shot transfer benchmark, setting new records for open-set object detection. Furthermore, the Grounding DINO 1.5 Edge model, when optimized with TensorRT, achieves a speed of 75.2 FPS while attaining a zero-shot performance of 36.2 AP on the LVIS-minival benchmark, making it more suitable for edge computing scenarios. Model examples and demos with API will be released at this https URL
https://arxiv.org/abs/2405.10300
Authorship obfuscation techniques hold the promise of helping people protect their privacy in online communications by automatically rewriting text to hide the identity of the original author. However, obfuscation has been evaluated in narrow settings in the NLP literature and has primarily been addressed with superficial edit operations that can lead to unnatural outputs. In this work, we introduce an automatic text privatization framework that fine-tunes a large language model via reinforcement learning to produce rewrites that balance soundness, sense, and privacy. We evaluate it extensively on a large-scale test set of short- to medium-length English Reddit posts by 68k authors. We study how performance changes across evaluative conditions, including authorial profile length and authorship detection strategy. Our method maintains high text quality according to both automated metrics and human evaluation, and successfully evades several automated authorship attacks.
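A sketch of the kind of reward such RL fine-tuning balances: a rewrite earns reward for preserving meaning and fluency (sense and soundness) while lowering an authorship classifier's confidence (privacy). The component models and weights here are placeholders, not the paper's exact formulation.

```python
def obfuscation_reward(original: str, rewrite: str,
                       sim_model, fluency_model, author_clf,
                       w_sim=1.0, w_flu=0.5, w_priv=1.0) -> float:
    """sim_model(a, b)  -> semantic similarity in [0, 1]   (sense)
    fluency_model(t)    -> fluency/acceptability in [0, 1] (soundness)
    author_clf(t)       -> probability the true author wrote t (privacy risk)
    All three are placeholder callables for illustration."""
    sense = sim_model(original, rewrite)
    soundness = fluency_model(rewrite)
    privacy = 1.0 - author_clf(rewrite)   # reward evading attribution
    return w_sim * sense + w_flu * soundness + w_priv * privacy
```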
https://arxiv.org/abs/2405.10260
Foundation models in computational pathology promise to unlock the development of new clinical decision support systems and models for precision medicine. However, there is a mismatch between most clinical analysis, which is defined at the level of one or more whole slide images, and foundation models to date, which process the thousands of image tiles contained in a whole slide image separately. The requirement to train a network to aggregate information across a large number of tiles in multiple whole slide images limits these models' impact. In this work, we present a slide-level foundation model for H&E-stained histopathology, PRISM, that builds on Virchow tile embeddings and leverages clinical report text for pre-training. Using the tile embeddings, PRISM produces slide-level embeddings with the ability to generate clinical reports, resulting in several modes of use. Using text prompts, PRISM achieves zero-shot cancer detection and sub-typing performance approaching and surpassing that of a supervised aggregator model. Using the slide embeddings with linear classifiers, PRISM surpasses supervised aggregator models. Furthermore, we demonstrate that fine-tuning of the PRISM slide encoder yields label-efficient training for biomarker prediction, a task that typically suffers from low availability of training data; an aggregator initialized with PRISM and trained on as little as 10% of the training data can outperform a supervised baseline that uses all of the data.
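A sketch of attention-based MIL pooling, a common way to aggregate tile embeddings into a slide-level embedding; the ABMIL-style aggregator is an assumption, and PRISM's actual slide encoder and the Virchow embedding size may differ.

```python
import torch
import torch.nn as nn

class AttentionAggregator(nn.Module):
    """Pools a variable number of tile embeddings into one
    slide embedding via learned attention weights."""
    def __init__(self, tile_dim=2560, hidden=512):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(tile_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, tiles):             # tiles: (N, tile_dim) for one slide
        w = torch.softmax(self.attn(tiles), dim=0)   # (N, 1) tile weights
        return (w * tiles).sum(dim=0)     # (tile_dim,) slide embedding

# A linear classifier on this slide embedding mirrors the label-efficient
# setup described above; fine-tuning the aggregator end-to-end corresponds
# to the biomarker-prediction experiments.
```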
https://arxiv.org/abs/2405.10254
We identify an issue in multi-task learnable compression, in which a representation learned for one task does not contribute to the rate-distortion performance of a different task as much as expected, given the estimated amount of information available in it. We interpret this issue using the predictive $\mathcal{V}$-information framework. In learnable scalable coding, previous work increased the utilization of side-information for input reconstruction by also rewarding input reconstruction when learning this shared representation. We evaluate the impact of this idea in the context of input reconstruction more rigorously and extend it to other computer vision tasks. We perform experiments using representations trained for object detection on COCO 2017 and for depth estimation on the Cityscapes dataset, and use them to assist image reconstruction and semantic segmentation tasks. The results show considerable improvements in the rate-distortion performance of the assisted tasks. Moreover, using the proposed representations, the performance of the base tasks is also improved. The results suggest that the proposed method induces simpler representations that are more compatible with downstream processes.
https://arxiv.org/abs/2405.10244
Hyperspectral target detection (HTD) aims to identify specific materials based on spectral information in hyperspectral imagery, and can detect point targets, some of which occupy less than one pixel. However, existing HTD methods are developed based on per-pixel binary classification, which limits their feature representation capability for point targets. In this paper, we rethink hyperspectral point target detection from the object detection perspective, and focus on object-level prediction capability rather than pixel classification capability. Inspired by the token-based processing flow of the Detection Transformer (DETR), we propose SpecDETR, the first specialized network for hyperspectral multi-class point object detection. Dispensing with the backbone of current object detection frameworks, SpecDETR treats the spectral features of each pixel in a hyperspectral image as a token and utilizes a multi-layer Transformer encoder with local and global coordination attention modules to extract deep spatial-spectral joint features. SpecDETR regards point object detection as a one-to-many set prediction problem, thereby achieving a concise and efficient DETR decoder that surpasses the current state-of-the-art DETR decoder in terms of parameters and accuracy in point object detection. We develop a simulated hyperSpectral Point Object Detection benchmark termed SPOD, and for the first time evaluate and compare the performance of current object detection networks and HTD methods on hyperspectral multi-class point object detection. SpecDETR demonstrates superior performance on the SPOD dataset compared to current object detection networks and HTD methods. Additionally, we validate on a public HTD dataset that, by using data simulation instead of manual annotation, SpecDETR can directly detect real-world single-spectral point objects.
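A sketch of the backbone-free tokenization step: each pixel's spectrum becomes a token that a Transformer encoder processes directly. The embedding size, layer count, and the absence of the paper's local-global coordination attention are simplifications.

```python
import torch
import torch.nn as nn

class SpectralTokenEncoder(nn.Module):
    """Treats every pixel's spectral vector as a token (no conv backbone)
    and encodes spatial-spectral context with a Transformer."""
    def __init__(self, n_bands=100, d_model=256, n_layers=4):
        super().__init__()
        self.embed = nn.Linear(n_bands, d_model)       # spectrum -> token
        layer = nn.TransformerEncoderLayer(d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, hsi):
        # hsi: (B, H, W, n_bands) hyperspectral image patch
        B, H, W, C = hsi.shape
        tokens = self.embed(hsi.reshape(B, H * W, C))  # one token per pixel
        return self.encoder(tokens)                    # (B, H*W, d_model)

# A DETR-style decoder with one-to-many set matching would consume these
# tokens to predict point-object sets, per the abstract.
```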
https://arxiv.org/abs/2405.10148