Implicit neural representation methods have shown impressive advancements in learning 3D scenes from unstructured in-the-wild photo collections but are still limited by the large computational cost of volumetric rendering. More recently, 3D Gaussian Splatting emerged as a much faster alternative with superior rendering quality and training efficiency, especially for small-scale and object-centric scenarios. Nevertheless, this technique suffers from poor performance on unstructured in-the-wild data. To tackle this, we extend 3D Gaussian Splatting to handle unstructured image collections. We achieve this by modeling appearance to capture photometric variations in the rendered images. Additionally, we introduce a new mechanism to train transient Gaussians that handle scene occluders in an unsupervised manner. Experiments on diverse photo-collection scenes and multi-pass acquisitions of outdoor landmarks show the effectiveness of our method over prior works, achieving state-of-the-art results with improved efficiency.
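A minimal PyTorch sketch of the general idea (not the authors' code): each training image gets a learned appearance embedding that modulates per-Gaussian color, and each Gaussian carries a learned transient opacity that can absorb occluders. The module name, dimensions, and the embedding-to-color MLP are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class AppearanceModel(nn.Module):
    def __init__(self, num_images: int, num_gaussians: int, embed_dim: int = 32):
        super().__init__()
        # One appearance embedding per training image captures its photometric state.
        self.image_embed = nn.Embedding(num_images, embed_dim)
        # Per-Gaussian base color, plus a transient opacity that can explain away
        # occluders present in only some images (hypothetical parameterization).
        self.base_color = nn.Parameter(torch.rand(num_gaussians, 3))
        self.transient_logit = nn.Parameter(torch.zeros(num_gaussians))
        self.color_mlp = nn.Sequential(
            nn.Linear(embed_dim + 3, 64), nn.ReLU(), nn.Linear(64, 3)
        )

    def forward(self, image_idx: torch.Tensor):
        emb = self.image_embed(image_idx)                   # (1, D)
        emb = emb.expand(self.base_color.shape[0], -1)      # (N, D)
        color = torch.sigmoid(
            self.color_mlp(torch.cat([emb, self.base_color], dim=-1))
        )                                                   # per-Gaussian RGB
        transient_alpha = torch.sigmoid(self.transient_logit)
        return color, transient_alpha

model = AppearanceModel(num_images=100, num_gaussians=10_000)
color, alpha = model(torch.tensor([3]))  # colors as rendered for image #3
```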
https://arxiv.org/abs/2403.10427
Surface parameterization is a fundamental geometry processing problem with rich downstream applications. Traditional approaches are designed to operate on well-behaved mesh models with high-quality triangulations that are laboriously produced by specialized 3D modelers, and are thus unable to meet the processing demands of the current explosion of ordinary 3D data. In this paper, we seek to perform UV unwrapping on unstructured 3D point clouds. Technically, we propose ParaPoint, an unsupervised neural learning pipeline that achieves global free-boundary surface parameterization by building point-wise mappings between given 3D points and 2D UV coordinates with adaptively deformed boundaries. We ingeniously construct several geometrically meaningful sub-networks with specific functionalities and assemble them into a bi-directional cycle mapping framework. We also design effective loss functions and auxiliary differential geometric constraints for the optimization of the neural mapping process. To the best of our knowledge, this work makes the first attempt to investigate neural point cloud parameterization that pursues both global mappings and free boundaries. Experiments demonstrate the effectiveness and inspiring potential of our proposed learning paradigm. The code will be publicly available.
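A minimal sketch of a bi-directional cycle mapping of the kind described, assuming two plain MLPs and an L2 cycle loss; the paper's actual sub-networks, boundary deformation, and differential-geometric constraints are not reproduced here.

```python
import torch
import torch.nn as nn

def mlp(i: int, o: int) -> nn.Module:
    return nn.Sequential(nn.Linear(i, 128), nn.ReLU(),
                         nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, o))

# Hypothetical cycle mapping: the forward net maps 3D points to 2D UV
# coordinates, the backward net maps UV samples back to 3D, and cycle losses
# keep the two mappings mutually consistent in both directions.
forward_net = mlp(3, 2)   # 3D -> UV
backward_net = mlp(2, 3)  # UV -> 3D

points = torch.rand(2048, 3)            # unstructured point cloud
uv = forward_net(points)                # point-wise parameterization
loss_cycle_3d = (backward_net(uv) - points).pow(2).mean()

uv_samples = torch.rand(2048, 2)        # samples drawn in the UV domain
loss_cycle_uv = (forward_net(backward_net(uv_samples)) - uv_samples).pow(2).mean()

loss = loss_cycle_3d + loss_cycle_uv    # plus the paper's geometric terms
```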
https://arxiv.org/abs/2403.10349
Threat hunting is the practice of sifting through system logs to detect malicious activities that might have bypassed existing security measures. It can be performed in several ways, one of which is based on detecting anomalies. We propose an unsupervised framework, called continuous bag-of-terms-and-time (CBoTT), and publish its application programming interface (API) to help researchers and cybersecurity analysts perform anomaly-based threat hunting in SIEM logs geared toward process auditing on endpoint devices. Analyses show that our framework consistently outperforms benchmark approaches. When logs are sorted by likelihood of being an anomaly (from most likely to least), our approach identifies anomalies nearer the top of the list (at percentiles between 1.82 and 6.46) while benchmark approaches identify the same anomalies farther down (at percentiles between 3.25 and 80.92). This framework can be used by other researchers to conduct benchmark analyses and by cybersecurity analysts to find anomalies in SIEM logs.
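The CBoTT API itself is not shown in the abstract, so the following is only a schematic stand-in for anomaly-based ranking over process-audit logs: term counts plus a time-of-day feature, scored by distance from the data centroid. The sample log lines and the token pattern are invented.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

logs = [
    "2024-01-01T09:12:03 svchost.exe started by services.exe",
    "2024-01-01T09:12:07 explorer.exe started by userinit.exe",
    "2024-01-01T03:44:51 powershell.exe -enc ZQBjAGgAbwA= started by word.exe",
]
# Bag of terms: token counts over the message body (timestamp stripped).
terms = CountVectorizer(token_pattern=r"[A-Za-z.]+").fit_transform(
    [line.split(" ", 1)[1] for line in logs]
).toarray().astype(float)
# Time feature: hour of day, normalized to [0, 1).
hours = np.array([[int(line[11:13]) / 24.0] for line in logs])
X = np.hstack([terms, hours])

center = X.mean(axis=0)
scores = np.linalg.norm(X - center, axis=1)      # anomaly score per line
ranking = np.argsort(-scores)                    # most anomalous first
percentiles = 100.0 * (np.argsort(np.argsort(-scores)) + 1) / len(scores)
print(ranking, percentiles)                      # odd-hour encoded cmd ranks first
```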
https://arxiv.org/abs/2403.10327
With the development of astronomical facilities, large-scale time series data observed by these facilities are being collected. Analyzing anomalies in these astronomical observations is crucial for uncovering potential celestial events and physical phenomena, thus advancing the scientific research process. However, existing time series anomaly detection methods fall short in tackling the unique characteristics of astronomical observations, where each star is inherently independent but interfered with by random concurrent noise, resulting in a high rate of false alarms. To overcome these challenges, we propose AERO, a novel two-stage framework tailored for unsupervised anomaly detection in astronomical observations. In the first stage, we employ a Transformer-based encoder-decoder architecture to learn the normal temporal patterns of each variate (i.e., star), in alignment with the characteristic of variate independence. In the second stage, we enhance a graph neural network with window-wise graph structure learning to tackle the occurrence of concurrent noise characterized by spatial and temporal randomness. In this way, AERO is not only capable of distinguishing normal temporal patterns from potential anomalies but also of effectively differentiating concurrent noise, thus decreasing the number of false alarms. We conducted extensive experiments on three synthetic datasets and three real-world datasets. The results demonstrate that AERO outperforms the compared baselines. Notably, compared to the state-of-the-art model, AERO improves the F1-score by up to 8.76% and 2.63% on synthetic and real-world datasets, respectively.
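A small sketch of what window-wise graph structure learning could look like, assuming the adjacency over stars is re-inferred for every time window from learned window embeddings; the actual AERO architecture is considerably more involved.

```python
import torch
import torch.nn as nn

class WindowGraphLearner(nn.Module):
    """Hypothetical sketch: a fresh soft adjacency over variates (stars) is
    inferred per window, so spatially and temporally random concurrent noise
    can be modeled as transient edges rather than flagged as anomalies."""

    def __init__(self, num_vars: int, window: int, hidden: int = 64):
        super().__init__()
        self.embed = nn.Linear(window, hidden)  # embed each variate's window

    def forward(self, x: torch.Tensor):         # x: (num_vars, window)
        h = torch.tanh(self.embed(x))           # (num_vars, hidden)
        logits = h @ h.t() / h.shape[-1] ** 0.5
        return torch.softmax(logits, dim=-1)    # row-normalized adjacency

learner = WindowGraphLearner(num_vars=16, window=128)
adj = learner(torch.randn(16, 128))             # one graph per window
```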
https://arxiv.org/abs/2403.10220
The ability to model the underlying dynamics of visual scenes and reason about the future is central to human intelligence. Many attempts have been made to empower intelligent systems with such physical understanding and prediction abilities. However, most existing methods focus on pixel-to-pixel prediction, which suffers from heavy computational costs while lacking a deep understanding of the physical dynamics behind videos. Recently, object-centric prediction methods have emerged and attracted increasing interest. Inspired by this line of work, this paper proposes an unsupervised object-centric prediction model that makes future predictions by learning visual dynamics between objects. Our model consists of two modules: a perceptual module and a dynamic module. The perceptual module is utilized to decompose images into several objects and to synthesize images from a set of object-centric representations. The dynamic module fuses contextual information, takes environment-object and object-object interactions into account, and predicts the future trajectories of objects. Extensive experiments are conducted to validate the effectiveness of the proposed method. Both quantitative and qualitative experimental results demonstrate that our model generates higher visual quality and more physically reliable predictions compared to state-of-the-art methods.
https://arxiv.org/abs/2403.10079
Video-based surgical instrument segmentation plays an important role in robot-assisted surgeries. Unlike supervised settings, unsupervised segmentation relies heavily on motion cues, which are challenging to discern due to the typically lower quality of optical flow in surgical footage compared to natural scenes. This presents a considerable burden for the advancement of unsupervised segmentation techniques. In our work, we address the challenge of enhancing model performance despite the inherent limitations of low-quality optical flow. Our methodology employs a three-pronged approach: extracting boundaries directly from the optical flow, selectively discarding frames with inferior flow quality, and employing a fine-tuning process with variable frame rates. We thoroughly evaluate our strategy on the EndoVis2017 VOS dataset and the EndoVis2017 Challenge dataset, where our model demonstrates promising results, achieving a mean Intersection-over-Union (mIoU) of 0.75 and 0.72, respectively. Our findings suggest that our approach can greatly decrease the need for manual annotations in clinical environments and may facilitate the annotation process for new datasets. The code is available at this https URL
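Two of the three prongs lend themselves to a simple NumPy sketch: motion boundaries from the spatial gradient of the flow field, and a frame filter based on a crude flow-quality proxy. The thresholds and the median-magnitude heuristic are assumptions, not the paper's criteria.

```python
import numpy as np

def flow_boundaries(flow: np.ndarray, thresh: float = 1.0) -> np.ndarray:
    """flow: (H, W, 2) array; returns a binary motion-boundary map."""
    gy, gx = np.gradient(flow[..., 0])      # gradients of horizontal component
    hy, hx = np.gradient(flow[..., 1])      # gradients of vertical component
    grad_mag = np.sqrt(gx**2 + gy**2 + hx**2 + hy**2)
    return (grad_mag > thresh).astype(np.uint8)

def keep_frame(flow: np.ndarray, lo: float = 0.5, hi: float = 30.0) -> bool:
    """Discard frames whose median flow magnitude falls outside [lo, hi]."""
    mag = np.linalg.norm(flow, axis=-1)
    return lo < np.median(mag) < hi

flow = np.random.randn(256, 320, 2).astype(np.float32)
if keep_frame(flow):
    boundaries = flow_boundaries(flow)
```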
https://arxiv.org/abs/2403.10039
Relying on paired synthetic data, existing learning-based Computational Aberration Correction (CAC) methods are confronted with the intricate and multifaceted synthetic-to-real domain gap, which leads to suboptimal performance in real-world applications. In this paper, in contrast to improving the simulation pipeline, we deliver a novel insight into real-world CAC from the perspective of Unsupervised Domain Adaptation (UDA). By incorporating readily accessible unpaired real-world data into training, we formalize the Domain Adaptive CAC (DACAC) task, and then introduce a comprehensive Real-world aberrated images (Realab) dataset to benchmark it. The task presents a formidable challenge due to the intricacy of understanding the target aberration domain. To this end, we propose a novel Quantized Domain-Mixing Representation (QDMR) framework as a potent solution to the issue. QDMR adapts the CAC model to the target domain from three key aspects: (1) reconstructing aberrated images of both domains with a VQGAN to learn a Domain-Mixing Codebook (DMC) that characterizes degradation-aware priors; (2) modulating the deep features in the CAC model with the DMC to transfer target-domain knowledge; and (3) leveraging the trained VQGAN to generate pseudo target aberrated images from source ones for convincing target-domain supervision. Extensive experiments on both synthetic and real-world benchmarks reveal that models with QDMR consistently surpass competitive methods in mitigating the synthetic-to-real gap, producing visually pleasant real-world CAC results with fewer artifacts. Codes and datasets will be made publicly available.
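Aspect (2) resembles feature-wise modulation; a hypothetical sketch, assuming the DMC code is pooled into a vector that predicts a per-channel scale and shift for CAC features:

```python
import torch
import torch.nn as nn

class DMCModulation(nn.Module):
    """Illustrative codebook-conditioned modulation (not the paper's exact
    layer): features inside the correction network are scaled and shifted by
    statistics predicted from the pooled domain-mixing code."""

    def __init__(self, code_dim: int, feat_ch: int):
        super().__init__()
        self.to_scale = nn.Linear(code_dim, feat_ch)
        self.to_shift = nn.Linear(code_dim, feat_ch)

    def forward(self, feat: torch.Tensor, code: torch.Tensor):
        # feat: (B, C, H, W); code: (B, code_dim) pooled from DMC entries
        scale = self.to_scale(code).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(code).unsqueeze(-1).unsqueeze(-1)
        return feat * (1 + scale) + shift

mod = DMCModulation(code_dim=64, feat_ch=128)
out = mod(torch.randn(2, 128, 32, 32), torch.randn(2, 64))
```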
https://arxiv.org/abs/2403.10012
Unsupervised domain adaptation (UDA) is vital for alleviating the workload of labeling 3D point cloud data and mitigating the absence of labels when facing a newly defined domain. Various methods of utilizing images to enhance the performance of cross-domain 3D segmentation have recently emerged. However, pseudo labels, which are generated by models trained on the source domain and provide additional supervised signals for the unseen domain, are inadequate for 3D segmentation due to their inherent noisiness, and consequently restrict the accuracy of neural networks. With the advent of 2D visual foundation models (VFMs) and their abundant knowledge priors, we propose a novel pipeline, VFMSeg, to further enhance the cross-modal unsupervised domain adaptation framework by leveraging these models. In this work, we study how to harness the knowledge priors learned by VFMs to produce more accurate labels for unlabeled target domains and improve overall performance. We first utilize a multi-modal VFM, pre-trained on large-scale image-text pairs, to provide supervised labels (VFM-PL) for images and point clouds from the target domain. Then, another VFM trained on fine-grained 2D masks is adopted to guide the generation of semantically augmented images and point clouds, mixing data from the source and target domains in the form of view frustums (FrustumMixing) to enhance the performance of neural networks. Finally, we merge class-wise predictions across modalities to produce more accurate annotations for unlabeled target domains. Our method is evaluated on various autonomous driving datasets, and the results demonstrate a significant improvement on the 3D segmentation task.
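A rough NumPy sketch of frustum-style mixing, assuming sectors are cut by azimuth angle and alternated between domains; in the actual method the target labels would be VFM-derived pseudo labels, and the sector geometry here is an assumption.

```python
import numpy as np

def frustum_mix(src_pts, src_lbl, tgt_pts, tgt_lbl, num_sectors: int = 6):
    """Mix two labeled scans by taking alternating azimuth sectors."""
    def sector(pts):
        az = np.arctan2(pts[:, 1], pts[:, 0])  # azimuth in [-pi, pi]
        return ((az + np.pi) / (2 * np.pi) * num_sectors).astype(int) % num_sectors

    s_keep = sector(src_pts) % 2 == 0          # even sectors from the source
    t_keep = sector(tgt_pts) % 2 == 1          # odd sectors from the target
    pts = np.vstack([src_pts[s_keep], tgt_pts[t_keep]])
    lbl = np.concatenate([src_lbl[s_keep], tgt_lbl[t_keep]])
    return pts, lbl

src, tgt = np.random.randn(5000, 3), np.random.randn(4000, 3)
mixed_pts, mixed_lbl = frustum_mix(src, np.zeros(5000, int),
                                   tgt, np.ones(4000, int))
```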
https://arxiv.org/abs/2403.10001
Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to effectively adapt to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs. Although combining these two approaches can address downstream tasks with limited labeled data, naively integrating ID into MAEs leads to extended training times and high computational costs. To address this challenge, we introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures. Utilizing contrastive tuning, uaMix-MAE aligns the representations of pretrained MAEs, thereby facilitating effective adaptation to task-specific semantics. To optimize the model with small amounts of unlabeled data, we propose an audio mixing technique that manipulates audio samples in both the input and virtual label spaces. Experiments in low/few-shot settings demonstrate that uaMix-MAE achieves 4-6% accuracy improvements over various benchmarks when tuned with limited unlabeled data, such as AudioSet-20K. Code is available at this https URL
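The mixing idea can be sketched in a few lines, assuming a mixup-style blend of two unlabeled waveforms whose coefficient also defines the virtual label targeted during contrastive tuning; the Beta prior and its parameter are assumptions, not the paper's settings.

```python
import torch

def ua_mix(wav_a: torch.Tensor, wav_b: torch.Tensor, alpha: float = 0.4):
    """Blend two unlabeled waveforms; the same coefficient acts as a
    'virtual label' in label space."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * wav_a + (1.0 - lam) * wav_b       # input-space mix
    virtual_label = torch.tensor([lam, 1.0 - lam])  # label-space mix
    return mixed, virtual_label

a, b = torch.randn(16000), torch.randn(16000)       # 1 s of audio at 16 kHz
mixed, target = ua_mix(a, b)
```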
https://arxiv.org/abs/2403.09579
Current human pose estimation systems focus on retrieving an accurate 3D global estimate of a single person. Therefore, this paper presents one of the first 3D multi-person human pose estimation systems that works in real-time and is also able to handle basic forms of occlusion. First, we adjust an off-the-shelf 2D detector and an unsupervised 2D-3D lifting model for use with a 360° panoramic camera and mmWave radar sensors. We then introduce several contributions, including camera and radar calibrations and the improved matching of people within the image and radar space. The system addresses both the depth and scale ambiguity problems by employing a lightweight 2D-3D pose lifting algorithm that works in real-time while exhibiting accurate performance in both indoor and outdoor environments, offering an affordable and scalable solution. Notably, our system's time complexity remains nearly constant irrespective of the number of detected individuals, achieving a frame rate of approximately 7-8 fps on a laptop with a commercial-grade GPU.
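The image-radar matching step can be illustrated as an assignment problem; this sketch (an assumption, not the paper's exact formulation) pairs camera detections with radar returns by wrapped azimuth distance, which also attaches a metric depth to each person.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_people(cam_azimuths: np.ndarray, radar_points: np.ndarray):
    """cam_azimuths: (P,) radians from the panoramic image;
    radar_points: (R, 3) x/y/z returns. Returns (cam_idx, radar_idx, depth)."""
    radar_az = np.arctan2(radar_points[:, 1], radar_points[:, 0])
    cost = np.abs(cam_azimuths[:, None] - radar_az[None, :])
    cost = np.minimum(cost, 2 * np.pi - cost)     # wrap-around angular distance
    rows, cols = linear_sum_assignment(cost)      # optimal one-to-one matching
    depths = np.linalg.norm(radar_points[cols, :2], axis=1)
    return list(zip(rows, cols, depths))

cams = np.array([0.1, 1.5, -2.0])
radar = np.random.randn(4, 3) * 5
print(match_people(cams, radar))
```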
https://arxiv.org/abs/2403.09437
Identifying highlight moments of raw video materials is crucial for improving the efficiency of editing videos that are pervasive on internet platforms. However, the extensive work of manually labeling footage has created obstacles to applying supervised methods to videos of unseen categories. The absence of an audio modality that contains valuable cues for highlight detection in many videos also makes it difficult to use multimodal strategies. In this paper, we propose a novel model with cross-modal perception for unsupervised highlight detection. The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task. To achieve unsupervised highlight detection, we investigate the latent representations of the network and propose the representation activation sequence learning (RASL) module with k-point contrastive learning to learn significant representation activations. To connect the visual modality with the audio modality, we use the symmetric contrastive learning (SCL) module to learn the paired visual and audio representations. Furthermore, an auxiliary task of masked feature vector sequence (FVS) reconstruction is simultaneously conducted during pretraining for representation enhancement. During inference, the cross-modal pretrained model can generate representations with paired visual-audio semantics given only the visual modality. The RASL module is used to output the highlight scores. The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
https://arxiv.org/abs/2403.09401
We introduce Emu Video Edit (EVE), a model that establishes a new state of the art in video editing without relying on any supervised video editing data. To develop EVE we separately train an image editing adapter and a video generation adapter, and attach both to the same text-to-image model. Then, to align the adapters towards video editing, we introduce a new unsupervised distillation procedure, Factorized Diffusion Distillation. This procedure distills knowledge from one or more teachers simultaneously, without any supervised data. We utilize this procedure to teach EVE to edit videos by jointly distilling knowledge to (i) precisely edit each individual frame from the image editing adapter, and (ii) ensure temporal consistency among the edited frames using the video generation adapter. Finally, to demonstrate the potential of our approach in unlocking other capabilities, we align additional combinations of adapters.
https://arxiv.org/abs/2403.09334
Ill-posed image reconstruction problems appear in many scenarios such as remote sensing, where obtaining high quality images is crucial for environmental monitoring, disaster management and urban planning. Deep learning has seen great success in overcoming the limitations of traditional methods. However, these inverse problems rarely come with ground truth data, highlighting the importance of unsupervised learning from partial and noisy measurements alone. We propose perspective-equivariant imaging (EI), a framework that leverages perspective variability in optical camera-based imaging systems, such as satellites or handheld cameras, to recover information lost in ill-posed optical camera imaging problems. This extends previous EI work to include a much richer non-linear class of group transforms and is shown to be an excellent prior for satellite and urban image data, where perspective-EI achieves state-of-the-art results in multispectral pansharpening, outperforming other unsupervised methods in the literature. Code at this https URL
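A condensed sketch of an equivariant-imaging style objective with perspective (homography) transforms, assuming kornia is available for warping; `f` is the reconstruction network, `A` the known forward operator, and the perturbation scale is an arbitrary choice, not the paper's.

```python
import torch
import kornia.geometry.transform as KT

def perspective_ei_loss(f, A, y):
    """Measurement consistency plus perspective equivariance: applying a
    random homography to the estimate and re-imaging it should agree with
    reconstructing the transformed view."""
    x_hat = f(y)                                    # (B, C, H, W)
    b = x_hat.shape[0]
    M = torch.eye(3).repeat(b, 1, 1)                # mild random homographies
    M[:, :2, :] += 0.01 * torch.randn(b, 2, 3)
    x_t = KT.warp_perspective(x_hat, M, dsize=x_hat.shape[-2:])
    loss_mc = ((A(x_hat) - y) ** 2).mean()          # measurement consistency
    loss_ei = ((f(A(x_t)) - x_t) ** 2).mean()       # perspective equivariance
    return loss_mc + loss_ei
```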
https://arxiv.org/abs/2403.09327
Electronic health records (EHR) and claims data are rich sources of real-world data that reflect patient health status and healthcare utilization. Querying these databases to answer epidemiological questions is challenging due to the intricacy of medical terminology and the need for complex SQL queries. Here, we introduce an end-to-end methodology that combines text-to-SQL generation with retrieval augmented generation (RAG) to answer epidemiological questions using EHR and claims data. We show that our approach, which integrates a medical coding step into the text-to-SQL process, significantly improves the performance over simple prompting. Our findings indicate that although current language models are not yet sufficiently accurate for unsupervised use, RAG offers a promising direction for improving their capabilities, as shown in a realistic industry setting.
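The pipeline shape can be sketched as follows; `code_index`, `llm_complete`, and the claims schema are hypothetical stand-ins, since the abstract specifies only that a medical coding step feeds the text-to-SQL prompt.

```python
def answer_epi_question(question: str, code_index, llm_complete) -> str:
    """Illustrative RAG-augmented text-to-SQL: retrieve candidate medical
    codes first, then let the model write SQL with the codes in context."""
    # 1) Medical coding step: retrieve codes for entities in the question,
    #    e.g. [("E11", "type 2 diabetes"), ...] (hypothetical retriever).
    codes = code_index.search(question, top_k=5)
    code_hint = "\n".join(f"{c}: {desc}" for c, desc in codes)

    # 2) Text-to-SQL with the retrieved codes injected into the prompt, so
    #    the model does not have to guess terminology.
    prompt = (
        "Schema: claims(patient_id, icd10_code, service_date)\n"
        f"Relevant codes:\n{code_hint}\n"
        f"Question: {question}\n"
        "Write a SQL query that answers the question."
    )
    return llm_complete(prompt)
```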
https://arxiv.org/abs/2403.09226
A connectional brain template (CBT) is a holistic representation of a population of multi-view brain connectivity graphs, encoding shared patterns and normalizing typical variations across individuals. Federated CBT learning allows for an inclusive estimation of the representative center of multi-domain brain connectivity datasets in a fully data-preserving manner. However, existing methods overlook the non-independent and identically distributed (non-IID) issue stemming from multi-domain brain connectivity heterogeneity, in which data domains are drawn from different hospitals and imaging modalities. To overcome this limitation, we propose a metadata-driven federated learning framework, called MetaFedCBT, for cross-domain CBT learning. Given the data drawn from a specific domain (i.e., hospital), our model aims to learn metadata in a fully supervised manner by introducing a local client-based regressor network. The generated metadata is forced to meet the statistical attributes (e.g., mean) of other domains while preserving their privacy. Our supervised metadata generation approach boosts the unsupervised learning of a more centered, representative, and holistic CBT of a particular brain state across diverse domains. As the federated learning progresses over multiple rounds, the learned metadata and the associated generated connectivities are continuously updated to better approximate the target domain information. MetaFedCBT overcomes the non-IID issue of existing methods by generating informative brain connectivities for privacy-preserving holistic CBT learning, with guidance from metadata. Extensive experiments on multi-view morphological brain networks of normal and patient subjects demonstrate that MetaFedCBT is a superior federated CBT learning model and significantly advances the state-of-the-art performance.
https://arxiv.org/abs/2403.09139
Anomaly detection in dynamic graphs presents a significant challenge due to the temporal evolution of graph structures and attributes. Conventional approaches to this problem typically employ an unsupervised learning framework, capturing normality patterns with exclusively normal data during training and identifying deviations as anomalies during testing. However, these methods face critical drawbacks: they either depend only on proxy tasks for general representation without directly pinpointing normal patterns, or they neglect to differentiate between spatial and temporal normality patterns, leading to diminished efficacy in anomaly detection. To address these challenges, we introduce a novel spatial-temporal memories-enhanced graph autoencoder (STRIPE). Initially, STRIPE employs Graph Neural Networks (GNNs) and gated temporal convolution layers to extract spatial and temporal features, respectively. STRIPE then incorporates separate spatial and temporal memory networks, which capture and store prototypes of normal patterns, thereby preserving the uniqueness of spatial and temporal normality. After that, through a mutual attention mechanism, these stored patterns are retrieved and integrated with the encoded graph embeddings. Finally, the integrated features are fed into the decoder to reconstruct the graph streams, which serves as the proxy task for anomaly detection. This comprehensive approach not only minimizes reconstruction errors but also refines the model by emphasizing the compactness and distinctiveness of the embeddings relative to the nearest memory prototypes. Through extensive testing, STRIPE has demonstrated a superior capability to discern anomalies by effectively leveraging the distinct spatial and temporal dynamics of dynamic graphs, significantly outperforming existing methodologies with an average improvement of 15.39% in AUC.
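A minimal sketch of one memory network of this kind, assuming prototypes are read by scaled dot-product attention and concatenated with the encoding before decoding; STRIPE's spatial and temporal branches would each hold such a module, and the sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class MemoryRead(nn.Module):
    """A bank of learned normal-pattern prototypes is queried by attention,
    and the retrieved pattern is fused with the encoding for the decoder."""

    def __init__(self, num_prototypes: int, dim: int):
        super().__init__()
        self.memory = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, z: torch.Tensor):            # z: (batch, dim)
        attn = torch.softmax(z @ self.memory.t() / z.shape[-1] ** 0.5, dim=-1)
        retrieved = attn @ self.memory             # (batch, dim)
        return torch.cat([z, retrieved], dim=-1)   # input to the decoder

mem = MemoryRead(num_prototypes=32, dim=64)
fused = mem(torch.randn(8, 64))                    # (8, 128)
```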
https://arxiv.org/abs/2403.09039
Radiography imaging protocols focus on particular body regions, therefore producing images of great similarity and yielding recurrent anatomical structures across patients. Exploiting this structured information could potentially ease the detection of anomalies from radiography images. To this end, we propose a Simple Space-Aware Memory Matrix for In-painting and Detecting anomalies from radiography images (abbreviated as SimSID). We formulate anomaly detection as an image reconstruction task, consisting of a space-aware memory matrix and an in-painting block in the feature space. During training, SimSID can taxonomize the ingrained anatomical structures into recurrent visual patterns, and during inference it can identify anomalies (unseen/modified visual patterns) in the test image. Our SimSID surpasses the state of the art in unsupervised anomaly detection by +8.0%, +5.0%, and +9.9% AUC scores on the ZhangLab, COVIDx, and CheXpert benchmark datasets, respectively. Code: this https URL
https://arxiv.org/abs/2403.08689
Diffusion models have advanced unsupervised anomaly detection by improving the transformation of pathological images into pseudo-healthy equivalents. Nonetheless, standard approaches may compromise critical information during pathology removal, leading to restorations that do not align with unaffected regions in the original scans. Such discrepancies can inadvertently increase false positive rates and reduce specificity, complicating radiological evaluations. This paper introduces Temporal Harmonization for Optimal Restoration (THOR), which refines the de-noising process by integrating implicit guidance through temporal anomaly maps. THOR aims to preserve the integrity of healthy tissue in areas unaffected by pathology. Comparative evaluations show that THOR surpasses existing diffusion-based methods in detecting and segmenting anomalies in brain MRIs and wrist X-rays. Code: this https URL.
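The guidance idea can be sketched as a per-step convex blend, assuming an anomaly map in [0, 1] where healthy regions are pulled back toward the (appropriately noised) input so restoration only rewrites suspected pathology; this is an illustration of the principle, not THOR's exact update.

```python
import torch

def guided_step(x_denoised: torch.Tensor, x_input_noised: torch.Tensor,
                anomaly_map: torch.Tensor) -> torch.Tensor:
    """anomaly_map in [0, 1]: 1 = likely pathological, 0 = healthy.
    Healthy regions keep the input; anomalous regions keep the restoration."""
    return anomaly_map * x_denoised + (1.0 - anomaly_map) * x_input_noised

x_hat = torch.rand(1, 1, 128, 128)      # current denoised estimate
x_in = torch.rand(1, 1, 128, 128)       # input re-noised to this timestep
amap = torch.zeros(1, 1, 128, 128)      # temporal anomaly map (all healthy)
out = guided_step(x_hat, x_in, amap)    # equals x_in where the map is zero
```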
https://arxiv.org/abs/2403.08464
The acquisition and analysis of high-quality sensor data constitute an essential requirement in shaping the development of fully autonomous driving systems. This process is indispensable for enhancing road safety and ensuring the effectiveness of the technological advancements in the automotive industry. This study introduces the Interaction of Autonomous and Manually-Controlled Vehicles (IAMCV) dataset, a novel and extensive dataset focused on inter-vehicle interactions. The dataset, enriched with a sophisticated array of sensors such as Light Detection and Ranging, cameras, Inertial Measurement Unit/Global Positioning System, and vehicle bus data acquisition, provides a comprehensive representation of real-world driving scenarios that include roundabouts, intersections, country roads, and highways, recorded across diverse locations in Germany. Furthermore, the study shows the versatility of the IAMCV dataset through several proof-of-concept use cases. Firstly, an unsupervised trajectory clustering algorithm illustrates the dataset's capability in categorizing vehicle movements without the need for labeled training data. Secondly, we compare an online camera calibration method with the Robot Operating System-based standard, using images captured in the dataset. Finally, a preliminary test employing the YOLOv8 object-detection model is conducted, augmented by reflections on the transferability of object detection across various LIDAR resolutions. These use cases underscore the practical utility of the collected dataset, emphasizing its potential to advance research and innovation in the area of intelligent vehicles.
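The trajectory-clustering use case needs no labels and can be approximated in a few lines, assuming variable-length (x, y) tracks resampled to a fixed length and grouped with k-means; the synthetic random-walk tracks below stand in for IAMCV trajectories.

```python
import numpy as np
from sklearn.cluster import KMeans

def resample(track: np.ndarray, n: int = 20) -> np.ndarray:
    """Resample a (T, 2) track to n evenly spaced points."""
    t = np.linspace(0, 1, len(track))
    ti = np.linspace(0, 1, n)
    return np.stack([np.interp(ti, t, track[:, d]) for d in range(2)], axis=1)

# Synthetic variable-length tracks as placeholders for recorded trajectories.
tracks = [np.cumsum(np.random.randn(np.random.randint(30, 80), 2), axis=0)
          for _ in range(100)]
X = np.stack([resample(tr).ravel() for tr in tracks])   # (100, 40)
labels = KMeans(n_clusters=4, n_init=10).fit_predict(X) # unsupervised groups
```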
https://arxiv.org/abs/2403.08455
A syntactic language model (SLM) incrementally generates a sentence together with its syntactic tree in a left-to-right manner. We present Generative Pretrained Structured Transformers (GPST), an unsupervised SLM at scale capable of being pre-trained from scratch on raw texts with high parallelism. GPST circumvents the limitations of previous SLMs such as reliance on gold trees and sequential training. It consists of two components: a usual SLM supervised by a uni-directional language modeling loss, and an additional composition model, which induces syntactic parse trees and computes constituent representations, supervised by a bi-directional language modeling loss. We propose a representation surrogate to enable joint parallel training of the two models in a hard-EM fashion. We pre-train GPST on OpenWebText, a corpus with 9 billion tokens, and demonstrate the superiority of GPST over GPT-2 of a comparable size on numerous tasks covering both language understanding and language generation. Meanwhile, GPST also significantly outperforms existing unsupervised SLMs on left-to-right grammar induction, while offering a substantial acceleration in training.
https://arxiv.org/abs/2403.08293