Diffusion generative modelling (DGM) based on stochastic differential equations (SDEs) with score matching has achieved unprecedented results in data generation. In this paper, we propose a novel fast, high-quality generative modelling method based on high-order Langevin dynamics (HOLD) with score matching. This motivation is justified by an analysis of third-order Langevin dynamics. By augmenting the previous SDEs, e.g. the variance-exploding or variance-preserving SDEs for single-data-variable processes, HOLD can simultaneously model position, velocity, and acceleration, thereby improving both the quality and the speed of data generation. HOLD is composed of one Ornstein-Uhlenbeck process and two Hamiltonians, which reduce the mixing time by two orders of magnitude. Empirical experiments on unconditional image generation with the public datasets CIFAR-10 and CelebA-HQ show significant gains in both Fréchet inception distance (FID) and negative log-likelihood, achieving a state-of-the-art FID of 1.85 on CIFAR-10.
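The "one Ornstein-Uhlenbeck process plus two Hamiltonians" structure can be illustrated on a toy quadratic potential. The sketch below is my own minimal construction, not the paper's tuned HOLD sampler: the drift consists of two skew-symmetric (Hamiltonian) couplings, x↔v and v↔a, plus OU friction and noise acting on the acceleration alone; for U(x) = x²/2 the stationary law of (x, v, a) is the standard Gaussian.

```python
import numpy as np

def hold_sample(n_particles=5000, gamma=2.0, dt=0.01, n_steps=2000, seed=0):
    """Toy Euler-Maruyama simulation of a third-order Langevin chain.

    Drift = two skew-symmetric (Hamiltonian) couplings, x<->v and v<->a,
    plus an Ornstein-Uhlenbeck friction/noise term acting only on the
    acceleration a.  For the quadratic potential U(x) = x^2/2 the
    stationary law of (x, v, a) is the standard Gaussian N(0, I).
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_particles)
    v = np.zeros(n_particles)
    a = np.zeros(n_particles)
    for _ in range(n_steps):
        noise = rng.standard_normal(n_particles) * np.sqrt(2.0 * gamma * dt)
        x, v, a = (x + v * dt,                         # Hamiltonian flow x<->v
                   v + (-x + a) * dt,                  # -grad U(x) plus coupling to a
                   a + (-v - gamma * a) * dt + noise)  # OU process on a
    return x, v, a
```

With these illustrative coefficients the x-marginal equilibrates to N(0, 1); in HOLD the data plays the role of x and a learned score network replaces the analytic gradient of U.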
https://arxiv.org/abs/2404.12814
This report introduces a solution to Topic 1, Zero-shot Image Captioning, of the 2024 NICE challenge: New frontiers for zero-shot Image Captioning Evaluation. In contrast to the NICE 2023 datasets, this challenge involves new human annotations with significant differences in caption style and content. Therefore, we enhance image captions effectively through retrieval augmentation and caption grading. At the data level, we use high-quality captions generated by image captioning models as training data to close the gap in text style. At the model level, we employ OFA (a large-scale visual-language pre-training model based on handcrafted templates) to perform the image captioning task. Subsequently, we propose a caption-level strategy for the high-quality caption data generated by the image captioning models and integrate it with a retrieval augmentation strategy into the template, compelling the model to generate higher-quality, better-matching, and semantically richer captions based on the retrieval augmentation prompts. Our approach ranks first on the leaderboard, achieving a CIDEr score of 234.11 and first place in all other metrics.
https://arxiv.org/abs/2404.12739
The integration of visual cues has revitalized the performance of the target speech extraction task, elevating it to the forefront of the field. Nevertheless, this multi-modal learning paradigm often encounters the challenge of modality imbalance. In audio-visual target speech extraction tasks, the audio modality tends to dominate, potentially overshadowing the importance of visual guidance. To tackle this issue, we propose AVSepChain, drawing inspiration from the speech chain concept. Our approach partitions the audio-visual target speech extraction task into two stages: speech perception and speech production. In the speech perception stage, audio serves as the dominant modality, while visual information acts as the conditional modality. Conversely, in the speech production stage, the roles are reversed. This transformation of modality status aims to alleviate the problem of modality imbalance. Additionally, we introduce a contrastive semantic matching loss to ensure that the semantic information conveyed by the generated speech aligns with the semantic information conveyed by lip movements during the speech production stage. Through extensive experiments conducted on multiple benchmark datasets for audio-visual target speech extraction, we showcase the superior performance achieved by our proposed method.
https://arxiv.org/abs/2404.12725
With the development of remote sensing technology in recent decades, spaceborne sensors with sub-meter and meter spatial resolution (e.g. WorldView and PlanetScope) have achieved sufficient image quality to generate 3D geospatial data via a stereo matching pipeline. These achievements have significantly increased data accessibility in 3D, necessitating the adaptation of these 3D geospatial data to analyze human and natural environments. This dissertation explores several novel approaches based on stereo and multi-view satellite image-derived 3D geospatial data to deal with remote sensing application issues in built-up area modeling and natural environment monitoring, including building model 3D reconstruction, glacier dynamics tracking, and lake algae monitoring. Specifically, the dissertation introduces four novel approaches that address the spatial and temporal challenges of satellite-derived 3D data. The first study advances LoD-2 building modeling from satellite-derived orthophotos and DSMs with a model-driven workflow that generates rectangular 3D building geometry models. Second, we enhance this building reconstruction framework for dense urban areas and non-rectangular buildings: we implement deep learning for unit-level segmentation and introduce a gradient-based circle reconstruction for circular buildings, developing a polygon composition technique for advanced LoD-2 building reconstruction. Our third study utilizes high-spatiotemporal-resolution PlanetScope satellite imagery for glacier tracking in 3D in mid-latitude regions. Finally, we propose a term, the "Algal Behavior Function", to refine the quantification of chlorophyll-a concentrations from satellite imagery in water quality monitoring, addressing algae fluctuations and timing discrepancies between satellite observations and field measurements, thus enhancing the precision of underwater algae volume estimates.
Overall, this dissertation demonstrates the extensive potential of satellite photogrammetry applications in addressing urban and environmental challenges. It further showcases innovative analytical methodologies that broaden the applicability of stereo and multi-view very-high-resolution satellite-derived 3D data.
https://arxiv.org/abs/2404.12487
Large-scale geolocation telematics data acquired from connected vehicles has the potential to significantly enhance mobility infrastructures and operational systems within smart cities. To utilize this data effectively, it is essential to match the geolocation data accurately to road segments. However, this matching is often non-trivial due to low sampling rates and errors exacerbated by multipath effects in urban environments. Traditionally, statistical modeling techniques such as hidden Markov models, which incorporate domain knowledge into the matching process, have been used extensively for map-matching tasks. However, such rule-based map-matching approaches are noise-sensitive and inefficient at processing large-scale trajectory data. Deep learning techniques learn the relationship between observed data and road networks directly from the data, often without the need for hand-crafted rules or domain knowledge. This renders them an efficient approach for map-matching large-scale datasets and makes them more robust to noise. This paper introduces a sequence-to-sequence deep learning model, specifically a transformer-based encoder-decoder, to serve as a surrogate for map-matching algorithms. The encoder first maps the series of noisy GPS points into a representation that automatically captures the autoregressive behavior and spatial correlations between GPS points. The decoder then associates the data points with road network features, transforming these representations into a sequence of road segments. The model is trained and evaluated using GPS traces collected in Manhattan, New York. Achieving an accuracy of 76%, the transformer-based encoder-decoder model, an architecture extensively employed in natural language processing, shows promising performance for translating noisy GPS data into navigated routes on urban road networks.
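For contrast with the learned surrogate, the classical HMM baseline mentioned above reduces to Viterbi decoding over candidate road segments. A minimal sketch, with toy emission and transition scores rather than the paper's model:

```python
import numpy as np

def viterbi_map_match(emission_logp, transition_logp):
    """Minimal Viterbi decoder of the kind used in classical HMM map matching.

    emission_logp:   (T, S) log-likelihood of each GPS fix given each segment
    transition_logp: (S, S) log-probability of moving between segments
    Returns the most likely sequence of segment indices.
    """
    T, S = emission_logp.shape
    score = emission_logp[0].copy()
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition_logp        # cand[from, to]
        back[t] = cand.argmax(axis=0)                  # best predecessor per segment
        score = cand.max(axis=0) + emission_logp[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):                      # backtrack
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```

In a real map matcher, emissions come from GPS-to-segment distances and transitions from routing distances; here they are hand-picked toy probabilities.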
https://arxiv.org/abs/2404.12460
Recognizing places from an opposing viewpoint during a return trip is a common experience for human drivers. However, the analogous robotics capability, visual place recognition (VPR) with limited field of view cameras under 180 degree rotations, has proven to be challenging to achieve. To address this problem, this paper presents Same Place Opposing Trajectory (SPOT), a technique for opposing viewpoint VPR that relies exclusively on structure estimated through stereo visual odometry (VO). The method extends recent advances in lidar descriptors and utilizes a novel double (similar and opposing) distance matrix sequence matching method. We evaluate SPOT on a publicly available dataset with 6.7-7.6 km routes driven in similar and opposing directions under various lighting conditions. The proposed algorithm demonstrates remarkable improvement over the state-of-the-art, achieving up to 91.7% recall at 100% precision in opposing viewpoint cases, while requiring less storage than all baselines tested and running faster than all but one. Moreover, the proposed method assumes no a priori knowledge of whether the viewpoint is similar or opposing, and also demonstrates competitive performance in similar viewpoint cases.
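The double distance matrix idea can be sketched as follows: build one descriptor distance matrix for the same-direction hypothesis and one with the reference sequence reversed for the opposing-direction hypothesis, run sequence matching on both, and keep the better hypothesis. This is a toy version with assumed cosine distances and plain diagonal sums, simpler than SPOT's actual matcher:

```python
import numpy as np

def double_distance_sequence_match(query, reference, seq_len=3):
    """Sketch of double (similar + opposing) distance-matrix sequence matching.

    query, reference: (n, d) L2-normalised place descriptors along each route.
    Scores straight diagonal sub-sequences of length seq_len in each distance
    matrix and returns the winning hypothesis with its (cost, i, j) match.
    """
    def best_diagonal(dist):
        n_q, n_r = dist.shape
        best = (np.inf, -1, -1)
        for i in range(n_q - seq_len + 1):
            for j in range(n_r - seq_len + 1):
                cost = sum(dist[i + k, j + k] for k in range(seq_len))
                if cost < best[0]:
                    best = (cost, i, j)
        return best

    d_sim = 1.0 - query @ reference.T          # same-direction hypothesis
    d_opp = 1.0 - query @ reference[::-1].T    # reference traversed backwards
    sim = best_diagonal(d_sim)
    opp = best_diagonal(d_opp)
    return ("similar", sim) if sim[0] <= opp[0] else ("opposing", opp)
```

No prior knowledge of the traversal direction is needed: both hypotheses are scored and the cheaper one wins, mirroring SPOT's assumption-free operation.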
https://arxiv.org/abs/2404.12339
Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics and beyond. A prominent challenge is the partial-to-partial shape matching setting, which occurs when the shapes to match are only observed incompletely (e.g. from 3D scanning). Although partial-to-partial matching is a highly relevant setting in practice, it is rarely explored. Our work bridges the gap between existing (rather artificial) 3D full-shape matching and partial-to-partial real-world settings by exploiting geometric consistency as a strong constraint. We demonstrate that it is indeed possible to solve this challenging problem in a variety of settings. For the first time, we achieve geometric consistency for partial-to-partial matching, realized by a novel integer non-linear programming formulation built on triangle product spaces, along with a new pruning algorithm based on linear integer programming. Further, we generate a new inter-class dataset for partial-to-partial shape matching. We show that our method outperforms current SOTA methods on both an established intra-class dataset and our novel inter-class dataset.
https://arxiv.org/abs/2404.12209
Language models (LMs) trained on vast quantities of text data can acquire sophisticated skills such as generating summaries, answering questions or generating code. However, they also manifest behaviors that violate human preferences, e.g., they can generate offensive content, falsehoods or perpetuate social biases. In this thesis, I explore several approaches to aligning LMs with human preferences. First, I argue that aligning LMs can be seen as Bayesian inference: conditioning a prior (the base, pretrained LM) on evidence about human preferences (Chapter 2). Conditioning on human preferences can be implemented in numerous ways. In Chapter 3, I investigate the relation between two approaches to finetuning pretrained LMs using feedback given by a scoring function: reinforcement learning from human feedback (RLHF) and distribution matching. I show that RLHF can be seen as a special case of distribution matching, but distribution matching is strictly more general. In Chapter 4, I show how to extend distribution matching to conditional language models. Finally, in Chapter 5, I explore a different route: conditioning an LM on human preferences already during pretraining. I show that involving human feedback from the very start tends to be more effective than using it only during supervised finetuning. Overall, these results highlight the room for alignment techniques different from, and complementary to, RLHF.
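The Bayesian view is easy to state concretely for a finite set of completions: the aligned model is the prior reweighted by exponentiated preference evidence, which is also the well-known closed-form optimum of the KL-regularised RLHF objective. A minimal numeric sketch:

```python
import numpy as np

def align_as_posterior(prior_logits, reward, beta=1.0):
    """Toy "alignment as Bayesian inference" over a finite set of completions:
    the aligned distribution is the prior (pretrained LM) multiplied by an
    exponentiated preference term and renormalised,

        p_aligned(x)  ∝  p_prior(x) * exp(r(x) / beta).

    This is the sense in which RLHF (with KL regularisation strength beta)
    is a special case of distribution matching against this target.
    """
    prior = np.exp(prior_logits - np.max(prior_logits))  # softmax of prior logits
    prior /= prior.sum()
    unnorm = prior * np.exp(np.asarray(reward, dtype=float) / beta)
    return unnorm / unnorm.sum()
```

With a uniform two-outcome prior and rewards (0, ln 3) at beta = 1, the posterior shifts to (0.25, 0.75): preference evidence tilts, but does not replace, the pretrained distribution.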
https://arxiv.org/abs/2404.12150
Unsupervised constituency parsing is the task of identifying the word sequences that form syntactic units (i.e., constituents) in a target sentence. Linguists identify constituents by evaluating a set of Predicate-Argument Structure (PAS) equivalent sentences, in which constituents surface as frequent word sequences. However, such information is unavailable to previous parsing methods, which identify constituents by observing sentences with diverse PAS. In this study, we empirically verify that \textbf{constituents correspond to word sequence patterns in the PAS-equivalent sentence set}. We propose a frequency-based method, \emph{span-overlap}, applying the word sequence pattern to computational unsupervised parsing for the first time. Parsing experiments show that the span-overlap parser outperforms state-of-the-art parsers in eight out of ten languages. A further discrimination analysis confirms that the span-overlap method can non-trivially separate constituents from non-constituents. This result highlights the utility of the word sequence pattern. Additionally, we discover a multilingual phenomenon: \textbf{participant-denoting constituents are more frequent than event-denoting constituents}. The phenomenon indicates a behavioral difference between the two constituent types, laying the foundation for future labeled unsupervised parsing.
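A minimal frequency-based span score in this spirit (illustrative only; the paper's span-overlap scoring may differ in detail) counts how many PAS-equivalent variants contain a candidate span of the target sentence contiguously:

```python
def span_overlap_scores(target, variants):
    """Score each multi-word span of `target` by how many PAS-equivalent
    variant sentences contain the same word sequence contiguously.
    High-scoring spans are candidate constituents.
    """
    tgt = target.split()
    var_tokens = [v.split() for v in variants]

    def contains(sent, span):
        m = len(span)
        return any(sent[k:k + m] == span for k in range(len(sent) - m + 1))

    scores = {}
    for i in range(len(tgt)):
        for j in range(i + 2, len(tgt) + 1):   # candidate spans of length >= 2
            span = tgt[i:j]
            scores[tuple(span)] = sum(contains(s, span) for s in var_tokens)
    return scores
```

In the toy example below, the constituents "the dog" and "a cat" survive passivization and verb substitution, while the non-constituent "dog chased" does not.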
https://arxiv.org/abs/2404.12059
Address matching is an important task for many businesses, especially delivery and takeout companies, which need to resolve a given address against their data warehouse. Existing solutions use string similarity and edit-distance algorithms to find similar addresses in an address database, but these algorithms do not work effectively with redundant, unstructured, or incomplete address data. This paper discusses a semantic address matching technique for finding a particular address in a list of candidate addresses. We also review existing practices and their shortcomings. Semantic address matching is essentially an NLP task in the field of deep learning, and it allows us to overcome drawbacks of existing methods such as redundant or abbreviated data. The solution applies OCR to invoices to extract addresses and build an address data pool. This data is then fed to the BM25 algorithm, which scores the best-matching entries; the top candidates are passed through BERT to select the best result among the similar queries. Our investigation shows that our methodology greatly improves both accuracy and recall over existing cutting-edge techniques.
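The first retrieval stage can be sketched with a from-scratch BM25 scorer over an address pool (the subsequent BERT re-ranking stage is omitted here; the addresses below are made-up examples):

```python
import math
from collections import Counter

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Minimal BM25 scorer for first-stage address candidate retrieval.

    docs: list of address strings; returns indices sorted by descending score.
    """
    tokenized = [d.lower().split() for d in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter()                       # document frequency per term
    for d in tokenized:
        df.update(set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
            s += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return sorted(range(N), key=lambda i: -scores[i])
```

Rare, discriminative tokens (a house number, a street name) dominate the score, which is exactly what makes BM25 a reasonable pre-filter before semantic re-ranking.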
https://arxiv.org/abs/2404.11691
Recent diffusion probabilistic models (DPMs) in the field of pansharpening have gradually gained attention and have achieved state-of-the-art (SOTA) performance. In this paper, we identify shortcomings in directly applying DPMs to the task of pansharpening as an inverse problem: 1) initiating sampling directly from Gaussian noise neglects the low-resolution multispectral image (LRMS) as a prior; 2) low sampling efficiency often necessitates a higher number of sampling steps. We first reformulate pansharpening into the stochastic differential equation (SDE) form of an inverse problem. Building upon this, we propose a Schrödinger bridge (SB) matching method that addresses both issues. We design an efficient deep neural network architecture tailored to the proposed SB matching. In comparison to the well-established DL-regressive-based framework and the recent DPM framework, our method demonstrates SOTA performance with fewer sampling steps. Moreover, we discuss the relationship between SB matching and other methods based on SDEs and ordinary differential equations (ODEs), as well as its connection with optimal transport. Code will be available.
https://arxiv.org/abs/2404.11416
Nowadays, accurate geo-localization of ground-view images plays an important role across domains as diverse as journalism, forensic analysis, transport, and Earth Observation. This work addresses the problem of matching a query ground-view image with the corresponding satellite image without GPS data. This is done by comparing the features of a ground-view image and a satellite one, innovatively leveraging the latter's semantic segmentation mask through a three-stream Siamese-like network. The proposed method, Semantic Align Net (SAN), focuses on limited Field-of-View (FoV) and ground panorama images (images with a FoV of 360°). The novelty lies in the fusion of satellite images with their semantic segmentation masks, aimed at ensuring that the model can extract useful features and focus on the significant parts of the images. This work shows how SAN, through semantic analysis of images, improves performance on the unlabelled CVUSA dataset for all the tested FoVs.
https://arxiv.org/abs/2404.11302
Constructing vectorized high-definition (HD) maps from surround-view cameras has garnered significant attention in recent years. However, the multi-stage sequential workflow commonly employed in prevailing approaches often leads to the loss of early-stage information, particularly in perspective-view features. Usually, such loss is observed as missing instances or shape mismatches in the final bird's-eye-view predictions. To address this concern, we propose a novel approach, namely \textbf{HybriMap}, which effectively exploits clues from hybrid features to ensure the delivery of valuable information. Specifically, we design the Dual Enhancement Module to enable both explicit integration and implicit modification under the guidance of hybrid features. Additionally, the perspective keypoints are utilized as supervision, further directing the feature enhancement process. Extensive experiments conducted on existing benchmarks demonstrate the state-of-the-art performance of our proposed approach.
https://arxiv.org/abs/2404.11155
Matching is one of the simplest approaches for estimating causal effects from observational data. Matching techniques compare the observed outcomes across pairs of individuals with similar covariate values but different treatment statuses in order to estimate causal effects. However, traditional matching techniques are unreliable given high-dimensional covariates due to the infamous curse of dimensionality. To overcome this challenge, we propose a simple, fast, yet highly effective approach to matching using Random Hyperplane Tessellations (RHPT). First, we prove that the RHPT representation is an approximate balancing score -- thus maintaining the strong ignorability assumption -- and provide empirical evidence for this claim. Second, we report results of extensive experiments showing that matching using RHPT outperforms traditional matching techniques and is competitive with state-of-the-art deep learning methods for causal effect estimation. In addition, RHPT avoids the need for computationally expensive training of deep neural networks.
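A minimal sketch of RHPT-based matching: random hyperplanes through the origin induce binary codes, and each treated unit is matched to the control unit with the smallest Hamming distance between codes. This is an illustrative brute-force version with arbitrary parameters, not the paper's implementation:

```python
import numpy as np

def rhpt_codes(X, n_hyperplanes=256, seed=0):
    """Random Hyperplane Tessellation: each random hyperplane splits covariate
    space in two, and a unit's code is the bit pattern of which side it falls
    on. Units with nearby covariates share most bits.
    """
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hyperplanes))
    return X @ W > 0                      # (n_units, n_hyperplanes) boolean codes

def nearest_opposite(codes, treated):
    """Match each treated unit to the control unit whose RHPT code is closest
    in Hamming distance (brute-force 1-NN matching for illustration)."""
    t_idx = np.flatnonzero(treated)
    c_idx = np.flatnonzero(~treated)
    ham = (codes[t_idx][:, None, :] != codes[c_idx][None, :, :]).sum(-1)
    return t_idx, c_idx[ham.argmin(axis=1)]
```

The matched pairs' outcome differences would then be averaged to estimate the treatment effect; the point of the RHPT step is that Hamming distance on the codes approximately preserves covariate similarity without any training.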
https://arxiv.org/abs/2404.10907
Today's software stacks for autonomous vehicles rely on HD maps to enable sufficient localization, accurate path planning, and reliable motion prediction. Recent developments have resulted in pipelines for the automated generation of HD maps that reduce the manual effort of creating and updating them. We present FlexMap Fusion, a methodology to automatically update and enhance existing HD vector maps using OpenStreetMap. Our approach is designed to enable the use of HD maps created from LiDAR and camera data within Autoware. The pipeline provides different functionalities: it can georeference both the point cloud map and the vector map using an RTK-corrected GNSS signal, and missing semantic attributes can be conflated from OpenStreetMap into the vector map. Differences between the HD map and OpenStreetMap are visualized for manual refinement by the user. In general, our findings indicate that our approach reduces human labor during HD map generation, increases the scalability of the mapping pipeline, and improves the completeness and usability of the maps. The methodological choices may have resulted in limitations that arise especially at complex street structures, e.g., traffic islands. Therefore, more research is necessary on efficient preprocessing algorithms and on the dynamic adjustment of matching parameters. In order to build upon our work, our source code is available at this https URL.
https://arxiv.org/abs/2404.10879
Large vision-language models revolutionized image classification and semantic segmentation paradigms. However, they typically assume a pre-defined set of categories, or vocabulary, at test time for composing textual prompts. This assumption is impractical in scenarios with unknown or evolving semantic context. Here, we address this issue and introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary. VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories. To address VIC, we propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database. CaSED first extracts the set of candidate categories from the most semantically similar captions in the database and then assigns the image to the best-matching candidate category according to the same vision-language model. Furthermore, we demonstrate that CaSED can be applied locally to generate a coarse segmentation mask that classifies image regions, introducing the task of Vocabulary-free Semantic Segmentation. CaSED and its variants outperform other more complex vision-language models, on classification and semantic segmentation benchmarks, while using much fewer parameters.
https://arxiv.org/abs/2404.10864
Image super-resolution is a fundamentally ill-posed problem because multiple valid high-resolution images exist for one low-resolution image. Super-resolution methods based on diffusion probabilistic models can deal with the ill-posed nature by learning the distribution of high-resolution images conditioned on low-resolution images, avoiding the problem of blurry images in PSNR-oriented methods. However, existing diffusion-based super-resolution methods have high time consumption with the use of iterative sampling, while the quality and consistency of generated images are less than ideal due to problems like color shifting. In this paper, we propose Efficient Conditional Diffusion Model with Probability Flow Sampling (ECDP) for image super-resolution. To reduce the time consumption, we design a continuous-time conditional diffusion model for image super-resolution, which enables the use of probability flow sampling for efficient generation. Additionally, to improve the consistency of generated images, we propose a hybrid parametrization for the denoiser network, which interpolates between the data-predicting parametrization and the noise-predicting parametrization for different noise scales. Moreover, we design an image quality loss as a complement to the score matching loss of diffusion models, further improving the consistency and quality of super-resolution. Extensive experiments on DIV2K, ImageNet, and CelebA demonstrate that our method achieves higher super-resolution quality than existing diffusion-based image super-resolution methods while having lower time consumption. Our code is available at this https URL.
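Probability flow sampling can be illustrated on a 1-D toy problem where the score is known in closed form: the deterministic ODE transport replaces stochastic iterative sampling. This sketch uses the exact Gaussian score rather than ECDP's learned conditional network:

```python
import numpy as np

def probability_flow_sample(n=4000, sigma_data=2.0, sigma_max=10.0,
                            n_steps=500, seed=0):
    """Toy probability-flow sampling for a 1-D Gaussian "data" distribution.

    A variance-exploding forward diffusion with sigma(t) = sigma_max * t
    gives p_t = N(0, sigma_data^2 + sigma(t)^2), whose exact score is
    -x / (sigma_data^2 + sigma(t)^2).  Integrating the probability-flow ODE
        dx/dt = -0.5 * d[sigma^2]/dt * score(x, t)
    from t = 1 down to t = 0 deterministically maps noise samples to data
    samples; diffusion models replace the exact score by a network.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n) * np.sqrt(sigma_data**2 + sigma_max**2)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        var_t = sigma_data**2 + (sigma_max * t) ** 2
        dsigma2_dt = 2.0 * sigma_max**2 * t
        score = -x / var_t
        x += 0.5 * dsigma2_dt * score * dt     # reverse-time Euler step
    return x
```

The samples end up distributed as N(0, sigma_data²), i.e. the ODE contracts the noisy marginal back onto the data distribution without any injected randomness, which is what makes probability flow sampling cheap and deterministic.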
https://arxiv.org/abs/2404.10688
We explore simple methods for adapting a trained multi-task UNet, which predicts canopy cover and height, to a new geographic setting using remotely sensed data, without the need to train a domain-adaptive classifier or perform extensive fine-tuning. Extending previous research, we followed a selective alignment process to identify similar images in the two geographical domains and then tested an array of data-based unsupervised domain adaptation approaches in a zero-shot setting as well as with a small amount of fine-tuning. We find that the selectively aligned, data-based image matching methods produce promising results in a zero-shot setting, and even more so with a small amount of fine-tuning. These methods outperform both an untransformed baseline and a popular data-based image-to-image translation model. The best-performing methods were pixel distribution adaptation and Fourier domain adaptation on the canopy cover and height tasks, respectively.
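Pixel distribution adaptation, the best performer on the canopy-cover task, can be approximated in a few lines by per-band moment matching (a simplification; full histogram matching is also common in practice):

```python
import numpy as np

def match_pixel_distribution(source, target):
    """Simple per-band pixel distribution adaptation: shift and rescale each
    band of the source image so its mean and standard deviation match the
    target domain's statistics, making the source look "native" to the
    domain the model was trained on.
    """
    out = np.empty_like(source, dtype=float)
    for b in range(source.shape[-1]):
        s = source[..., b].astype(float)
        t = target[..., b].astype(float)
        out[..., b] = (s - s.mean()) / (s.std() + 1e-8) * t.std() + t.mean()
    return out
```

Because only low-order statistics are transferred, spatial structure (the canopy pattern the UNet relies on) is untouched; the same idea underlies the Fourier variant, which swaps low-frequency amplitude spectra instead of moments.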
https://arxiv.org/abs/2404.10626
In this paper, we present SPVLoc, a global indoor localization method that accurately determines the six-dimensional (6D) camera pose of a query image, requires minimal scene-specific prior knowledge, and needs no scene-specific training. Our approach employs a novel matching procedure to localize the perspective camera's viewport, given as an RGB image, within a set of panoramic semantic layout representations of the indoor environment. The panoramas are rendered from an untextured 3D reference model, which only comprises approximate structural information about room shapes, along with door and window annotations. We demonstrate that a straightforward convolutional network structure can successfully achieve image-to-panorama and ultimately image-to-model matching. Through a viewport classification score, we rank reference panoramas and select the best match for the query image. Then, a 6D relative pose is estimated between the chosen panorama and the query image. Our experiments demonstrate that this approach not only efficiently bridges the domain gap but also generalizes well to previously unseen scenes that are not part of the training data. Moreover, it achieves superior localization accuracy compared to state-of-the-art methods and also estimates more degrees of freedom of the camera pose. We will make our source code publicly available at this https URL.
https://arxiv.org/abs/2404.10527
Targeted transfer-based attacks involving adversarial examples pose a significant threat to large visual-language models (VLMs). However, the state-of-the-art (SOTA) transfer-based attacks incur high costs due to excessive iteration counts. Furthermore, the generated adversarial examples exhibit pronounced adversarial noise and demonstrate limited efficacy in evading defense methods such as DiffPure. To address these issues, inspired by score matching, we introduce AdvDiffVLM, which utilizes diffusion models to generate natural, unrestricted adversarial examples. Specifically, AdvDiffVLM employs Adaptive Ensemble Gradient Estimation to modify the score during the diffusion model's reverse generation process, ensuring the adversarial examples produced contain natural adversarial semantics and thus possess enhanced transferability. Simultaneously, to enhance the quality of adversarial examples further, we employ the GradCAM-guided Mask method to disperse adversarial semantics throughout the image, rather than concentrating them in a specific area. Experimental results demonstrate that our method achieves a speedup ranging from 10X to 30X compared to existing transfer-based attack methods, while maintaining superior quality of adversarial examples. Additionally, the generated adversarial examples possess strong transferability and exhibit increased robustness against adversarial defense methods. Notably, AdvDiffVLM can successfully attack commercial VLMs, including GPT-4V, in a black-box manner.
https://arxiv.org/abs/2404.10335