We introduce a simple yet effective approach for separating transmitted and reflected light. Our key insight is that the powerful novel view synthesis capabilities provided by modern inverse rendering methods (e.g., 3D Gaussian splatting) allow one to perform flash/no-flash reflection separation using unpaired measurements -- this relaxation dramatically simplifies image acquisition over conventional paired flash/no-flash reflection separation methods. Through extensive real-world experiments, we demonstrate that our method, Flash-Splat, accurately reconstructs both transmitted and reflected scenes in 3D. Our method outperforms existing 3D reflection separation methods, which do not leverage illumination control, by a large margin. Our project webpage is at this https URL.
https://arxiv.org/abs/2410.02764
We investigate the internal representations of vision-language models (VLMs) to address hallucinations, a persistent challenge despite advances in model size and training. We project VLMs' internal image representations onto their language vocabulary and observe more confident output probabilities on real objects than on hallucinated objects. We additionally use these output probabilities to spatially localize real objects. Building on this approach, we introduce a knowledge erasure algorithm that removes hallucinations by linearly orthogonalizing image features with respect to hallucinated object features. We show that targeted edits to a model's latent representations can reduce hallucinations by up to 25.7% on the COCO2014 dataset while preserving performance. Our findings demonstrate how a deeper understanding of VLMs' latent representations can enhance reliability and enable novel capabilities, such as zero-shot segmentation.
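The linear orthogonalization step described above can be sketched in a few lines: given an image feature vector and the direction associated with a hallucinated object, subtracting the projection removes that object's component while leaving orthogonal content untouched. This is a minimal numpy illustration of the general operation with toy vectors, not the paper's implementation:

```python
import numpy as np

def orthogonalize(image_feat: np.ndarray, halluc_feat: np.ndarray) -> np.ndarray:
    """Remove the component of image_feat that lies along halluc_feat.

    Standard linear projection removal; the paper's knowledge erasure
    applies the same idea to a VLM's latent image features.
    """
    u = halluc_feat / np.linalg.norm(halluc_feat)  # unit direction to erase
    return image_feat - np.dot(image_feat, u) * u

# Toy check: after erasure the feature has no component along the
# hallucinated direction, while the orthogonal part is preserved.
feat = np.array([3.0, 4.0, 0.0])
halluc = np.array([1.0, 0.0, 0.0])
edited = orthogonalize(feat, halluc)
```

After the edit, `edited` is orthogonal to the hallucinated direction by construction.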
https://arxiv.org/abs/2410.02762
It is desirable but challenging to generate content-rich long videos on the scale of minutes. Autoregressive large language models (LLMs) have achieved great success in generating coherent and long sequences of tokens in the domain of natural language processing, while the exploration of autoregressive LLMs for video generation has been limited to generating short videos of several seconds. In this work, we conduct a deep analysis of the challenges that prevent autoregressive LLM-based video generators from generating long videos. Based on these observations and analysis, we propose Loong, a new autoregressive LLM-based video generator that can generate minute-long videos. Specifically, we model the text tokens and video tokens as a unified sequence for autoregressive LLMs and train the model from scratch. We propose progressive short-to-long training with a loss re-weighting scheme to mitigate the loss imbalance problem in long video training. We further investigate inference strategies, including video token re-encoding and sampling strategies, to diminish error accumulation during inference. Our proposed Loong can be trained on 10-second videos and extended to generate minute-long videos conditioned on text prompts, as demonstrated by our results. More samples are available at this https URL.
https://arxiv.org/abs/2410.02757
Contrastive Language-Image Pre-training (CLIP) has been a celebrated method for training vision encoders to generate image/text representations that facilitate various applications. Recently, CLIP has been widely adopted as the vision backbone of multimodal large language models (MLLMs) to connect image inputs for language interactions. The success of CLIP as a vision-language foundation model relies on aligning web-crawled noisy text annotations at the image level. Nevertheless, such criteria may become insufficient for downstream tasks that require fine-grained vision representations, especially when region-level understanding is demanded by MLLMs. In this paper, we improve the localization capability of CLIP with several advances. We propose a pre-training method called Contrastive Localized Language-Image Pre-training (CLOC) that complements CLIP with a region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text pseudo-labels at scale. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for image region recognition and retrieval tasks, and can serve as a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.
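A region-text contrastive loss follows the standard CLIP recipe: matched (region, text) pairs are positives and all other in-batch pairings are negatives. The following is a minimal numpy sketch of a symmetric InfoNCE objective of this kind; the exact loss form, temperature, and batching used by CLOC are not given in the abstract, so this is illustrative only:

```python
import numpy as np

def contrastive_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                     temperature: float = 0.07) -> float:
    """Symmetric InfoNCE over matched embedding pairs.

    Row i of img_emb is assumed to match row i of txt_emb; all other
    pairings in the batch act as negatives.
    """
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature      # cosine similarities, scaled
    n = logits.shape[0]

    def xent(m):
        # cross-entropy with the diagonal (matched pair) as the target
        m = m - m.max(axis=1, keepdims=True)          # numerical stability
        log_probs = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    return 0.5 * (xent(logits) + xent(logits.T))      # both directions

rng = np.random.default_rng(0)
aligned = rng.normal(size=(8, 16))
loss_aligned = contrastive_loss(aligned, aligned)               # perfect match
loss_random = contrastive_loss(aligned, rng.normal(size=(8, 16)))
```

As expected, perfectly aligned pairs yield a much lower loss than random pairings.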
https://arxiv.org/abs/2410.02746
Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.
https://arxiv.org/abs/2410.02730
Alzheimer's disease (AD) is a progressive neurodegenerative disorder with increasing prevalence among the aging population, necessitating early and accurate diagnosis for effective disease management. In this study, we present a novel hybrid deep learning framework that integrates both 2D Convolutional Neural Networks (2D-CNN) and 3D Convolutional Neural Networks (3D-CNN), along with a custom loss function and volumetric data augmentation, to enhance feature extraction and improve classification performance in AD diagnosis. According to extensive experiments, AlzhiNet outperforms standalone 2D and 3D models, highlighting the importance of combining these complementary representations of data. The depth and quality of 3D volumes derived from the augmented 2D slices also significantly influence the model's performance. The results indicate that carefully selecting weighting factors in hybrid predictions is imperative for achieving optimal results. Our framework has been validated on the Magnetic Resonance Imaging (MRI) from Kaggle and MIRIAD datasets, obtaining accuracies of 98.9% and 99.99%, respectively, with an AUC of 100%. Furthermore, AlzhiNet was studied under a variety of perturbation scenarios on the Alzheimer's Kaggle dataset, including Gaussian noise, brightness, contrast, salt and pepper noise, color jitter, and occlusion. The results obtained show that AlzhiNet is more robust to perturbations than ResNet-18, making it an excellent choice for real-world applications. This approach represents a promising advancement in the early diagnosis and treatment planning for Alzheimer's disease.
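The "carefully selected weighting factors in hybrid predictions" mentioned above can be illustrated as a convex combination of the 2D and 3D branches' class probabilities. The fusion rule, weight, and probability values below are assumptions for illustration, not AlzhiNet's actual configuration:

```python
import numpy as np

def hybrid_predict(probs_2d: np.ndarray, probs_3d: np.ndarray,
                   w: float = 0.5) -> np.ndarray:
    """Blend class probabilities from the 2D and 3D branches.

    `w` is the weighting factor the paper says must be chosen carefully;
    this convex-combination rule is a hypothetical stand-in.
    """
    assert 0.0 <= w <= 1.0
    fused = w * probs_2d + (1.0 - w) * probs_3d
    return fused / fused.sum(axis=-1, keepdims=True)  # renormalize

p2d = np.array([0.7, 0.2, 0.1])   # hypothetical per-class probabilities
p3d = np.array([0.5, 0.4, 0.1])
fused = hybrid_predict(p2d, p3d, w=0.6)
```

Sweeping `w` on a validation set is one simple way such a weighting factor could be selected.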
https://arxiv.org/abs/2410.02714
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
https://arxiv.org/abs/2410.02713
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks. LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios. Our experiments demonstrate the model's effectiveness in two key areas: (1) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (2) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities. This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs.
https://arxiv.org/abs/2410.02712
Autoregressive (AR) models have reformulated image generation as next-token prediction, demonstrating remarkable potential and emerging as strong competitors to diffusion models. However, control-to-image generation, akin to ControlNet, remains largely unexplored within AR models. Although a natural approach, inspired by advancements in large language models, is to tokenize control images and prefill the resulting tokens into the autoregressive model before decoding image tokens, it still falls short of ControlNet in generation quality and suffers from inefficiency. To this end, we introduce ControlAR, an efficient and effective framework for integrating spatial controls into autoregressive image generation models. First, we explore control encoding for AR models and propose a lightweight control encoder to transform spatial inputs (e.g., canny edges or depth maps) into control tokens. ControlAR then exploits conditional decoding to generate the next image token conditioned on the per-token fusion between control and image tokens, similar to positional encodings. Compared to prefilling tokens, conditional decoding significantly strengthens the control capability of AR models while maintaining the model's efficiency. Furthermore, ControlAR surprisingly empowers AR models with arbitrary-resolution image generation via conditional decoding and specific controls. Extensive experiments demonstrate the controllability of ControlAR for autoregressive control-to-image generation across diverse inputs, including edges, depths, and segmentation masks. Both quantitative and qualitative results indicate that ControlAR surpasses previous state-of-the-art controllable diffusion models, e.g., ControlNet++. Code, models, and a demo will soon be available at this https URL.
https://arxiv.org/abs/2410.02705
The quest for robust and generalizable machine learning models has driven recent interest in exploiting symmetries through equivariant neural networks. In the context of PDE solvers, recent works have shown that Lie point symmetries can be a useful inductive bias for Physics-Informed Neural Networks (PINNs) through data and loss augmentation. Despite this, directly enforcing equivariance within the model architecture for these problems remains elusive. This is because many PDEs admit non-compact symmetry groups, oftentimes not studied beyond their infinitesimal generators, making them incompatible with most existing equivariant architectures. In this work, we propose Lie aLgebrA Canonicalization (LieLAC), a novel approach that exploits only the action of infinitesimal generators of the symmetry group, circumventing the need for knowledge of the full group structure. To achieve this, we address existing theoretical issues in the canonicalization literature, establishing connections with frame averaging in the case of continuous non-compact groups. Operating within the framework of canonicalization, LieLAC can easily be integrated with unconstrained pre-trained models, transforming inputs to a canonical form before feeding them into the existing model, effectively aligning the input for model inference according to allowed symmetries. LieLAC utilizes standard Lie group descent schemes, achieving equivariance in pre-trained models. Finally, we showcase LieLAC's efficacy on tasks of invariant image classification and Lie point symmetry equivariant neural PDE solvers using pre-trained models.
https://arxiv.org/abs/2410.02698
LLMs are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs' impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first large-scale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models' persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while the EU AI Act and California's SB-1047 aim to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI's societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, available at this https URL, to advance our understanding of AI-driven persuasion and its societal implications.
https://arxiv.org/abs/2410.02653
Accurate 3D object detection in real-world environments requires a huge amount of annotated data with high quality. Acquiring such data is tedious and expensive, and often needs repeated effort when a new sensor is adopted or when the detector is deployed in a new environment. We investigate a new scenario to construct 3D object detectors: learning from the predictions of a nearby unit that is equipped with an accurate detector. For example, when a self-driving car enters a new area, it may learn from other traffic participants whose detectors have been optimized for that area. This setting is label-efficient, sensor-agnostic, and communication-efficient: nearby units only need to share the predictions with the ego agent (e.g., car). Naively using the received predictions as ground-truths to train the detector for the ego car, however, leads to inferior performance. We systematically study the problem and identify viewpoint mismatches and mislocalization (due to synchronization and GPS errors) as the main causes, which unavoidably result in false positives, false negatives, and inaccurate pseudo labels. We propose a distance-based curriculum, first learning from closer units with similar viewpoints and subsequently improving the quality of other units' predictions via self-training. We further demonstrate that an effective pseudo label refinement module can be trained with a handful of annotated data, largely reducing the data quantity necessary to train an object detector. We validate our approach on the recently released real-world collaborative driving dataset, using reference cars' predictions as pseudo labels for the ego car. Extensive experiments including several scenarios (e.g., different sensors, detectors, and domains) demonstrate the effectiveness of our approach toward label-efficient learning of 3D perception from other units' predictions.
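The distance-based curriculum can be sketched as staged selection of pseudo-label sources: nearby units (with similar viewpoints and smaller localization error) are learned from first, and farther units are added in later rounds once self-training can refine their predictions. The thresholds and helper below are hypothetical, not values from the paper:

```python
def curriculum_rounds(units, thresholds=(20.0, 50.0, 100.0)):
    """Group nearby units' pseudo-labels into training rounds by distance.

    `units` is a list of (unit_id, distance_to_ego_m) pairs. Earlier rounds
    use only closer units; later rounds add farther ones. The thresholds
    are illustrative stand-ins for a real curriculum schedule.
    """
    rounds = []
    used = set()
    for t in thresholds:
        batch = [uid for uid, d in units if d <= t and uid not in used]
        used.update(batch)
        rounds.append(batch)
    return rounds

# Hypothetical nearby traffic participants sharing their predictions
units = [("carA", 10.0), ("carB", 35.0), ("truckC", 80.0), ("busD", 150.0)]
rounds = curriculum_rounds(units)
```

Units beyond the last threshold (here, `busD`) are never used, mirroring the idea that distant predictions are too unreliable to serve as pseudo-labels.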
https://arxiv.org/abs/2410.02646
Recent advances in robotics are pushing real-world autonomy, enabling robots to perform long-term and large-scale missions. A crucial component of successful missions is the incorporation of loop closures through place recognition, which effectively mitigates accumulated pose estimation drift. Despite computational advancements, optimizing performance for real-time deployment remains challenging, especially in resource-constrained mobile robots and multi-robot systems, since conventional keyframe sampling practices in place recognition often retain redundant information or overlook relevant data, as they rely on fixed sampling intervals or operate directly in 3D space instead of the feature space. To address these concerns, we introduce the concept of sample space in place recognition and demonstrate how different sampling techniques affect the query process and overall performance. We then present a novel keyframe sampling approach for LiDAR-based place recognition, which focuses on redundancy minimization and information preservation in the hyper-dimensional descriptor space. This approach is applicable to both learning-based and handcrafted descriptors, and through experimental validation across multiple datasets and descriptor frameworks, we demonstrate the effectiveness of our proposed method, showing that it can jointly minimize redundancy and preserve essential information in real time. The proposed approach maintains robust performance across various datasets without requiring parameter tuning, contributing to more efficient and reliable place recognition for a wide range of robotic applications.
https://arxiv.org/abs/2410.02643
Diffusion-based extreme image compression methods have achieved impressive performance at extremely low bitrates. However, constrained by the iterative denoising process that starts from pure noise, these methods are limited in both fidelity and efficiency. To address these two issues, we present Relay Residual Diffusion Extreme Image Compression (RDEIC), which leverages compressed feature initialization and residual diffusion. Specifically, we first use the compressed latent features of the image with added noise, instead of pure noise, as the starting point to eliminate the unnecessary initial stages of the denoising process. Second, we design a novel relay residual diffusion that reconstructs the raw image by iteratively removing the added noise and the residual between the compressed and target latent features. Notably, our relay residual diffusion network seamlessly integrates pre-trained stable diffusion to leverage its robust generative capability for high-quality reconstruction. Third, we propose a fixed-step fine-tuning strategy to eliminate the discrepancy between the training and inference phases, further improving the reconstruction quality. Extensive experiments demonstrate that the proposed RDEIC achieves state-of-the-art visual quality and outperforms existing diffusion-based extreme image compression methods in both fidelity and efficiency. The source code will be provided at this https URL.
https://arxiv.org/abs/2410.02640
Accurate online multiple-camera vehicle tracking is essential for intelligent transportation systems, autonomous driving, and smart city applications. Like single-camera multiple-object tracking, it is commonly formulated as a graph problem of tracking-by-detection. Within this framework, existing online methods usually consist of two-stage procedures that cluster temporally first, then spatially, or vice versa. This is computationally expensive and prone to error accumulation. We introduce a graph representation that allows spatial-temporal clustering in a single, combined step: New detections are spatially and temporally connected with existing clusters. By keeping sparse appearance and positional cues of all detections in a cluster, our method can compare clusters based on the strongest available evidence. The final tracks are obtained online using a simple multicut assignment procedure. Our method does not require any training on the target scene, pre-extraction of single-camera tracks, or additional annotations. Notably, we outperform the online state-of-the-art on the CityFlow dataset in terms of IDF1 by more than 14%, and on the Synthehicle dataset by more than 25%, respectively. The code is publicly available.
https://arxiv.org/abs/2410.02638
The evaluation of segmentation performance is a common task in biomedical image analysis, with its importance emphasized in recently released metric-selection guidelines and computing frameworks. To quantitatively evaluate the alignment of two segmentations, researchers commonly resort to counting metrics, such as the Dice similarity coefficient, or distance-based metrics, such as the Hausdorff distance, which are usually computed by publicly available open-source tools with an inherent assumption that these tools provide consistent results. In this study we questioned this assumption, performing a systematic implementation analysis along with quantitative experiments on real-world clinical data to compare 11 open-source tools for distance-based metric computation against our highly accurate mesh-based reference implementation. The results revealed statistically significant differences among all open-source tools, which is both surprising and concerning, since it calls the validity of existing studies into question. Besides identifying the main sources of variation, we also provide recommendations for distance-based metric computation.
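As a concrete reference point, the symmetric Hausdorff distance between two finite point sets is itself unambiguous; the tool-to-tool variation the study reports presumably arises from the surrounding implementation choices (e.g., how segmentation borders are extracted before the distance is computed). A minimal numpy version over explicit point sets:

```python
import numpy as np

def directed_hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    """Max over points in a of the distance to the nearest point in b."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise
    return d.min(axis=1).max()

def hausdorff(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Two tiny 2-D "border" point sets standing in for segmentation contours
a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[0.0, 0.0], [4.0, 0.0]])
hd = hausdorff(a, b)
```

Note that the asymmetry matters: the directed distance from `a` to `b` is 1, but from `b` to `a` it is 3, so the symmetric value is 3 — tools that compute only one direction will disagree.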
https://arxiv.org/abs/2410.02630
We present GI-GS, a novel inverse rendering framework that leverages 3D Gaussian Splatting (3DGS) and deferred shading to achieve photo-realistic novel view synthesis and relighting. In inverse rendering, accurately modeling the shading processes of objects is essential for achieving high-fidelity results. Therefore, it is critical to incorporate global illumination to account for indirect lighting that reaches an object after multiple bounces across the scene. Previous 3DGS-based methods have attempted to model indirect lighting by characterizing indirect illumination as learnable lighting volumes or additional attributes of each Gaussian, while using baked occlusion to represent shadow effects. These methods, however, fail to accurately model the complex physical interactions between light and objects, making it impossible to construct realistic indirect illumination during relighting. To address this limitation, we propose to calculate indirect lighting using efficient path tracing with deferred shading. In our framework, we first render a G-buffer to capture the detailed geometry and material properties of the scene. Then, we perform physically-based rendering (PBR) only for direct lighting. With the G-buffer and previous rendering results, the indirect lighting can be calculated through lightweight path tracing. Our method effectively models indirect lighting under any given lighting conditions, thereby achieving better novel view synthesis and relighting. Quantitative and qualitative results show that our GI-GS outperforms existing baselines in both rendering quality and efficiency.
https://arxiv.org/abs/2410.02619
We introduce DHVC 2.0, an enhanced version of Deep Hierarchical Video Compression. This single-model neural video codec operates across a broad range of bitrates, delivering not only compression performance superior to representative methods but also impressive complexity efficiency, enabling real-time processing with a significantly smaller memory footprint on standard GPUs. These remarkable advancements stem from the use of hierarchical predictive coding. Each video frame is uniformly transformed into multiscale representations through hierarchical variational autoencoders. For the feature representation of a frame at a specific scale, the corresponding latent residual variables are generated by referencing lower-scale spatial features from the same frame, and are then conditionally entropy-encoded using a probabilistic model whose parameters are predicted from same-scale temporal references in previous frames and lower-scale spatial references in the current frame. This feature-space processing operates from the lowest to the highest scale of each frame, completely eliminating the need for the complexity-intensive motion estimation and compensation techniques that have been standard in video codecs for decades. The hierarchical approach facilitates parallel processing, accelerating both encoding and decoding, and supports transmission-friendly progressive decoding, making it particularly advantageous for networked video applications in the presence of packet loss. Source code will be made available.
https://arxiv.org/abs/2410.02598
The total variation (TV) method is an image denoising technique that reduces noise by minimizing the total variation of the image, a measure of the variation in pixel intensities. The TV method has been widely applied in image processing and computer vision for its ability to preserve edges and enhance image quality. In this paper, we propose an improved TV model for image denoising, together with a numerical algorithm to carry out the procedure, which is particularly effective at removing several types of noise and their combinations. Our improved model admits a unique solution, and the associated numerical algorithm guarantees convergence. Numerical experiments demonstrate improved effectiveness and denoising quality compared to other TV models. These encouraging results further enhance the utility of the TV method in image processing.
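For concreteness, a textbook gradient-descent scheme for the classical smoothed-TV model illustrates the kind of procedure involved. This is a baseline sketch of the standard ROF-style energy, not the improved model or algorithm proposed in the paper; the parameters are illustrative:

```python
import numpy as np

def tv(u):
    """Anisotropic total variation: sum of absolute neighbor differences."""
    return np.abs(np.diff(u, axis=0)).sum() + np.abs(np.diff(u, axis=1)).sum()

def tv_denoise(f, lam=0.2, step=0.1, iters=200, eps=1e-3):
    """Gradient descent on 0.5*||u - f||^2 + lam * TV_eps(u).

    TV_eps smooths |g| as sqrt(g^2 + eps) so the energy is differentiable.
    """
    u = f.copy()
    for _ in range(iters):
        gx = np.diff(u, axis=0)               # vertical differences
        gy = np.diff(u, axis=1)               # horizontal differences
        nx = gx / np.sqrt(gx ** 2 + eps)      # smoothed sign of gx
        ny = gy / np.sqrt(gy ** 2 + eps)
        div = np.zeros_like(u)                # divergence of (nx, ny)
        div[:-1, :] += nx
        div[1:, :] -= nx
        div[:, :-1] += ny
        div[:, 1:] -= ny
        u = u - step * ((u - f) - lam * div)  # descend the energy
    return u

# Noisy step edge: TV denoising should suppress noise but keep the edge.
rng = np.random.default_rng(1)
clean = np.zeros((16, 16))
clean[:, 8:] = 1.0
noisy = clean + 0.2 * rng.normal(size=clean.shape)
denoised = tv_denoise(noisy)
```

The denoised image has lower total variation than the noisy input and sits closer to the clean step image, which is exactly the edge-preserving behavior the TV prior is prized for.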
https://arxiv.org/abs/2410.02587
Denoising is one of the fundamental steps of the processing pipeline that converts data captured by a camera sensor into a display-ready image or video. It is generally performed early in the pipeline, usually before demosaicking, although studies swapping their order or even conducting them jointly have been proposed. With the advent of deep learning, the quality of denoising algorithms has steadily increased. Even so, modern neural networks still have a hard time adapting to new noise levels and scenes, which is indispensable for real-world applications. With this in mind, we propose a self-similarity-based denoising scheme that weights both a pre- and a post-demosaicking denoiser for Bayer-patterned CFA video data. We show that a balance between the two leads to better image quality, and we empirically find that higher noise levels benefit from a stronger pre-demosaicking influence. We also integrate temporal trajectory prefiltering steps before each denoiser, which further improve texture reconstruction. The proposed method only requires an estimation of the noise model at the sensor, accurately adapts to any noise level, and is competitive with the state of the art, making it suitable for real-world videography.
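The pre-/post-demosaicking balance can be sketched as a noise-dependent convex combination of the two denoisers' outputs, with the pre-demosaicking weight growing with the noise level, as the finding above suggests. The monotone mapping and scale below are illustrative assumptions, not the paper's empirically tuned balance:

```python
import numpy as np

def blend_denoisers(pre_out: np.ndarray, post_out: np.ndarray,
                    noise_level: float) -> np.ndarray:
    """Weight pre- and post-demosaicking denoiser outputs.

    Higher noise levels shift influence toward the pre-demosaicking
    denoiser; the linear mapping and the 50.0 scale are hypothetical.
    """
    w_pre = min(1.0, max(0.0, noise_level / 50.0))  # noise-dependent weight
    return w_pre * pre_out + (1.0 - w_pre) * post_out

# Toy frames standing in for the two denoisers' outputs
pre = np.ones((2, 2))
post = np.zeros((2, 2))
low = blend_denoisers(pre, post, noise_level=5.0)    # mostly post output
high = blend_denoisers(pre, post, noise_level=50.0)  # fully pre output
```

At low noise the post-demosaicking result dominates; at high noise the blend collapses onto the pre-demosaicking result.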
https://arxiv.org/abs/2410.02572