Traditional video action detectors typically adopt a two-stage pipeline, where a person detector first generates actor boxes and 3D RoIAlign then extracts actor-specific features for classification. This detection paradigm requires multi-stage training and inference, and its feature sampling is constrained inside the box, failing to effectively leverage the richer context outside it. Recently, a few query-based action detectors have been proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling and decoding, and thus suffer from inferior performance or slower convergence. In this paper, we propose two core designs for a more flexible one-stage sparse action detector. First, we present a query-based adaptive feature sampling module, which endows the detector with the flexibility to mine a group of discriminative features from the entire spatio-temporal domain. Second, we devise a decoupled feature mixing module, which dynamically attends to and mixes video features along the spatial and temporal dimensions respectively for better feature decoding. Based on these designs, we instantiate two detection pipelines, namely STMixer-K for keyframe action detection and STMixer-T for action tubelet detection. Without bells and whistles, our STMixer detectors obtain state-of-the-art results on five challenging spatio-temporal action detection benchmarks for keyframe action detection and action tubelet detection.
https://arxiv.org/abs/2404.09842
We compare the $(1,\lambda)$-EA and the $(1 + \lambda)$-EA on the recently introduced benchmark DisOM, which is the OneMax function with randomly planted local optima. Previous work showed that if all local optima have the same relative height, then the plus strategy never loses more than a factor $O(n\log n)$ compared to the comma strategy. Here we show that even small random fluctuations in the heights of the local optima have a devastating effect on the plus strategy and lead to super-polynomial runtimes. On the other hand, due to their ability to escape local optima, comma strategies are unaffected by the height of the local optima and remain efficient. Our results hold for a broad class of possible distortions and show that the plus strategy, but not the comma strategy, is generally deceived by sparse unstructured fluctuations of a smooth landscape.
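For intuition, the two selection schemes can be sketched in a few lines of Python. The fitness functions and the distortion model below are illustrative stand-ins, not the paper's exact definition of DisOM:

```python
import random

def onemax(x):
    """Plain OneMax: the number of one-bits in the string."""
    return sum(x)

def distorted_onemax(x, distortions):
    """OneMax with planted local optima (a toy stand-in for DisOM):
    selected bit strings receive a fitness bonus, turning them into
    local optima of the distorted landscape."""
    return onemax(x) + distortions.get(tuple(x), 0)

def mutate(x, rate):
    """Standard bit-wise mutation: flip each bit independently."""
    return [b ^ (random.random() < rate) for b in x]

def step(parent, lam, f, plus):
    """One generation of the (1,lambda)-EA (plus=False) or the
    (1+lambda)-EA (plus=True)."""
    n = len(parent)
    offspring = [mutate(parent, 1.0 / n) for _ in range(lam)]
    # The plus strategy keeps the parent in the selection pool,
    # so it can never accept a fitness decrease -- which is exactly
    # what lets it get trapped on planted local optima.
    pool = offspring + ([parent] if plus else [])
    return max(pool, key=f)
```

Because the comma strategy may replace the parent with a worse offspring, it can leave a distorted point, while the plus strategy must wait for an offspring at least as fit as the (distorted) parent.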
https://arxiv.org/abs/2404.09687
Human beings construct their perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods on digital computers face both software and hardware challenges. On the software front, difficulties arise from the storage inefficiency of conventional explicit signal representations. Hardware obstacles include the von Neumann bottleneck, which limits data transfer between the CPU and memory, and the limitations of CMOS circuits in supporting parallel processing. We propose a systematic approach with software-hardware co-optimization for signal reconstruction from sparse inputs. Software-wise, we employ neural fields to implicitly represent signals via neural networks, further compressed using low-rank decomposition and structured pruning. Hardware-wise, we design a resistive memory-based computing-in-memory (CIM) platform, featuring a Gaussian Encoder (GE) and an MLP Processing Engine (PE). The GE harnesses the intrinsic stochasticity of resistive memory for efficient input encoding, while the PE achieves precise weight mapping through a Hardware-Aware Quantization (HAQ) circuit. We demonstrate the system's efficacy on a 40nm 256Kb resistive memory-based in-memory computing macro, achieving substantial improvements in energy efficiency and parallelism without compromising reconstruction quality in tasks such as 3D CT sparse reconstruction, novel view synthesis, and novel view synthesis for dynamic scenes. This work advances AI-driven signal restoration technology and paves the way for future efficient and robust medical AI and 3D vision applications.
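The benefit of low-rank decomposition for compressing such a network can be illustrated with a simple parameter count (the layer sizes and rank below are made up for illustration):

```python
def dense_params(d_in, d_out):
    """Parameters of a dense layer weight W of shape (d_out, d_in)."""
    return d_out * d_in

def low_rank_params(d_in, d_out, r):
    """Parameters after factorizing W into B (d_out x r) @ A (r x d_in)."""
    return d_out * r + r * d_in

# A hypothetical 256x256 MLP layer compressed at rank 8:
full = dense_params(256, 256)              # 65536 parameters
compressed = low_rank_params(256, 256, 8)  # 4096 parameters, a 16x reduction
```

The same count also explains the hardware payoff: fewer weights means fewer resistive memory cells to program and read.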
https://arxiv.org/abs/2404.09613
Vision-based perception for autonomous driving requires explicit modeling of a 3D space, into which 2D latent representations are mapped and on which subsequent 3D operators are applied. However, operating on dense latent spaces introduces cubic time and space complexity, which limits scalability in terms of perception range or spatial resolution. Existing approaches compress the dense representation using projections such as Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient, these projections result in information loss, especially for tasks like semantic occupancy prediction. To address this, we propose SparseOcc, an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. First, a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Second, a feature pyramid and sparse interpolation enhance each scale with information from the others. Finally, the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction in FLOPs over the dense baseline. Interestingly, it also improves accuracy, from 12.8% to 14.1% mIoU, which can in part be attributed to the sparse representation's ability to avoid hallucinations on empty voxels.
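As a rough sketch of why sparse operators escape the cubic cost, consider a minimal sparse 3D convolution that stores only occupied voxels in a dictionary (a toy all-ones kernel, not the paper's spatially decomposed kernels):

```python
def sparse_conv3d_sum(features, offsets):
    """Minimal sparse 3D convolution with an all-ones kernel.
    `features` maps occupied voxel coordinates -> value; empty voxels
    are never visited, so the cost scales with the number of occupied
    voxels times the kernel size, not with the dense grid volume."""
    out = {}
    for (x, y, z) in features:
        acc = 0.0
        for dx, dy, dz in offsets:
            key = (x + dx, y + dy, z + dz)
            if key in features:        # skip empty neighbors entirely
                acc += features[key]
        out[(x, y, z)] = acc
    return out
```

A dense implementation would instead loop over every cell of an H x W x D grid, which is exactly the cubic blow-up the sparse latent representation avoids.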
https://arxiv.org/abs/2404.09502
Broad-scale marine surveys performed by underwater vehicles significantly increase the availability of coral reef imagery; however, it is costly and time-consuming for domain experts to label these images. Point label propagation is an approach that leverages existing image data labeled with sparse point labels; the resulting augmented ground truth is then used to train a semantic segmentation model. Here, we first demonstrate that recent advances in foundation models enable the generation of multi-species coral augmented ground truth masks using denoised DINOv2 features and K-Nearest Neighbors (KNN), without the need for any pre-training or custom-designed algorithms. For extremely sparsely labeled images, we propose a labeling regime based on human-in-the-loop principles, resulting in significant improvements in annotation efficiency: if only 5 point labels per image are available, our proposed human-in-the-loop approach improves on the state-of-the-art by 17.3% for pixel accuracy and 22.6% for mIoU, and by 10.6% and 19.1% when 10 point labels per image are available. Even if the human-in-the-loop labeling regime is not used, the denoised DINOv2 features with a KNN outperform the prior state-of-the-art by 3.5% for pixel accuracy and 5.7% for mIoU (5 grid points). We also provide a detailed analysis of how point labeling style and the quantity of points per image affect point label propagation quality, and give general recommendations for maximizing point label efficiency.
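The KNN propagation step can be sketched in plain Python. For brevity, the "feature" of a pixel here is just its image coordinate; the paper instead measures distances in denoised DINOv2 feature space, but the voting logic is the same:

```python
from collections import Counter

def knn_propagate(labeled, pixels, k=3):
    """Propagate sparse point labels to every queried pixel by
    majority vote over the k nearest labeled points.

    labeled: dict mapping (row, col) of a labeled point -> class label
    pixels:  list of (row, col) positions to label
    """
    out = {}
    for p in pixels:
        # Squared Euclidean distance in coordinate space (a stand-in
        # for distance in DINOv2 feature space).
        nearest = sorted(
            labeled,
            key=lambda q: (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2,
        )[:k]
        votes = Counter(labeled[q] for q in nearest)
        out[p] = votes.most_common(1)[0][0]
    return out
```

Run over every pixel, this yields the dense augmented ground truth mask that the segmentation model is then trained on.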
https://arxiv.org/abs/2404.09406
Large language models (LLMs) can solve challenging tasks. However, their inference computation on modern GPUs is highly inefficient due to the increasing number of tokens they must attend to as they generate new ones. To address this inefficiency, we capitalize on LLMs' problem-solving capabilities to optimize their own inference-time efficiency. We demonstrate with two specific tasks: (a) evaluating complex arithmetic expressions and (b) summarizing news articles. For both tasks, we create custom datasets to fine-tune an LLM. The goal of fine-tuning is twofold: first, to make the LLM learn to solve the evaluation or summarization task, and second, to train it to identify the minimal attention spans required for each step of the task. As a result, the fine-tuned model is able to convert these self-identified minimal attention spans into sparse attention masks on-the-fly during inference. We develop a custom CUDA kernel to take advantage of the reduced context to attend to. We demonstrate that using this custom CUDA kernel improves the throughput of LLM inference by 28%. Our work presents an end-to-end demonstration showing that training LLMs to self-select their attention spans speeds up autoregressive inference in solving real-world tasks.
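A minimal illustration of turning self-identified attention spans into a sparse causal mask (the list-of-lists representation is hypothetical; the paper applies such masks inside a custom CUDA kernel):

```python
def span_attention_mask(spans):
    """Build a per-token attention mask from self-identified spans.

    spans[i] is the minimal number of most recent tokens (including
    token i itself) that token i needs to attend to. Token i may
    attend to positions max(0, i - spans[i] + 1) .. i and never to
    future tokens. Returns a list of boolean rows."""
    n = len(spans)
    mask = []
    for i, s in enumerate(spans):
        row = [max(0, i - s + 1) <= j <= i for j in range(n)]
        mask.append(row)
    return mask
```

The smaller the spans, the fewer key/value entries each generation step must read, which is where the reported 28% throughput gain comes from.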
https://arxiv.org/abs/2404.09336
Recent trends have shown that autonomous agents, such as Autonomous Ground Vehicles (AGVs), Unmanned Aerial Vehicles (UAVs), and mobile robots, effectively improve human productivity in solving diverse tasks. However, since these agents are typically powered by portable batteries, they require extremely low power/energy consumption to operate over a long lifespan. To address this challenge, neuromorphic computing has emerged as a promising solution, where bio-inspired Spiking Neural Networks (SNNs) use spikes from event-based cameras or data conversion pre-processing to perform sparse computations efficiently. However, studies of SNN deployments for autonomous agents are still at an early stage; hence, the optimization stages for enabling efficient embodied SNN deployments for autonomous agents have not been defined systematically. Toward this, we propose a novel framework called SNN4Agents, which consists of a set of optimization techniques for designing energy-efficient embodied SNNs targeting autonomous agent applications. SNN4Agents employs weight quantization, timestep reduction, and attention window reduction to jointly improve energy efficiency, reduce the memory footprint, and optimize processing latency, while maintaining high accuracy. In the evaluation, we investigate use cases of event-based car recognition and explore the trade-offs among accuracy, latency, memory, and energy consumption. The experimental results show that our proposed framework maintains high accuracy (i.e., 84.12%) with 68.75% memory saving, 3.58x speed-up, and 4.03x energy-efficiency improvement compared to the state-of-the-art work on the NCARS dataset, thereby enabling energy-efficient embodied SNN deployments for autonomous agents.
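Of the three optimizations, weight quantization is the easiest to sketch. A minimal uniform symmetric quantizer (the bit widths and the [-1, 1] weight range are illustrative assumptions, not the framework's exact scheme):

```python
def quantize(w, bits):
    """Uniform symmetric weight quantization: clamp a float weight to
    [-1, 1] and snap it onto a signed integer grid of the given bit
    width, then return the dequantized value."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits
    q = round(max(-1.0, min(1.0, w)) * levels)
    return q / levels
```

Storing, say, 8-bit instead of 32-bit weights already gives a 4x memory reduction; combined with fewer timesteps and a smaller attention window, this is the kind of saving the framework trades off against accuracy.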
https://arxiv.org/abs/2404.09331
Accurate completion and denoising of roof height maps are crucial to reconstructing high-quality 3D buildings. Repairing sparse points can enhance low-cost sensor use and reduce UAV flight overlap. RoofDiffusion is a new end-to-end self-supervised diffusion technique for robustly completing roof height maps, including particularly difficult ones. RoofDiffusion leverages widely available curated footprints and can thus handle up to 99\% point sparsity and 80\% roof area occlusion (regional incompleteness). A variant, No-FP RoofDiffusion, simultaneously predicts building footprints and heights. Both quantitatively outperform state-of-the-art unguided depth completion and representative inpainting methods for Digital Elevation Models (DEMs), on both a roof-specific benchmark and the BuildingNet dataset. Qualitative assessments show the effectiveness of RoofDiffusion on datasets with real-world scans, including AHN3, Dales3D, and USGS 3DEP LiDAR. Tested with the leading City3D algorithm, preprocessing height maps with RoofDiffusion noticeably improves 3D building reconstruction. RoofDiffusion is complemented by a new dataset of 13k complex roof geometries focusing on long-tail issues in remote sensing, a novel simulation of tree occlusion, and a wide variety of large-area roof cut-outs for data augmentation and benchmarking.
https://arxiv.org/abs/2404.09290
Visual relocalization is a key technique for autonomous driving, robotics, and virtual/augmented reality. After decades of exploration, absolute pose regression (APR), scene coordinate regression (SCR), and hierarchical methods (HMs) have become the most popular frameworks. However, despite their high efficiency, APRs and SCRs have limited accuracy, especially in large-scale outdoor scenes; HMs are accurate but need to store a large number of 2D descriptors for matching, resulting in poor efficiency. In this paper, we propose an efficient and accurate framework, called VRS-NeRF, for visual relocalization with sparse neural radiance fields. Specifically, we introduce an explicit geometric map (EGM) for 3D map representation and an implicit learning map (ILM) for sparse patch rendering. In the localization process, the EGM provides priors of sparse 2D points, and the ILM utilizes these sparse points to render patches with sparse NeRFs for matching. This allows us to discard a large number of 2D descriptors and thus reduce the map size. Moreover, rendering patches only for useful points, rather than for all pixels in the whole image, reduces the rendering time significantly. This framework inherits the accuracy of HMs while discarding their low efficiency. Experiments on the 7Scenes, CambridgeLandmarks, and Aachen datasets show that our method gives much better accuracy than APRs and SCRs, and performance close to HMs while being much more efficient.
https://arxiv.org/abs/2404.09271
Recent progress in text-to-3D creation has been propelled by integrating the potent prior of diffusion models from text-to-image generation into the 3D domain. Nevertheless, generating 3D scenes characterized by multiple instances and intricate arrangements remains challenging. In this study, we present DreamScape, a method for creating highly consistent 3D scenes solely from textual descriptions, leveraging the strong 3D representation capabilities of Gaussian Splatting and the complex arrangement abilities of large language models (LLMs). Our approach involves a 3D Gaussian Guide ($3{DG^2}$) for scene representation, consisting of semantic primitives (objects) and their spatial transformations and relationships, derived directly from text prompts using LLMs. This compositional representation allows for local-to-global optimization of the entire scene. A progressive scale control is tailored during local object generation, ensuring that objects of different sizes and densities adapt to the scene; this addresses the training instability issue arising from simple blending in the subsequent global optimization stage. To mitigate potential biases of LLM priors, we model collision relationships between objects at the global level, enhancing physical correctness and overall realism. Additionally, to generate pervasive objects like rain and snow distributed extensively across the scene, we introduce a sparse initialization and densification strategy. Experiments demonstrate that DreamScape offers high usability and controllability, enabling the generation of high-fidelity 3D scenes from text prompts alone and achieving state-of-the-art performance compared to other methods.
https://arxiv.org/abs/2404.09227
Large language models like ChatGPT have shown substantial progress in natural language understanding and generation, proving valuable across various disciplines, including the medical field. Despite these advancements, challenges persist due to the complexity and diversity inherent in medical tasks, which often require multi-task learning capabilities. Previous approaches, although beneficial, fall short in real-world applications because they necessitate task-specific annotations at inference time, limiting broader generalization. This paper introduces MING-MOE, a novel Mixture-of-Experts~(MoE)-based medical large language model designed to manage diverse and complex medical tasks without requiring task-specific annotations, thus enhancing its usability across extensive datasets. MING-MOE employs a Mixture of Low-Rank Adaptation (MoLoRA) technique, allowing for efficient parameter usage by keeping the base model parameters frozen while adapting through a minimal set of trainable parameters. We demonstrate that MING-MOE achieves state-of-the-art (SOTA) performance on over 20 medical tasks, a significant improvement over existing models. This approach not only extends the capabilities of medical language models but also improves inference efficiency.
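The low-rank adaptation underlying MoLoRA follows the standard LoRA formulation $y = Wx + BAx$, where $W$ is frozen and only the small factors $A$ and $B$ are trained. A dependency-free sketch with toy dimensions (the mixture routing across multiple such adapters is omitted here):

```python
def lora_forward(x, W, A, B):
    """Forward pass with a LoRA adapter: y = W x + B (A x).
    W (d_out x d_in) stays frozen; only A (r x d_in) and
    B (d_out x r) are trainable, so the adapter adds just
    r * (d_in + d_out) parameters."""
    def matvec(M, v):
        return [sum(m * u for m, u in zip(row, v)) for row in M]
    base = matvec(W, x)            # frozen base-model path
    delta = matvec(B, matvec(A, x))  # low-rank trainable path
    return [b + d for b, d in zip(base, delta)]
```

Because the base path is untouched, many task-specialized adapters can share one backbone, which is what makes the expert mixture parameter-efficient.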
https://arxiv.org/abs/2404.09027
Despite the significant demand for assistive technology among vulnerable groups (e.g., the elderly, children, and the disabled) in daily tasks, research into advanced AI-driven assistive solutions that genuinely accommodate their diverse needs remains sparse. Traditional human-machine interaction tasks often require machines to simply help, without nuanced consideration of human abilities and feelings, such as the opportunity for practice and learning, a sense of self-improvement, and self-esteem. Addressing this gap, we define a pivotal and novel challenge, Smart Help, which aims to provide proactive yet adaptive support to human agents with diverse disabilities and dynamic goals across various tasks and environments. To establish this challenge, we leverage AI2-THOR to build a new interactive 3D realistic household environment for the Smart Help task. We introduce an innovative opponent modeling module that provides a nuanced understanding of the main agent's capabilities and goals, in order to optimize the assisting agent's helping policy. Rigorous experiments validate the efficacy of our model components and show the superiority of our holistic approach over established baselines. Our findings illustrate the potential of AI-imbued assistive robots in improving the well-being of vulnerable groups.
https://arxiv.org/abs/2404.09001
Large Language Models (LLMs) have demonstrated significant potential in performing multiple tasks in multimedia applications, ranging from content generation to interactive entertainment and artistic creation. However, the diversity of downstream tasks in multitask scenarios presents substantial adaptation challenges for LLMs. While traditional methods often succumb to knowledge confusion on their monolithic dense models, Mixture-of-Experts (MoE) has emerged as a promising solution with its sparse architecture for effective task decoupling. Inspired by principles of human cognitive neuroscience, we design a novel framework \texttt{Intuition-MoR1E} that leverages the inherent semantic clustering of instances to mimic how the human brain handles multiple tasks, offering implicit guidance to the router for optimized feature allocation. Moreover, we introduce a cutting-edge Rank-1 Experts formulation designed to manage a spectrum of intuitions, demonstrating enhanced parameter efficiency and effectiveness in multitask LLM finetuning. Extensive experiments demonstrate that Intuition-MoR1E achieves superior efficiency and a 2.15\% overall accuracy improvement across 14 public datasets over other state-of-the-art baselines.
https://arxiv.org/abs/2404.08985
Intelligent vehicle systems require a deep understanding of the interplay between road conditions, surrounding entities, and the ego vehicle's driving behavior for safe and efficient navigation. This is particularly critical in developing countries, where traffic situations are often dense and unstructured, with heterogeneous road occupants. Existing datasets, predominantly geared towards structured and sparse traffic scenarios, fall short of capturing the complexity of driving in such environments. To fill this gap, we present IDD-X, a large-scale dual-view driving video dataset. With 697K bounding boxes, 9K important object tracks, and 1-12 objects per video, IDD-X offers comprehensive ego-relative annotations for multiple important road objects, covering 10 object categories and 19 explanation label categories. The dataset also incorporates rearview information to provide a more complete representation of the driving environment. We further introduce custom-designed deep networks for localizing multiple important objects and predicting per-object explanations. Overall, our dataset and prediction models form a foundation for studying how road conditions and surrounding entities affect driving behavior in complex traffic situations.
https://arxiv.org/abs/2404.08561
This study explores object detection in historical aerial photographs of Namibia to identify long-term environmental changes. Specifically, we aim to identify key objects -- \textit{Waterholes}, \textit{Omuti homesteads}, and \textit{Big trees} -- around Oshikango in Namibia using sub-meter gray-scale aerial imagery from 1943 and 1972. In this work, we propose a workflow for analyzing historical aerial imagery using a deep semantic segmentation model trained on sparse hand-labels. To this end, we employ a number of strategies, including class-weighting, pseudo-labeling, and empirical p-value-based filtering, to balance skewed and sparse representations of objects in the ground truth data. Results demonstrate the benefits of these different training strategies, yielding an average $F_1=0.661$ and $F_1=0.755$ over the three objects of interest for the 1943 and 1972 imagery, respectively. We also found that the average size of Waterholes and Big trees increased, while the average size of Omutis decreased, between 1943 and 1972, reflecting some of the local effects of the massive post-Second World War economic, agricultural, demographic, and environmental changes. This work also highlights the untapped potential of historical aerial photographs for understanding long-term environmental changes beyond Namibia (and Africa). Given the lack of adequate satellite technology in the past, archival aerial photography offers a great alternative for uncovering decades-long environmental changes.
https://arxiv.org/abs/2404.08544
The latest regularized Neural Radiance Field (NeRF) approaches produce poor geometry and view extrapolation on multiview stereo (MVS) benchmarks such as ETH3D. In this paper, we aim to create 3D models that provide accurate geometry and view synthesis, partially closing the large geometric performance gap between NeRF and traditional MVS methods. We propose a patch-based approach that effectively leverages monocular surface normal and relative depth predictions. The patch-based ray sampling also enables appearance regularization via normalized cross-correlation (NCC) and structural similarity (SSIM) between randomly sampled virtual and training views. We further show that "density restrictions" based on sparse structure-from-motion points can greatly improve geometric accuracy with only a slight drop in novel view synthesis metrics. Our experiments show 4x the performance of RegNeRF and 8x that of FreeNeRF on average F1@2cm on the ETH3D MVS benchmark, suggesting a fruitful research direction for improving the geometric accuracy of NeRF-based models, and shedding light on a potential future approach that could enable NeRF-based optimization to eventually outperform traditional MVS.
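The NCC term used for appearance regularization is standard; for two equally-sized patches (flattened to lists), it is the correlation of mean-centered intensities:

```python
def ncc(a, b):
    """Normalized cross-correlation between two equally-sized patches,
    flattened to lists of intensities. Ranges from -1 to 1; assumes
    neither patch is constant (otherwise the denominator is zero)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = sum((x - ma) ** 2 for x in a) ** 0.5
    db = sum((y - mb) ** 2 for y in b) ** 0.5
    return num / (da * db)
```

Identical patches, or patches related by an affine intensity change, score 1, which is why NCC is a useful consistency measure between views with different exposure.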
https://arxiv.org/abs/2404.08252
Face attribute editing plays a pivotal role in various applications. However, existing methods encounter challenges in achieving high-quality results while preserving identity, editing faithfulness, and temporal consistency. These challenges are rooted in issues with the training pipeline, including limited supervision, architecture design, and optimization strategy. In this work, we introduce S3Editor, a Sparse Semantic-disentangled Self-training framework for face video editing. S3Editor is a generic solution that comprehensively addresses these challenges with three key contributions. First, S3Editor adopts a self-training paradigm to enhance the training process through semi-supervision. Second, we propose a semantic-disentangled architecture with a dynamic routing mechanism that accommodates diverse editing requirements. Third, we present a structured sparse optimization schema that identifies and deactivates malicious neurons to further disentangle the influence of untargeted attributes. S3Editor is model-agnostic and compatible with various editing approaches. Our extensive qualitative and quantitative results affirm that our approach significantly enhances identity preservation, editing fidelity, and temporal consistency.
https://arxiv.org/abs/2404.08111
Recent advances in decentralized deep learning algorithms have demonstrated cutting-edge performance on various tasks with large pre-trained models. However, a pivotal obstacle to achieving this level of competitiveness is the significant communication and computation overhead incurred when updating these models, which prohibits their application to real-world scenarios. To address this issue, drawing inspiration from advanced model merging techniques that require no additional training, we introduce the Decentralized Iterative Merging-And-Training (DIMAT) paradigm, a novel decentralized deep learning framework. Within DIMAT, each agent is trained on its local data and periodically merged with its neighboring agents using advanced model merging techniques like activation matching, until convergence is achieved. DIMAT provably converges at the best available rate for nonconvex functions with various first-order methods, while yielding tighter error bounds than popular existing approaches. We conduct a comprehensive empirical analysis to validate DIMAT's superiority over baselines across diverse computer vision tasks sourced from multiple datasets. The empirical results validate our theoretical claims by showing that DIMAT attains a faster and higher initial gain in accuracy on both independent and identically distributed (IID) and non-IID data, while incurring lower communication overhead. The DIMAT paradigm presents a new opportunity for future decentralized learning, enhancing its adaptability to the real world with sparse and lightweight communication and computation.
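The periodic merging step can be sketched with plain neighbor averaging standing in for the paper's activation-matching merge (each agent's "model" is a single scalar here, purely for illustration):

```python
def merge_round(params, neighbors):
    """One merging round on a communication graph: every agent
    replaces its parameter with the average over itself and its
    neighbors. `params[i]` is agent i's parameter; `neighbors[i]`
    lists the indices of agent i's neighbors."""
    return [
        sum(params[j] for j in [i] + neighbors[i]) / (1 + len(neighbors[i]))
        for i in range(len(params))
    ]
```

On a fully connected graph, a single round already reaches the consensus mean; on sparser graphs, repeated local-train-then-merge rounds converge toward it, which is the intuition behind DIMAT's iterative scheme.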
Recent advances in decentralized deep learning algorithms have demonstrated cutting-edge performance on various tasks. However, a pivotal prerequisite for this level of competitiveness is the significant communication and computation overhead incurred when updating these models, which prohibits their application to real-world scenarios. To address this issue, we draw inspiration from advanced model merging techniques that require no additional training and introduce the Decentralized Iterative Merging-And-Training (DIMAT) paradigm, a novel decentralized deep learning framework. In DIMAT, each agent trains on its local data and periodically merges with its neighboring agents using advanced model merging techniques (such as activation matching) until convergence. DIMAT provably converges at the best available rate for various first-order methods while yielding tighter error bounds than popular existing approaches. We conduct a comprehensive empirical analysis of DIMAT's superiority over baselines on diverse computer vision tasks drawn from multiple datasets. The empirical results support our theoretical claims, showing that DIMAT attains a faster and higher initial gain in accuracy with both independent and identically distributed (IID) and non-IID data while incurring lower communication overhead. The DIMAT paradigm presents a new opportunity for future decentralized learning, enhancing its adaptability to real-world scenarios through sparse and lightweight communication and computation.
https://arxiv.org/abs/2404.08079
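The DIMAT loop alternates local first-order training with periodic neighbor merging. A minimal sketch, using plain parameter averaging as a stand-in for the paper's advanced merging techniques (e.g. activation matching); all function names are illustrative:

```python
import numpy as np

def local_step(w, grad_fn, lr=0.1):
    # One local first-order update on the agent's own objective.
    return w - lr * grad_fn(w)

def merge_with_neighbors(weights, adjacency, i):
    # Stand-in for advanced model merging: average agent i's parameters
    # with those of its neighbors (adjacency[i][i] = 1 includes itself).
    neigh = [j for j in range(len(weights)) if adjacency[i][j]]
    return np.mean([weights[j] for j in neigh], axis=0)

def dimat(init_weights, grad_fns, adjacency, rounds=50, local_steps=5):
    """Hypothetical sketch of the DIMAT paradigm: each agent trains
    locally, then periodically merges with its neighbors."""
    weights = [w.copy() for w in init_weights]
    for _ in range(rounds):
        for i in range(len(weights)):          # local training phase
            for _ in range(local_steps):
                weights[i] = local_step(weights[i], grad_fns[i])
        weights = [merge_with_neighbors(weights, adjacency, i)
                   for i in range(len(weights))]  # periodic merging phase
    return weights
```

With two fully connected agents whose local quadratics have minima at 1 and 3, the merged iterates contract geometrically toward the consensus optimum at 2.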
We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging the inconsistent 2D masks predicted by zero-shot segmentation models. In contrast to prior 3D scene segmentation approaches that rely heavily on video object tracking, Gaga utilizes spatial information to effectively associate object masks across diverse camera poses. By eliminating the assumption of continuous view changes in the training images, Gaga is robust to variations in camera pose, which is particularly beneficial for sparsely sampled images and ensures precise mask label consistency. Furthermore, Gaga accommodates 2D segmentation masks from diverse sources and performs robustly with different open-world zero-shot segmentation models, enhancing its versatility. Extensive qualitative and quantitative evaluations demonstrate that Gaga performs favorably against state-of-the-art methods, emphasizing its potential for real-world applications such as scene understanding and manipulation.
We present Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging the inconsistent 2D masks predicted by zero-shot segmentation models. In contrast to prior 3D scene segmentation methods, Gaga utilizes spatial information to effectively associate object masks across different camera poses. By eliminating the assumption of continuous view changes in the training images, Gaga demonstrates robustness to camera pose variations, which is especially beneficial for sparsely sampled images, and ensures precise mask label consistency. Moreover, Gaga accommodates 2D segmentation masks from diverse sources and remains robust across different open-world zero-shot segmentation models, enhancing its versatility. Extensive qualitative and quantitative evaluations confirm that Gaga performs favorably against state-of-the-art methods, highlighting its potential for real-world applications such as scene understanding and manipulation.
https://arxiv.org/abs/2404.07977
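Gaga's core idea of using spatial information to reconcile inconsistent per-view masks can be illustrated with a toy majority-vote scheme: masks that cover the same 3D points receive the same global label. This is a hypothetical sketch under simplified assumptions (known point-to-pixel projections), not the paper's actual pipeline:

```python
from collections import Counter

def associate_masks(view_masks, projections):
    """Toy spatial mask association (illustrative only).
    view_masks[v]: {local_mask_id: set of pixels} from a zero-shot segmenter.
    projections[v]: {point_id: pixel} for 3D points visible in view v."""
    global_label = {}   # 3D point id -> globally consistent object label
    next_label = 0
    consistent = []
    for v, masks in enumerate(view_masks):
        relabeled = {}
        for local_id, pixels in masks.items():
            # 3D points whose projection in this view lands inside the mask.
            pts = [p for p, px in projections[v].items() if px in pixels]
            votes = Counter(global_label[p] for p in pts if p in global_label)
            if votes:
                label = votes.most_common(1)[0][0]  # reuse the majority label
            else:
                label, next_label = next_label, next_label + 1  # new object
            for p in pts:
                global_label.setdefault(p, label)
            relabeled[local_id] = label
        consistent.append(relabeled)
    return consistent
```

Because the vote is anchored in 3D points rather than temporal adjacency, the relabeling does not depend on views arriving in any continuous order.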
Lane detection is a fundamental task in autonomous driving and has achieved great progress with the emergence of deep learning. Previous anchor-based methods often design dense anchors, which depend heavily on the training dataset and remain fixed during inference. Our analysis shows that dense anchors are not necessary for lane detection, and we propose a transformer-based lane detection framework built on a sparse anchor mechanism. To this end, we generate sparse anchors with position-aware lane queries and angle queries instead of traditional explicit anchors. We adopt Horizontal Perceptual Attention (HPA) to aggregate the lane features along the horizontal direction, and Lane-Angle Cross Attention (LACA) to perform interactions between lane queries and angle queries. We also propose Lane Perceptual Attention (LPA), based on deformable cross attention, to further refine the lane predictions. Our method, named Sparse Laneformer, is easy to implement and end-to-end trainable. Extensive experiments demonstrate that Sparse Laneformer performs favorably against the state-of-the-art methods, e.g., surpassing Laneformer by 3.0% F1 score and O2SFormer by 0.7% F1 score with fewer MACs on CULane with the same ResNet-34 backbone.
Lane detection is a fundamental task in autonomous driving and has achieved great progress with the rise of deep learning. Previous anchor-based methods typically design dense anchors, which depend heavily on the training dataset and remain fixed during inference. We find that dense anchors are not necessary for lane detection, and propose a Transformer-based lane detection framework based on a sparse anchor mechanism. To this end, we generate sparse anchors from position-aware lane queries and angle queries rather than traditional explicit anchors. We adopt Horizontal Perceptual Attention (HPA) to aggregate lane features along the horizontal direction, and Lane-Angle Cross Attention (LACA) to perform interactions between lane queries and angle queries. We also propose Lane Perceptual Attention (LPA), based on deformable cross attention, to further refine the lane predictions. Our method, named Sparse Laneformer, is easy to implement and end-to-end trainable. Extensive experiments show that Sparse Laneformer outperforms state-of-the-art methods, e.g., surpassing Laneformer by 3.0% F1 score and O2SFormer by 0.7% F1 score with fewer MACs on CULane with the same ResNet-34 backbone.
https://arxiv.org/abs/2404.07821
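As a rough illustration of aggregating features along the horizontal direction, the sketch below applies a toy single-head attention independently to each row of a feature map. It is only a stand-in for the paper's HPA module, and the projection weights are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def horizontal_attention(feat, wq, wk, wv):
    """Toy row-wise attention (illustrative stand-in for HPA):
    each row of the (H, W, C) feature map attends only within itself,
    aggregating features along the horizontal direction."""
    H, W, C = feat.shape
    out = np.empty_like(feat)
    for y in range(H):
        q, k, v = feat[y] @ wq, feat[y] @ wk, feat[y] @ wv   # (W, C) each
        attn = softmax(q @ k.T / np.sqrt(C), axis=-1)        # (W, W) weights
        out[y] = attn @ v
    return out
```

Restricting attention to a single row keeps the cost at O(H·W²) rather than O((H·W)²) for full self-attention, which is one motivation for direction-wise attention designs.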