Large Multimodal Models (LMMs) have recently gained prominence in autonomous driving research, showcasing promising capabilities across various emerging benchmarks. LMMs specifically designed for this domain have demonstrated effective perception, planning, and prediction skills. However, many of these methods underutilize 3D spatial and temporal elements, relying mainly on image data. As a result, their effectiveness in dynamic driving environments is limited. We propose to integrate tracking information as an additional input to recover 3D spatial and temporal details that are not effectively captured in the images. We introduce a novel approach for embedding this tracking information into LMMs to enhance their spatiotemporal understanding of driving scenarios. By incorporating 3D tracking data through a track encoder, we enrich visual queries with crucial spatial and temporal cues while avoiding the computational overhead associated with processing lengthy video sequences or extensive 3D inputs. Moreover, we employ a self-supervised approach to pretrain the tracking encoder to provide LMMs with additional contextual information, significantly improving their performance in perception, planning, and prediction tasks for autonomous driving. Experimental results demonstrate the effectiveness of our approach, with a gain of 9.5% in accuracy, an increase of 7.04 points in the ChatGPT score, and a 9.4% increase in the overall score over baseline models on the DriveLM-nuScenes benchmark, along with a 3.7% final score improvement on DriveLM-CARLA. Our code is available at this https URL
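To make the fusion idea concrete, here is a minimal sketch of how per-object 3D track states could be encoded into tokens and used to enrich visual query tokens via cross-attention. The layer layout, state dimensions, and pooling below are illustrative assumptions, not the paper's actual track encoder.

```python
import torch
import torch.nn as nn

class TrackEncoder(nn.Module):
    """Encode per-object 3D track states (e.g., x, y, z, vx, vy, vz, yaw) over T steps
    into one token per tracked object. Dimensions are illustrative assumptions."""
    def __init__(self, state_dim=7, d_model=256, num_layers=2):
        super().__init__()
        self.proj = nn.Linear(state_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, tracks):                      # tracks: (B, N_obj, T, state_dim)
        B, N, T, _ = tracks.shape
        x = self.proj(tracks).reshape(B * N, T, -1) # temporal encoding per object
        x = self.temporal(x)
        return x.mean(dim=1).reshape(B, N, -1)      # one token per object

class TrackAwareQueryFusion(nn.Module):
    """Enrich visual query tokens with track tokens via cross-attention, so the LMM
    receives spatiotemporal cues without consuming long video or raw 3D input."""
    def __init__(self, d_model=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_queries, track_tokens):  # (B, Q, d), (B, N_obj, d)
        attended, _ = self.cross_attn(visual_queries, track_tokens, track_tokens)
        return self.norm(visual_queries + attended)   # residual fusion

# Toy usage with random stand-ins for real detector tracks and vision queries.
tracks = torch.randn(2, 12, 6, 7)          # 2 scenes, 12 objects, 6 timesteps
visual_queries = torch.randn(2, 32, 256)   # 32 visual query tokens per scene
fused = TrackAwareQueryFusion()(visual_queries, TrackEncoder()(tracks))
print(fused.shape)                         # torch.Size([2, 32, 256])
```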
https://arxiv.org/abs/2503.14498
Text-to-video (T2V) generation has made significant strides with diffusion models. However, existing methods still struggle with accurately binding attributes, determining spatial relationships, and capturing complex action interactions between multiple subjects. To address these limitations, we propose MagicComp, a training-free method that enhances compositional T2V generation through dual-phase refinement. Specifically, (1) during the conditioning stage, we introduce Semantic Anchor Disambiguation to reinforce subject-specific semantics and resolve inter-subject ambiguity by progressively injecting the directional vectors of semantic anchors into the original text embedding; (2) during the denoising stage, we propose Dynamic Layout Fusion Attention, which integrates grounding priors and model-adaptive spatial perception to flexibly bind subjects to their spatiotemporal regions through masked attention modulation. Furthermore, MagicComp is a model-agnostic and versatile approach, which can be seamlessly integrated into existing T2V architectures. Extensive experiments on T2V-CompBench and VBench demonstrate that MagicComp outperforms state-of-the-art methods, highlighting its potential for applications such as complex prompt-based and trajectory-controllable video generation. Project page: this https URL.
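As a rough illustration of the anchor-injection idea, the sketch below nudges each subject's token embeddings toward its semantic anchor along a normalized direction, with a strength that grows over conditioning steps. The blending schedule, masking scheme, and dimensions are assumptions rather than MagicComp's exact formulation.

```python
import torch
import torch.nn.functional as F

def inject_semantic_anchors(text_emb, anchor_emb, token_to_subject, step, total_steps,
                            max_strength=0.3):
    """
    text_emb:         (L, D) token embeddings of the full prompt.
    anchor_emb:       (S, D) one embedding per subject phrase (the "semantic anchors").
    token_to_subject: (L,) index of the subject each token belongs to, or -1 for none.
    step/total_steps: progressive injection: strength grows as conditioning proceeds.
    """
    strength = max_strength * (step + 1) / total_steps
    out = text_emb.clone()
    for s in range(anchor_emb.shape[0]):
        mask = token_to_subject == s
        if not mask.any():
            continue
        # Directional vector from the current token embedding toward its anchor.
        direction = F.normalize(anchor_emb[s] - out[mask], dim=-1)
        out[mask] = out[mask] + strength * direction * out[mask].norm(dim=-1, keepdim=True)
    return out

# Toy example: 8 prompt tokens, 2 subjects, 77-dim embeddings (stand-in for CLIP's width).
text_emb = torch.randn(8, 77)
anchors = torch.randn(2, 77)
assignment = torch.tensor([-1, 0, 0, -1, 1, 1, -1, -1])
refined = inject_semantic_anchors(text_emb, anchors, assignment, step=3, total_steps=10)
```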
https://arxiv.org/abs/2503.14428
Medical image segmentation aims to identify anatomical structures at the voxel level. Segmentation accuracy relies on distinguishing voxel differences. Compared with the advances achieved in studying inter-class variance, intra-class variance has received less attention. Moreover, traditional linear classifiers, limited by a single learnable weight per class, struggle to capture this finer distinction. To address the above challenges, we propose a Multi-Prototype-based Embedding Refinement method for semi-supervised medical image segmentation. Specifically, we design a multi-prototype-based classification strategy, rethinking segmentation from the perspective of structural relationships between voxel embeddings. The intra-class variations are explored by clustering voxels along the distribution of multiple prototypes in each class. Next, we introduce a consistency constraint to alleviate the limitation of linear classifiers. This constraint integrates different classification granularities from a linear classifier and the proposed prototype-based classifier. In a thorough evaluation on two popular benchmarks, our method achieves superior performance compared with state-of-the-art methods. Code is available at this https URL.
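A minimal sketch of a multi-prototype classifier in this spirit: each class holds K learnable prototypes, and a voxel is scored by its best-matching prototype, so separate prototypes can absorb different intra-class modes. The use of cosine similarity, max-aggregation, and the temperature are assumed choices, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPrototypeClassifier(nn.Module):
    """Score voxel embeddings against K learnable prototypes per class.
    A voxel's class score is the max similarity over that class's prototypes,
    so different prototypes can capture different intra-class modes."""
    def __init__(self, num_classes, num_prototypes, embed_dim, temperature=0.1):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, num_prototypes, embed_dim))
        self.temperature = temperature

    def forward(self, voxel_emb):                       # (N_voxels, embed_dim)
        protos = F.normalize(self.prototypes, dim=-1)   # (C, K, D)
        emb = F.normalize(voxel_emb, dim=-1)            # (N, D)
        sim = torch.einsum("nd,ckd->nck", emb, protos)  # cosine sim to every prototype
        return sim.max(dim=-1).values / self.temperature  # (N, C) class logits

# Toy usage: 1000 voxel embeddings, 3 classes, 4 prototypes each. A consistency term
# against a linear classifier's logits could be added as a KL divergence on softmaxes.
clf = MultiPrototypeClassifier(num_classes=3, num_prototypes=4, embed_dim=64)
logits = clf(torch.randn(1000, 64))
loss = F.cross_entropy(logits, torch.randint(0, 3, (1000,)))
```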
https://arxiv.org/abs/2503.14343
Deep learning-based segmentation of genito-pelvic structures in MRI and CT is crucial for applications such as radiation therapy, surgical planning, and disease diagnosis. However, existing segmentation models often struggle with generalizability across imaging modalities and anatomical variations. In this work, we propose RoMedFormer, a rotary-embedding transformer-based foundation model designed for 3D female genito-pelvic structure segmentation in both MRI and CT. RoMedFormer leverages self-supervised learning and rotary positional embeddings to enhance spatial feature representation and capture long-range dependencies in 3D medical data. We pre-train our model using a diverse dataset of 3D MRI and CT scans and fine-tune it for downstream segmentation tasks. Experimental results demonstrate that RoMedFormer achieves superior performance in segmenting genito-pelvic organs. Our findings highlight the potential of transformer-based architectures in medical image segmentation and pave the way for more transferable segmentation frameworks.
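For reference, the sketch below applies a standard 1D rotary positional embedding (RoPE) to query/key vectors. How RoMedFormer extends rotary embeddings to 3D volumes is not shown here; anything beyond the textbook formulation is an assumption.

```python
import torch

def rotary_embedding(x, base=10000.0):
    """Apply rotary position embedding to x of shape (batch, seq_len, dim).
    Channel pairs are rotated by an angle that grows with token position, so
    relative offsets are encoded directly in the query/key dot products."""
    b, n, d = x.shape
    assert d % 2 == 0
    half = d // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(n, dtype=torch.float32)[:, None] * freqs[None, :]  # (n, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Toy usage: rotate queries and keys of a flattened 3D patch sequence.
q = rotary_embedding(torch.randn(2, 512, 64))
k = rotary_embedding(torch.randn(2, 512, 64))
attn_scores = q @ k.transpose(-2, -1) / 64 ** 0.5
```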
https://arxiv.org/abs/2503.14304
Recent advances in Text-to-Image (T2I) diffusion models have transformed image generation, enabling significant progress in stylized generation using only a few style reference images. However, current diffusion-based methods struggle with fine-grained style customization due to challenges in controlling multiple style attributes, such as color and texture. This paper introduces the first tuning-free approach to achieve free-lunch color-texture disentanglement in stylized T2I generation, addressing the need for independently controlled style elements for the Disentangled Stylized Image Generation (DisIG) problem. Our approach leverages the Image-Prompt Additivity property in the CLIP image embedding space to develop techniques for separating and extracting Color-Texture Embeddings (CTE) from individual color and texture reference images. To ensure that the color palette of the generated image aligns closely with the color reference, we apply a whitening and coloring transformation to enhance color consistency. Additionally, to prevent texture loss due to the signal-leak bias inherent in diffusion training, we introduce a noise term that preserves textural fidelity during the Regularized Whitening and Coloring Transformation (RegWCT). Through these methods, our Style Attributes Disentanglement approach (SADis) delivers a more precise and customizable solution for stylized image generation. Experiments on images from the WikiArt and StyleDrop datasets demonstrate that, both qualitatively and quantitatively, SADis surpasses state-of-the-art stylization methods in the DisIG task.
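The whitening-and-coloring step can be illustrated with the classic covariance-matching transform below, which aligns a content feature's channel statistics to those of the color reference. The added noise term is only hinted at via an assumed noise_scale and is not the paper's RegWCT derivation.

```python
import torch

def whitening_coloring(content_feat, style_feat, noise_scale=0.0, eps=1e-5):
    """
    content_feat, style_feat: (C, H*W) feature (or pixel) matrices.
    Whitens the content statistics, then colors them with the reference covariance,
    so the output's channel correlations match the color reference.
    `noise_scale` is an assumed stand-in for a regularizing noise term.
    """
    def _center(x):
        mean = x.mean(dim=1, keepdim=True)
        return x - mean, mean

    c, c_mean = _center(content_feat)
    s, s_mean = _center(style_feat)

    # Whitening: remove the content covariance.
    cov_c = c @ c.t() / (c.shape[1] - 1) + eps * torch.eye(c.shape[0])
    e_c, v_c = torch.linalg.eigh(cov_c)
    whiten = v_c @ torch.diag(e_c.clamp(min=eps).rsqrt()) @ v_c.t()

    # Coloring: impose the reference covariance.
    cov_s = s @ s.t() / (s.shape[1] - 1) + eps * torch.eye(s.shape[0])
    e_s, v_s = torch.linalg.eigh(cov_s)
    color = v_s @ torch.diag(e_s.clamp(min=eps).sqrt()) @ v_s.t()

    out = color @ (whiten @ c) + s_mean
    return out + noise_scale * torch.randn_like(out)

# Toy usage on random 3-channel "images" flattened to (C, N).
content = torch.rand(3, 64 * 64)
reference = torch.rand(3, 64 * 64)
recolored = whitening_coloring(content, reference, noise_scale=0.01)
```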
https://arxiv.org/abs/2503.14275
Recent advances in Neural Radiance Fields (NeRF) have shown great potential in 3D reconstruction and novel view synthesis, particularly for indoor and small-scale scenes. However, extending NeRF to large-scale outdoor environments presents challenges such as transient objects, sparse cameras and textures, and varying lighting conditions. In this paper, we propose a segmentation-guided enhancement to NeRF for outdoor street scenes, focusing on complex urban environments. Our approach extends ZipNeRF and utilizes Grounded SAM for segmentation mask generation, enabling effective handling of transient objects, modeling of the sky, and regularization of the ground. We also introduce appearance embeddings to adapt to inconsistent lighting across view sequences. Experimental results demonstrate that our method outperforms the baseline ZipNeRF, improving novel view synthesis quality with fewer artifacts and sharper details.
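A hedged sketch of how segmentation masks might gate the training objective: transient pixels are excluded from the photometric term, sky rays are pushed toward zero accumulated density, and ground rays receive a crude flatness prior. The specific sky/ground regularizers and weights here are assumptions, not ZipNeRF's or the paper's exact losses.

```python
import torch

def masked_nerf_losses(pred_rgb, gt_rgb, pred_depth, accumulation, masks,
                       w_sky=0.1, w_ground=0.05):
    """
    pred_rgb, gt_rgb: (R, 3) per-ray colors;  pred_depth, accumulation: (R,)
    masks: dict of boolean (R,) masks from a segmenter: 'transient', 'sky', 'ground'.
    """
    static = ~masks["transient"]                      # ignore moving objects entirely
    photometric = ((pred_rgb - gt_rgb) ** 2)[static].mean()

    # Sky rays should accumulate (almost) no density along the ray.
    sky_loss = accumulation[masks["sky"]].mean() if masks["sky"].any() else pred_rgb.sum() * 0

    # Ground rays: penalize depth variation within the batch as a crude flatness prior.
    ground_depth = pred_depth[masks["ground"]]
    ground_loss = ground_depth.var() if ground_depth.numel() > 1 else pred_rgb.sum() * 0

    return photometric + w_sky * sky_loss + w_ground * ground_loss

# Toy usage with 1024 random rays.
R = 1024
masks = {k: torch.rand(R) < p for k, p in
         [("transient", 0.1), ("sky", 0.2), ("ground", 0.3)]}
loss = masked_nerf_losses(torch.rand(R, 3), torch.rand(R, 3),
                          torch.rand(R), torch.rand(R), masks)
```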
https://arxiv.org/abs/2503.14219
In end-to-end speech translation, acoustic representations learned by the encoder are usually fixed and static from the perspective of the decoder, which is not desirable for dealing with the cross-modal and cross-lingual challenges in speech translation. In this paper, we show the benefits of varying acoustic states according to decoder hidden states and propose an adaptive speech-to-text translation model that is able to dynamically adapt acoustic states in the decoder. We concatenate the acoustic state and target word embedding sequence and feed the concatenated sequence into subsequent blocks in the decoder. In order to model the deep interaction between acoustic states and target hidden states, a speech-text mixed attention sublayer is introduced to replace the conventional cross-attention network. Experimental results on two widely used datasets show that the proposed method significantly outperforms state-of-the-art neural speech translation models.
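The concatenation idea can be sketched as a single attention sublayer over the joint acoustic-plus-text sequence, standing in for separate self- and cross-attention. The masking scheme below (causal over text, text attends to all audio, audio does not attend to text) is an assumption about one plausible wiring, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class SpeechTextMixedAttention(nn.Module):
    """One self-attention sublayer over the concatenation of acoustic states and
    target embeddings, replacing separate self- and cross-attention."""
    def __init__(self, d_model=512, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, acoustic, target_emb):   # (B, A, d), (B, T, d)
        A, T = acoustic.shape[1], target_emb.shape[1]
        x = torch.cat([acoustic, target_emb], dim=1)        # (B, A+T, d)

        mask = torch.zeros(A + T, A + T, dtype=torch.bool)  # False = may attend
        mask[A:, A:] = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        mask[:A, A:] = True                                  # audio does not peek at text

        out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm(x + out)
        return x[:, :A], x[:, A:]                            # updated acoustic & text states

# Toy usage.
layer = SpeechTextMixedAttention()
audio_states, text_states = layer(torch.randn(2, 120, 512), torch.randn(2, 20, 512))
```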
https://arxiv.org/abs/2503.14185
Automated Face Recognition Systems (FRSs), developed using deep learning models, are deployed worldwide for identity verification and facial attribute analysis. The performance of these models is determined by a complex interdependence among the model architecture, optimization/loss function and datasets. Although FRSs have surpassed human-level accuracy, they continue to exhibit disparities against certain demographics. Due to the ubiquity of applications, it is extremely important to understand the impact of the three components (model architecture, loss function and face image dataset) on the accuracy-disparity trade-off to design better, unbiased platforms. In this work, we perform an in-depth analysis of three FRSs for the task of gender prediction, with various architectural modifications resulting in ten deep-learning models coupled with four loss functions, and benchmark them on seven face datasets across 266 evaluation configurations. Our results show that all three components have an individual as well as a combined impact on both accuracy and disparity. We identify that datasets have an inherent property that causes them to perform similarly across models, independent of the choice of loss functions. Moreover, the choice of dataset determines the model's perceived bias: the same model reports bias in opposite directions for three gender-balanced datasets of "in-the-wild" face images of popular individuals. Studying the facial embeddings shows that the models are unable to generalize a uniform definition of what constitutes a "female face" as opposed to a "male face", due to dataset diversity. We provide recommendations to model developers on using our study as a blueprint for model development and subsequent deployment.
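As a small illustration of the accuracy-disparity bookkeeping such a study implies, the snippet below computes per-group accuracy and a max-min accuracy gap. The grouping variable and this particular disparity definition are assumptions; the paper's evaluation protocol may differ.

```python
import numpy as np

def accuracy_disparity(y_true, y_pred, group):
    """Return overall accuracy, per-group accuracy, and the max-min accuracy gap
    (one common disparity definition; the paper may use another)."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    per_group = {g: float((y_pred[group == g] == y_true[group == g]).mean())
                 for g in np.unique(group)}
    overall = float((y_pred == y_true).mean())
    disparity = max(per_group.values()) - min(per_group.values())
    return overall, per_group, disparity

# Toy usage: gender prediction scored separately for two demographic groups.
y_true = np.random.randint(0, 2, 1000)
y_pred = np.random.randint(0, 2, 1000)
group = np.random.choice(["A", "B"], 1000)
print(accuracy_disparity(y_true, y_pred, group))
```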
https://arxiv.org/abs/2503.14138
Existing 3D Human Pose Estimation (HPE) methods achieve high accuracy but suffer from computational overhead and slow inference, while knowledge distillation methods fail to address spatial relationships between joints and temporal correlations in multi-frame inputs. In this paper, we propose Sparse Correlation and Joint Distillation (SCJD), a novel framework that balances efficiency and accuracy for 3D HPE. SCJD introduces Sparse Correlation Input Sequence Downsampling to reduce redundancy in student network inputs while preserving inter-frame correlations. For effective knowledge transfer, we propose Dynamic Joint Spatial Attention Distillation, which includes Dynamic Joint Embedding Distillation to enhance the student's feature representation using the teacher's multi-frame context feature, and Adjacent Joint Attention Distillation to improve the student network's focus on adjacent joint relationships for better spatial understanding. Additionally, Temporal Consistency Distillation aligns the temporal correlations between teacher and student networks through upsampling and global supervision. Extensive experiments demonstrate that SCJD achieves state-of-the-art performance. Code is available at this https URL.
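A hedged sketch of how the three distillation terms could be combined: matching student joint embeddings to the teacher's on the frames the student actually saw, matching adjacent-joint attention maps, and aligning frame-to-frame dynamics after upsampling the student's sparser sequence. The loss forms, weights, and interpolation choice are assumptions, not SCJD's exact objectives.

```python
import torch
import torch.nn.functional as F

def scjd_style_distill_loss(student_feat, teacher_feat, student_attn, teacher_attn,
                            w_embed=1.0, w_attn=0.5, w_temp=0.5):
    """
    student_feat: (B, T_s, J, D) features from downsampled input frames.
    teacher_feat: (B, T_t, J, D) features from the full sequence (T_t > T_s).
    student_attn, teacher_attn: (B, J, J) attention over adjacent joints.
    """
    # 1) Joint embedding distillation on the frames the student actually saw.
    idx = torch.linspace(0, teacher_feat.shape[1] - 1, student_feat.shape[1]).long()
    embed_loss = F.mse_loss(student_feat, teacher_feat[:, idx])

    # 2) Adjacent-joint attention distillation.
    attn_loss = F.kl_div(student_attn.log_softmax(-1), teacher_attn.softmax(-1),
                         reduction="batchmean")

    # 3) Temporal consistency: upsample the student to the teacher's length and
    #    match frame-to-frame differences (a proxy for temporal correlation).
    up = F.interpolate(student_feat.flatten(2).transpose(1, 2),   # (B, J*D, T_s)
                       size=teacher_feat.shape[1], mode="linear",
                       align_corners=False).transpose(1, 2)        # (B, T_t, J*D)
    temp_loss = F.mse_loss(up.diff(dim=1), teacher_feat.flatten(2).diff(dim=1))

    return w_embed * embed_loss + w_attn * attn_loss + w_temp * temp_loss

# Toy usage: teacher sees 81 frames, student sees 27, 17 joints, 64-dim features.
loss = scjd_style_distill_loss(torch.randn(2, 27, 17, 64), torch.randn(2, 81, 17, 64),
                               torch.randn(2, 17, 17), torch.randn(2, 17, 17))
```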
https://arxiv.org/abs/2503.14097
In this work, we introduce a Distributed Quantum Long Short-Term Memory (QLSTM) framework that leverages modular quantum computing to address scalability challenges on Noisy Intermediate-Scale Quantum (NISQ) devices. By embedding variational quantum circuits into LSTM cells, the QLSTM captures long-range temporal dependencies, while a distributed architecture partitions the underlying Variational Quantum Circuits (VQCs) into smaller, manageable subcircuits that can be executed on a network of quantum processing units. We assess the proposed framework using nontrivial benchmark problems such as damped harmonic oscillators and Nonlinear Autoregressive Moving Average sequences. Our results demonstrate that the distributed QLSTM achieves stable convergence and improved training dynamics compared to classical approaches. This work underscores the potential of modular, distributed quantum computing architectures for large-scale sequence modelling, providing a foundation for the future integration of hybrid quantum-classical solutions into advanced Quantum High-performance computing (HPC) ecosystems.
https://arxiv.org/abs/2503.14088
This work focuses on full-body co-speech gesture generation. Existing methods typically employ an autoregressive model accompanied by vector-quantized tokens for gesture generation, which results in information loss and compromises the realism of the generated gestures. To address this, inspired by the natural continuity of real-world human motion, we propose MAG, a novel multi-modal aligned framework for high-quality and diverse co-speech gesture synthesis without relying on discrete tokenization. Specifically, (1) we introduce a motion-text-audio-aligned variational autoencoder (MTA-VAE), which leverages pre-trained WavCaps' text and audio embeddings to enhance both semantic and rhythmic alignment with motion, ultimately producing more realistic gestures. (2) Building on this, we propose a multimodal masked autoregressive model (MMAG) that enables autoregressive modeling in continuous motion embeddings through diffusion without vector quantization. To further ensure multi-modal consistency, MMAG incorporates a hybrid-granularity audio-text fusion block, which serves as conditioning for the diffusion process. Extensive experiments on two benchmark datasets demonstrate that MAG achieves state-of-the-art performance both quantitatively and qualitatively, producing highly realistic and diverse co-speech gestures. The code will be released to facilitate future research.
https://arxiv.org/abs/2503.14040
Generative models have recently made remarkable progress in the field of 3D objects. However, their practical application in fields like engineering remains limited since they fail to deliver the accuracy, quality, and controllability needed for domain-specific tasks. Fine-tuning large generative models is a promising perspective for making these models available in these fields. Creating high-quality, domain-specific 3D datasets is crucial for fine-tuning large generative models, yet the data filtering and annotation process remains a significant bottleneck. We present MeshFleet, a filtered and annotated 3D vehicle dataset extracted from Objaverse-XL, the most extensive publicly available collection of 3D objects. Our approach proposes a pipeline for automated data filtering based on a quality classifier. This classifier is trained on a manually labeled subset of Objaverse, incorporating DINOv2 and SigLIP embeddings, refined through caption-based analysis and uncertainty estimation. We demonstrate the efficacy of our filtering method through a comparative analysis against caption and image aesthetic score-based techniques and fine-tuning experiments with SV3D, highlighting the importance of targeted data selection for domain-specific 3D generative modeling.
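The filtering classifier can be pictured as in the sketch below: concatenate per-object DINOv2 and SigLIP embeddings, fit a classifier on the manually labeled subset, and keep only confidently high-quality objects while flagging uncertain ones for review. Embedding dimensions, the classifier choice, and the thresholds are placeholders, not MeshFleet's actual pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for per-object DINOv2 and SigLIP embeddings
# of the manually labeled Objaverse subset (dimensions are assumptions).
n, d_dino, d_siglip = 2000, 768, 1152
dino = np.random.randn(n, d_dino)
siglip = np.random.randn(n, d_siglip)
labels = np.random.randint(0, 2, n)          # 1 = high-quality vehicle, 0 = reject

features = np.concatenate([dino, siglip], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(features, labels, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))

# Uncertainty-aware filtering: keep confidently good objects, flag the rest for review.
proba = clf.predict_proba(features)[:, 1]
keep = proba > 0.9
review = (proba > 0.4) & (proba <= 0.9)
```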
https://arxiv.org/abs/2503.14002
Top-leading solutions for Video Scene Graph Generation (VSGG) typically adopt an offline pipeline. Though demonstrating promising performance, they remain unable to handle real-time video streams and consume large GPU memory. Moreover, these approaches fall short in temporal reasoning, merely aggregating frame-level predictions over a temporal context. In response, we introduce DIFFVSGG, an online VSGG solution that frames this task as an iterative scene graph update problem. Drawing inspiration from Latent Diffusion Models (LDMs), which generate images by denoising a latent feature embedding, we unify the decoding of three tasks, namely object classification, bounding box regression, and graph generation, using one shared feature embedding. Then, given an embedding containing unified features of object pairs, we conduct step-wise denoising on it within LDMs, so as to deliver a clean embedding that clearly indicates the relationships between objects. This embedding then serves as the input to task-specific heads for object classification, scene graph generation, etc. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs to guide the reverse diffusion process for current frames. Extensive experiments on three setups of Action Genome demonstrate the superiority of DIFFVSGG.
https://arxiv.org/abs/2503.13957
3D Gaussian Splatting (3DGS) has emerged as an efficient and high-fidelity paradigm for novel view synthesis. To adapt 3DGS for dynamic content, deformable 3DGS incorporates temporally deformable primitives with learnable latent embeddings to capture complex motions. Despite its impressive performance, the high-dimensional embeddings and vast number of primitives lead to substantial storage requirements. In this paper, we introduce a Lightweight 4DGS framework, called Light4GS, that employs significance pruning with a deep context model to provide a lightweight, storage-efficient dynamic 3DGS representation. The proposed Light4GS is based on 4DGS, a typical representation of deformable 3DGS. Specifically, our framework is built upon two core components: (1) a spatio-temporal significance pruning strategy that eliminates over 64% of the deformable primitives, followed by an entropy-constrained spherical harmonics compression applied to the remainder; and (2) a deep context model that integrates intra- and inter-prediction with a hyperprior into a coarse-to-fine context structure to enable efficient multiscale latent embedding compression. Our approach achieves over 120x compression and increases rendering FPS by up to 20% compared to the baseline 4DGS, and is also superior to state-of-the-art frame-wise 3DGS compression methods, demonstrating the effectiveness of Light4GS's intra- and inter-prediction designs without sacrificing rendering quality.
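A minimal sketch of significance pruning under assumed scoring: each primitive gets a significance score (here opacity times an accumulated rendering contribution) and only the top fraction is kept, consistent with the abstract's report that over 64% of deformable primitives are eliminated. The scoring rule itself is an assumption.

```python
import torch

def prune_primitives(opacity, contribution, keep_ratio=0.36):
    """
    opacity:      (N,) per-primitive opacity.
    contribution: (N,) accumulated rendering contribution over sampled views/timesteps
                  (an assumed significance proxy).
    Returns a boolean keep-mask retaining roughly the top `keep_ratio` primitives.
    """
    significance = opacity * contribution
    k = max(1, int(keep_ratio * significance.numel()))
    threshold = torch.topk(significance, k).values.min()
    return significance >= threshold

# Toy usage with 100k random primitives.
opacity = torch.rand(100_000)
contribution = torch.rand(100_000)
keep = prune_primitives(opacity, contribution)
print(f"kept {keep.float().mean().item():.1%} of primitives")
```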
https://arxiv.org/abs/2503.13948
Multimodal learning that integrates histopathology images and genomic data holds great promise for cancer survival prediction. However, existing methods face key limitations: 1) They rely on multimodal mapping and metrics in Euclidean space, which cannot fully capture the hierarchical structures in histopathology (among patches from different resolutions) and genomics data (from genes to pathways). 2) They discretize survival time into independent risk intervals, which ignores its continuous and ordinal nature and fails to achieve effective optimization. 3) They treat censorship as a binary indicator, excluding censored samples from model optimization and not making full use of them. To address these challenges, we propose HySurvPred, a novel framework for survival prediction that integrates three key modules: Multimodal Hyperbolic Mapping (MHM), Angle-aware Ranking-based Contrastive Loss (ARCL) and Censor-Conditioned Uncertainty Constraint (CUC). Instead of relying on Euclidean space, we design the MHM module to explore the inherent hierarchical structures within each modality in hyperbolic space. To better integrate multimodal features in hyperbolic space, we introduce the ARCL module, which uses ranking-based contrastive learning to preserve the ordinal nature of survival time, along with the CUC module to fully explore the censored data. Extensive experiments demonstrate that our method outperforms state-of-the-art methods on five benchmark datasets. The source code is to be released.
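The hyperbolic mapping can be illustrated with the standard exponential map at the origin of the Poincare ball, plus the corresponding geodesic distance. The curvature value and the way the two modalities are compared below are generic choices, not necessarily HySurvPred's.

```python
import torch

def expmap0(x, c=1.0, eps=1e-6):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    maps Euclidean vectors into the ball, where distances grow toward the
    boundary and tree-like hierarchies embed with low distortion."""
    sqrt_c = c ** 0.5
    norm = x.norm(dim=-1, keepdim=True).clamp(min=eps)
    return torch.tanh(sqrt_c * norm) * x / (sqrt_c * norm)

def poincare_distance(u, v, c=1.0, eps=1e-6):
    """Geodesic distance on the Poincare ball (curvature -c)."""
    sqrt_c = c ** 0.5
    diff2 = (u - v).pow(2).sum(-1)
    denom = (1 - c * u.pow(2).sum(-1)).clamp(min=eps) * (1 - c * v.pow(2).sum(-1)).clamp(min=eps)
    arg = 1 + 2 * c * diff2 / denom
    return torch.acosh(arg.clamp(min=1 + eps)) / sqrt_c

# Toy usage: project pathology-patch and gene-pathway features into the same ball.
patch_feat = expmap0(torch.randn(16, 128) * 0.1)
gene_feat = expmap0(torch.randn(16, 128) * 0.1)
d = poincare_distance(patch_feat, gene_feat)
```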
https://arxiv.org/abs/2503.13862
Accurately understanding and deciding high-level meta-actions is essential for ensuring reliable and safe autonomous driving systems. While vision-language models (VLMs) have shown significant potential in various autonomous driving tasks, they often suffer from limitations such as inadequate spatial perception and hallucination, reducing their effectiveness in complex autonomous driving scenarios. To address these challenges, we propose a retrieval-augmented decision-making (RAD) framework, a novel architecture designed to enhance VLMs' capabilities to reliably generate meta-actions in autonomous driving scenes. RAD leverages a retrieval-augmented generation (RAG) pipeline to dynamically improve decision accuracy through a three-stage process consisting of the embedding flow, retrieving flow, and generating flow. Additionally, we fine-tune VLMs on a specifically curated dataset derived from the NuScenes dataset to enhance their spatial perception and bird's-eye view image comprehension capabilities. Extensive experimental evaluations on the curated NuScenes-based dataset demonstrate that RAD outperforms baseline methods across key evaluation metrics, including match accuracy, F1 score, and a self-defined overall score, highlighting its effectiveness in improving meta-action decision-making for autonomous driving tasks.
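The three-stage flow reads like standard retrieval-augmented generation; the sketch below embeds the current scene, retrieves the most similar past scenes by cosine similarity, and assembles an augmented prompt. The embedder, memory contents, and prompt format are placeholders, not the paper's pipeline.

```python
import numpy as np

def embed(text):
    """Placeholder embedder: swap in a real sentence/scene encoder here."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.standard_normal(256)
    return v / np.linalg.norm(v)

# Embedding flow: index a small memory of past scenes and their chosen meta-actions.
memory = [
    {"scene": "ego at red light, pedestrian crossing ahead", "meta_action": "stop"},
    {"scene": "clear highway, slow truck in ego lane", "meta_action": "change lane left"},
    {"scene": "approaching roundabout with yielding traffic", "meta_action": "decelerate"},
]
index = np.stack([embed(m["scene"]) for m in memory])

# Retrieving flow: cosine top-k against the current scene description.
def retrieve(query, k=2):
    sims = index @ embed(query)
    return [memory[i] for i in np.argsort(-sims)[:k]]

# Generating flow: build the augmented prompt a fine-tuned VLM would receive.
query = "dense urban street, cyclist merging from the right"
examples = retrieve(query)
prompt = "Past scenes and decisions:\n" + "\n".join(
    f"- {e['scene']} -> {e['meta_action']}" for e in examples
) + f"\nCurrent scene: {query}\nMeta-action:"
print(prompt)
```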
https://arxiv.org/abs/2503.13861
Background. Systematic reviews in comparative effectiveness research require timely evidence synthesis. Preprints accelerate knowledge dissemination but vary in quality, posing challenges for systematic reviews. Methods. We propose AutoConfidence (automated confidence assessment), an advanced framework for predicting preprint publication, which reduces reliance on manual curation and expands the range of predictors, including three key advancements: (1) automated data extraction using natural language processing techniques, (2) semantic embeddings of titles and abstracts, and (3) large language model (LLM)-driven evaluation scores. Additionally, we employed two prediction models: a random forest classifier for the binary outcome and a survival cure model that predicts both the binary outcome and publication risk over time. Results. The random forest classifier achieved AUROC 0.692 with LLM-driven scores, improving to 0.733 with semantic embeddings and 0.747 with article usage metrics. The survival cure model reached AUROC 0.716 with LLM-driven scores, improving to 0.731 with semantic embeddings. For publication risk prediction, it achieved a concordance index of 0.658, increasing to 0.667 with semantic embeddings. Conclusion. Our study advances the framework for preprint publication prediction through automated data extraction and multiple feature integration. By combining semantic embeddings with LLM-driven evaluations, AutoConfidence enhances predictive performance while reducing manual annotation burden. The framework has the potential to facilitate systematic incorporation of preprint articles in evidence-based medicine, supporting researchers in more effective evaluation and utilization of preprint resources.
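For the binary-outcome branch, the setup can be sketched with scikit-learn as below: a random forest over concatenated semantic embeddings, LLM-driven scores, and usage metrics, scored by AUROC. Feature dimensions and hyperparameters are placeholders, and the survival cure model is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder feature blocks standing in for the paper's predictors.
n = 5000
semantic_emb = np.random.randn(n, 384)       # title+abstract sentence embeddings
llm_scores = np.random.rand(n, 5)            # LLM-driven evaluation scores
usage = np.random.rand(n, 3)                 # article usage metrics
published = np.random.randint(0, 2, n)       # binary outcome: published within window

X = np.concatenate([semantic_emb, llm_scores, usage], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, published, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
auroc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"AUROC: {auroc:.3f}")
```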
https://arxiv.org/abs/2503.13857
Accurate segmentation is essential for effective treatment planning and disease monitoring. Existing medical image segmentation methods predominantly rely on uni-modal visual inputs, such as images or videos, requiring labor-intensive manual annotations. Additionally, medical imaging techniques capture multiple intertwined organs within a single scan, further complicating segmentation accuracy. To address these challenges, MedSAM, a large-scale medical segmentation model based on the Segment Anything Model (SAM), was developed to enhance segmentation accuracy by integrating image features with user-provided prompts. While MedSAM has demonstrated strong performance across various medical segmentation tasks, it primarily relies on geometric prompts (e.g., points and bounding boxes) and lacks support for text-based prompts, which could help specify subtle or ambiguous anatomical structures. To overcome these limitations, we propose the Organ-aware Multi-scale Text-guided Medical Image Segmentation Model (OMT-SAM) for multi-organ segmentation. Our approach introduces CLIP encoders as a novel image-text prompt encoder, operating with the geometric prompt encoder to provide informative contextual guidance. We pair descriptive textual prompts with corresponding images, processing them through pre-trained CLIP encoders and a cross-attention mechanism to generate fused image-text embeddings. Additionally, we extract multi-scale visual features from MedSAM, capturing fine-grained anatomical details at different levels of granularity. We evaluate OMT-SAM on the FLARE 2021 dataset, benchmarking its performance against existing segmentation methods. Empirical results demonstrate that OMT-SAM achieves a mean Dice Similarity Coefficient of 0.937, outperforming MedSAM (0.893) and other segmentation models, highlighting its superior capability in handling complex medical image segmentation tasks.
https://arxiv.org/abs/2503.13806
Ensuring robustness in image watermarking is crucial for maintaining content integrity under diverse transformations. Recent self-supervised learning (SSL) approaches, such as DINO, have been leveraged for watermarking but primarily focus on general feature representation rather than explicitly learning invariant features. In this work, we propose a novel text-guided invariant feature learning framework for robust image watermarking. Our approach leverages CLIP's multimodal capabilities, using text embeddings as stable semantic anchors to enforce feature invariance under distortions. We evaluate the proposed method across multiple datasets, demonstrating superior robustness against various image transformations. Compared to state-of-the-art SSL methods, our model achieves higher cosine similarity in feature consistency tests and outperforms existing watermarking schemes in extraction accuracy under severe distortions. These results highlight the efficacy of our method in learning invariant representations tailored for robust deep learning-based watermarking.
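A minimal sketch of the text-anchored invariance idea: features of several distorted views of the same image are pulled toward a frozen text embedding acting as a semantic anchor. The toy encoder, the distortion set, and the random stand-in for a CLIP text embedding are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InvariantWatermarkEncoder(nn.Module):
    """Tiny CNN standing in for the watermark feature extractor."""
    def __init__(self, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, out_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def text_anchored_invariance_loss(encoder, image, distort_fns, text_anchor):
    """Pull features of every distorted view toward the (frozen) text anchor,
    enforcing invariance of the learned representation under distortions."""
    anchor = F.normalize(text_anchor, dim=-1)
    views = torch.stack([fn(image) for fn in distort_fns])          # (V, B, 3, H, W)
    feats = encoder(views.flatten(0, 1))                            # (V*B, D)
    return (1 - feats @ anchor).mean()                              # 1 - cosine similarity

# Toy usage. In practice `text_anchor` would come from CLIP's text encoder
# (e.g., an embedding of the image caption); here it is a random stand-in.
encoder = InvariantWatermarkEncoder()
image = torch.rand(4, 3, 128, 128)
distortions = [lambda x: x,                                              # identity
               lambda x: torch.flip(x, dims=[-1]),                       # horizontal flip
               lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0, 1)]   # noise
text_anchor = F.normalize(torch.randn(512), dim=-1)
loss = text_anchored_invariance_loss(encoder, image, distortions, text_anchor)
loss.backward()
```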
https://arxiv.org/abs/2503.13805
We introduce the 8-Calves dataset, a benchmark for evaluating object detection and identity classification in occlusion-rich, temporally consistent environments. The dataset comprises a 1-hour video (67,760 frames) of eight Holstein Friesian calves in a barn, with ground truth bounding boxes and identities, alongside 900 static frames for detection tasks. Each calf exhibits a unique coat pattern, enabling precise identity distinction. For cow detection, we fine-tuned 28 models (25 YOLO variants, 3 transformers) on 600 frames, testing on the full video. Results reveal smaller YOLO models (e.g., YOLOV9c) outperform larger counterparts despite potential bias from a YOLOv8m-based labeling pipeline. For identity classification, embeddings from 23 pretrained vision models (ResNet, ConvNextV2, ViTs) were evaluated via linear classifiers and KNN. Modern architectures like ConvNextV2 excelled, while larger models frequently overfit, highlighting inefficiencies in scaling. Key findings include: (1) Minimal, targeted augmentations (e.g., rotation) outperform complex strategies on simpler datasets; (2) Pretraining strategies (e.g., BEiT, DinoV2) significantly boost identity recognition; (3) Temporal continuity and natural motion patterns offer unique challenges absent in synthetic or domain-specific benchmarks. The dataset's controlled design and extended sequences (1 hour vs. prior 10-minute benchmarks) make it a pragmatic tool for stress-testing occlusion handling, temporal consistency, and efficiency. The link to the dataset is this https URL.
https://arxiv.org/abs/2503.13777