We propose a novel embedding-based captioning metric, termed L-CLIPScore, that can be used both to efficiently evaluate caption quality and to train captioning models. L-CLIPScore is calculated from a lightweight CLIP (L-CLIP), a dual-encoder architecture compressed and distilled from CLIP. To compress, we apply two techniques, weight multiplexing and matrix decomposition, to reduce the parameters of the encoders and the word embedding matrix, respectively. To distill, we design a novel multi-modal Similarity Regulator (SR) loss that transfers more vision-language alignment knowledge. Specifically, the SR loss amplifies the multi-modal embedding similarity if the given image-text pair is matched and diminishes the similarity if the pair is not. By compressing and distilling with this SR loss, our L-CLIP achieves multi-modal alignment ability comparable to the original CLIP while requiring fewer computational resources and less running time. We carry out exhaustive experiments to validate the efficiency and effectiveness of L-CLIPScore as a judge of caption quality. We also find that when L-CLIPScore is used as a supervisor to train a captioning model, it should be mixed with an n-gram-based metric, and we analyze why using L-CLIPScore alone causes training to fail.
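As a rough illustration of the SR idea, the sketch below (PyTorch) builds student and teacher similarity matrices and regresses the student onto a teacher matrix whose diagonal (matched pairs) is amplified and whose off-diagonal (non-matched pairs) is diminished. The regulation strength `alpha` and the MSE form are assumptions for illustration, not the paper's exact loss.

```python
# Minimal sketch of a Similarity Regulator (SR) style distillation loss.
# `alpha` is a hypothetical regulation strength; the MSE objective is assumed.
import torch
import torch.nn.functional as F

def sr_loss(img_emb_s, txt_emb_s, img_emb_t, txt_emb_t, alpha=0.1):
    """Distill teacher (CLIP) similarities into the student (L-CLIP),
    pushing matched-pair similarity up and non-matched similarity down."""
    # Cosine-similarity matrices for student and teacher.
    sim_s = F.normalize(img_emb_s, dim=-1) @ F.normalize(txt_emb_s, dim=-1).T
    sim_t = F.normalize(img_emb_t, dim=-1) @ F.normalize(txt_emb_t, dim=-1).T

    eye = torch.eye(sim_t.size(0), device=sim_t.device)
    # Regulated target: amplify the diagonal (matched pairs) and
    # diminish the off-diagonal (non-matched pairs) of the teacher matrix.
    target = sim_t + alpha * eye - alpha * (1 - eye)
    return F.mse_loss(sim_s, target.clamp(-1.0, 1.0))
```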
https://arxiv.org/abs/2507.08710
We introduce ByDeWay, a training-free framework designed to enhance the performance of Multimodal Large Language Models (MLLMs). ByDeWay uses a novel prompting strategy called Layered-Depth-Based Prompting (LDP), which improves spatial reasoning and grounding without modifying any model parameters. It segments the scene into closest, mid-range, and farthest layers using monocular depth estimation, then generates region-specific captions with a grounded vision-language model. These structured, depth-aware captions are appended to the image-question prompt, enriching it with spatial context. This guides MLLMs to produce more grounded and less hallucinated responses. Our method is lightweight, modular, and compatible with black-box MLLMs. Experiments on hallucination-sensitive (POPE) and reasoning-intensive (GQA) benchmarks show consistent improvements across multiple MLLMs, validating the effectiveness of depth-aware prompting in a zero-training setting.
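A minimal sketch of the LDP pipeline is given below: a monocular depth map is split into closest, mid-range, and farthest masks by percentile thresholds, each region is captioned, and the captions are prepended to the question. The percentile cut-offs and the `caption_region` callable are hypothetical stand-ins for the depth estimator and grounded captioner the paper uses.

```python
# A minimal sketch of Layered-Depth-Based Prompting (LDP); thresholds and
# `caption_region` are assumptions, not the paper's components.
import numpy as np

def depth_layers(depth: np.ndarray):
    """Split a per-pixel depth map into closest / mid-range / farthest masks."""
    lo, hi = np.percentile(depth, [33, 66])
    return {
        "closest": depth <= lo,
        "mid-range": (depth > lo) & (depth <= hi),
        "farthest": depth > hi,
    }

def build_prompt(image, depth, question, caption_region):
    layers = depth_layers(depth)
    lines = [f"{name} layer: {caption_region(image, mask)}"
             for name, mask in layers.items()]
    # Depth-aware captions are prepended to the original question.
    return "\n".join(lines) + f"\nQuestion: {question}"
```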
https://arxiv.org/abs/2507.08679
Image captioning is an important problem in developing various AI systems, and the task requires large volumes of annotated images to train the models. Since all existing labelled datasets have already been used for training large Vision-Language Models (VLMs), it becomes challenging to improve their performance further. It is therefore essential to study unsupervised image captioning, which remains relatively under-explored. To that end, we propose LoGIC (Lewis Communication Game for Image Captioning), a multi-agent reinforcement learning game. The proposed method consists of two agents, a 'speaker' and a 'listener', whose objective is to learn a strategy for communicating in natural language. We train the agents in the cooperative common-reward setting using the GRPO algorithm and show that improvement in image captioning performance emerges as a consequence of the agents learning to play the game. Using a pre-trained VLM as the 'speaker' and a Large Language Model (LLM) for language understanding in the 'listener', we achieve a 46 BLEU score after fine-tuning with LoGIC without additional labels, a 2-point absolute improvement over the 44 BLEU score of the vanilla VLM. Additionally, we replace the VLM in the 'speaker' with lightweight components, (i) a ViT for image perception and (ii) GPT2 for language generation, and train them from scratch using LoGIC, obtaining a 31 BLEU score in the unsupervised setting, a 10-point advantage over existing unsupervised image-captioning methods.
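The sketch below shows one round of such a Lewis game under the shared-reward setting; `speaker` and `listener` are placeholders for the VLM and LLM, and GRPO group sampling and reward normalization are omitted.

```python
# A toy sketch of one LoGIC round in the cooperative common-reward setting.
# `speaker` and `listener` are placeholder callables, not the paper's models.
import random

def play_round(speaker, listener, images):
    target_idx = random.randrange(len(images))
    caption = speaker(images[target_idx])          # natural-language message
    guess = listener(caption, images)              # index of the guessed image
    reward = 1.0 if guess == target_idx else 0.0   # shared by both agents
    return caption, reward
```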
https://arxiv.org/abs/2507.08610
Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations: continuous, high-dimensional image features versus discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchors that facilitate cross-modal alignment. Our approach combines: (1) instance-level alignment, which fuses visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) prototype-level alignment, which clusters VSD to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM. The code and model checkpoints are available at this https URL.
https://arxiv.org/abs/2507.08590
Text-to-audio (T2A) generation has achieved promising results with the recent advances in generative models. However, because of the limited quality and quantity of temporally-aligned audio-text pairs, existing T2A methods struggle to handle the complex text prompts that contain precise timing control, e.g., "owl hooted at 2.4s-5.2s". Recent works have explored data augmentation techniques or introduced timing conditions as model inputs to enable timing-conditioned 10-second T2A generation, while their synthesis quality is still limited. In this work, we propose a novel training-free timing-controlled T2A framework, FreeAudio, making the first attempt to enable timing-controlled long-form T2A generation, e.g., "owl hooted at 2.4s-5.2s and crickets chirping at 0s-24s". Specifically, we first employ an LLM to plan non-overlapping time windows and recaption each with a refined natural language description, based on the input text and timing prompts. Then we introduce: 1) Decoupling and Aggregating Attention Control for precise timing control; 2) Contextual Latent Composition for local smoothness and Reference Guidance for global consistency. Extensive experiments show that: 1) FreeAudio achieves state-of-the-art timing-conditioned T2A synthesis quality among training-free methods and is comparable to leading training-based methods; 2) FreeAudio demonstrates comparable long-form generation quality with training-based Stable Audio and paves the way for timing-controlled long-form T2A synthesis. Demo samples are available at: this https URL
https://arxiv.org/abs/2507.08557
Low-level enhancement and high-level visual understanding in low-light vision have traditionally been treated separately. Low-light enhancement improves image quality for downstream tasks, but existing methods rely on physical or geometric priors, limiting generalization. Evaluation mainly focuses on visual quality rather than downstream performance. Low-light visual understanding, constrained by scarce labeled data, primarily uses task-specific domain adaptation, which lacks scalability. To address these challenges, we build a generalized bridge between low-light enhancement and low-light understanding, which we term Generalized Enhancement For Understanding (GEFU). This paradigm improves both generalization and scalability. To address the diverse causes of low-light degradation, we leverage pretrained generative diffusion models to optimize images, achieving zero-shot generalization performance. Building on this, we propose Semantically Consistent Unsupervised Fine-tuning (SCUF). Specifically, to overcome text prompt limitations, we introduce an illumination-aware image prompt to explicitly guide image generation and propose a cycle-attention adapter to maximize its semantic potential. To mitigate semantic degradation in unsupervised training, we propose caption and reflectance consistency to learn high-level semantics and image-level spatial semantics. Extensive experiments demonstrate that our proposed method outperforms current state-of-the-art methods in traditional image quality and GEFU tasks including classification, detection, and semantic segmentation.
https://arxiv.org/abs/2507.08380
Video large language models (LLMs) achieve strong video understanding by leveraging a large number of spatio-temporal tokens, but suffer from quadratic computational scaling with token count. To address this, we propose a training-free spatio-temporal token merging method, named STTM. Our key insight is to exploit local spatial and temporal redundancy in video data which has been overlooked in prior work. STTM first transforms each frame into multi-granular spatial tokens using a coarse-to-fine search over a quadtree structure, then performs directed pairwise merging across the temporal dimension. This decomposed merging approach outperforms existing token reduction methods across six video QA benchmarks. Notably, STTM achieves a 2$\times$ speed-up with only a 0.5% accuracy drop under a 50% token budget, and a 3$\times$ speed-up with just a 2% drop under a 30% budget. Moreover, STTM is query-agnostic, allowing KV cache reuse across different questions for the same video. The project page is available at this https URL.
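The following simplified sketch illustrates the coarse-to-fine quadtree merging over one frame's token grid; the fixed cosine-similarity threshold `tau` stands in for the paper's budget-driven criterion, and the temporal directed pairwise merging step is not shown.

```python
# Simplified sketch of coarse-to-fine quadtree spatial token merging.
# `tau` is a hypothetical similarity threshold, not the paper's criterion.
import torch
import torch.nn.functional as F

def quadtree_merge(tokens: torch.Tensor, tau: float = 0.9):
    """tokens: (H, W, D) grid of spatial tokens; returns a list of merged tokens."""
    H, W, D = tokens.shape
    if H < 2 or W < 2:
        return list(tokens.reshape(-1, D))
    h, w = H // 2, W // 2
    blocks = [tokens[:h, :w], tokens[:h, w:], tokens[h:, :w], tokens[h:, w:]]
    means = torch.stack([b.reshape(-1, D).mean(0) for b in blocks])
    sim = F.cosine_similarity(means.unsqueeze(1), means.unsqueeze(0), dim=-1)
    if sim.min() >= tau:                       # block is spatially redundant:
        return [tokens.reshape(-1, D).mean(0)]  # keep one coarse token
    merged = []                                 # otherwise refine each quadrant
    for b in blocks:
        merged.extend(quadtree_merge(b, tau))
    return merged
```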
https://arxiv.org/abs/2507.07990
Contrastive vision-language models like CLIP are used for a large variety of applications, such as zero-shot classification or as vision encoders for multi-modal models. Despite their popularity, their representations show major limitations. For instance, CLIP models learn bag-of-words representations and, as a consequence, fail to distinguish whether an image shows "a yellow submarine and a blue bus" or "a blue submarine and a yellow bus". Previous attempts to fix this issue added hard negatives during training or modified the architecture, but failed to resolve the problem in its entirety. We suspect that the missing insights for solving the binding problem in CLIP are hidden in the arguably most important part of learning algorithms: the data. In this work, we fill this gap by rigorously identifying the influence of data properties on CLIP's ability to learn binding, using a synthetic dataset. We find that common properties of natural data, such as low attribute density, incomplete captions, and the saliency bias (a tendency of human captioners to describe the object that is "most salient" to them), have a detrimental effect on binding performance. In contrast to common belief, we find that neither scaling the batch size (i.e., implicitly adding more hard negatives) nor explicitly creating hard negatives enables CLIP to learn reliable binding. Only when the data expresses our identified data properties does CLIP learn almost perfect binding.
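A toy generator in the spirit of such a synthetic dataset is sketched below, exposing the three controlled properties (attribute density, caption completeness, saliency bias); the vocabularies and the saliency rule are illustrative choices, not the paper's construction.

```python
# Illustrative generator for synthetic scene-caption pairs with controllable
# attribute density, caption completeness, and a crude saliency-bias proxy.
import random

OBJECTS = ["submarine", "bus", "cube", "sphere"]
COLORS = ["yellow", "blue", "red", "green"]

def make_sample(n_objects=2, attribute_density=1.0, complete_caption=True,
                saliency_bias=False):
    scene = [(random.choice(COLORS), random.choice(OBJECTS))
             for _ in range(n_objects)]
    described = scene if complete_caption else scene[:1]   # drop some objects
    if saliency_bias:
        # Crude proxy: the captioner only mentions one "most salient" object.
        described = [max(described, key=lambda o: OBJECTS.index(o[1]))]
    phrases = []
    for color, obj in described:
        # With low attribute density, colours are sometimes omitted.
        keep_attr = random.random() < attribute_density
        phrases.append(f"a {color} {obj}" if keep_attr else f"a {obj}")
    return scene, " and ".join(phrases)
```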
https://arxiv.org/abs/2507.07985
Progress in a research field can be hard to assess, in particular when many concurrent methods are proposed in a short period of time. This is the case in digital pathology, where many foundation models have been released recently to serve as feature extractors for tile-level images, being used in a variety of downstream tasks for both tile- and slide-level problems. Benchmarking the available methods then becomes paramount to get a clearer view of the research landscape. In particular, in critical domains such as healthcare, a benchmark should not only focus on evaluating downstream performance, but also provide insights about the main differences between methods and, importantly, further consider uncertainty and robustness to ensure reliable usage of the proposed models. For these reasons, we introduce THUNDER, a tile-level benchmark for digital pathology foundation models that allows efficient comparison of many models on diverse datasets with a series of downstream tasks, studying their feature spaces and assessing the robustness and uncertainty of predictions informed by their embeddings. THUNDER is a fast, easy-to-use, dynamic benchmark that already supports a large variety of state-of-the-art foundation models, as well as local, user-defined models for direct tile-based comparison. In this paper, we provide a comprehensive comparison of 23 foundation models on 16 different datasets covering diverse tasks, feature analysis, and robustness. The code for THUNDER is publicly available at this https URL.
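As an illustration of the kind of tile-level evaluation such a benchmark runs, the sketch below fits a linear probe on frozen features; `extract_features` is a hypothetical stand-in for any foundation-model backbone and does not reflect THUNDER's actual API.

```python
# Minimal linear-probe evaluation of frozen tile-level features.
# `extract_features` is an assumed per-tile feature extractor.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def linear_probe(extract_features, tiles_train, y_train, tiles_test, y_test):
    X_train = np.stack([extract_features(t) for t in tiles_train])
    X_test = np.stack([extract_features(t) for t in tiles_test])
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return f1_score(y_test, clf.predict(X_test), average="macro")
```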
https://arxiv.org/abs/2507.07860
Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: this https URL.
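A rough sketch of such an uncertainty head is shown below: the visual embedding, the predicted textual embedding, and an image-conditioned textual representation obtained by cross-attention over all class prompts are concatenated and fed to a binary failure classifier; dimensions and layer choices are illustrative, not ViLU's exact architecture.

```python
# Rough sketch of a ViLU-style uncertainty head with cross-attention fusion.
# Dimensions, depths, and the classifier shape are illustrative assumptions.
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, img_emb, pred_txt_emb, all_txt_emb):
        # Image-conditioned textual representation: the image queries all prompts.
        ctx, _ = self.cross_attn(img_emb.unsqueeze(1), all_txt_emb, all_txt_emb)
        fused = torch.cat([img_emb, pred_txt_emb, ctx.squeeze(1)], dim=-1)
        return self.classifier(fused).squeeze(-1)  # logit of "prediction is wrong"

# Trained with a weighted binary cross-entropy on correct/incorrect labels, e.g.
# nn.BCEWithLogitsLoss(pos_weight=...)(logits, targets).
```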
https://arxiv.org/abs/2507.07620
Foundation models are pre-trained on large-scale datasets and subsequently fine-tuned on small-scale datasets using parameter-efficient fine-tuning (PEFT) techniques like low-rank adapters (LoRA). In most previous works, LoRA weight matrices are randomly initialized with a fixed rank across all attachment points. In this paper, we improve the convergence and final performance of LoRA fine-tuning using our proposed data-driven weight initialization method, ConsNoTrainLoRA (CNTLoRA). We express LoRA initialization as a domain-shift problem in which we use multiple constraints relating the pre-training and fine-tuning activations. By reformulating these constraints, we obtain a closed-form estimate of the LoRA weights that depends on the pre-training weights and fine-tuning activation vectors and hence requires no training during initialization. This weight estimate is decomposed to initialize the up and down matrices, with the proposed flexibility of variable ranks. With the proposed initialization method, we fine-tune on downstream tasks such as image generation, image classification, and image understanding. Both quantitative and qualitative results demonstrate that CNTLoRA outperforms standard and data-driven weight initialization methods. Extensive analyses and ablations further elucidate the design choices of our framework, providing an optimal recipe for faster convergence and enhanced performance.
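The sketch below illustrates one plausible form of such a data-driven initialization: a weight update is estimated in closed form from fine-tuning activations by least squares and then factored into up/down matrices of a chosen rank via a truncated SVD. It is a reconstruction under stated assumptions, not the paper's exact derivation.

```python
# Hedged sketch of a training-free, data-driven LoRA initialization:
# least-squares update estimate followed by a truncated-SVD factorization.
import torch

def init_lora(W0, X_in, X_out_target, rank):
    """W0: (d_out, d_in) frozen weight; X_in: (n, d_in) fine-tuning inputs;
    X_out_target: (n, d_out) desired outputs on the fine-tuning domain."""
    # Residual the adapter should explain on this activation batch.
    residual = X_out_target - X_in @ W0.T                      # (n, d_out)
    # Least-squares estimate of Delta with X_in @ Delta.T ~ residual.
    delta = torch.linalg.lstsq(X_in, residual).solution.T      # (d_out, d_in)
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    B = U[:, :rank] * S[:rank]        # "up" matrix, (d_out, rank)
    A = Vh[:rank]                     # "down" matrix, (rank, d_in)
    return A, B                       # adapted layer: x @ (W0 + B @ A).T
```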
https://arxiv.org/abs/2507.08044
Spatial audio is an integral part of immersive entertainment, such as VR/AR, and has seen increasing popularity in cinema and music as well. The most common spatial audio format is first-order Ambisonics (FOA). We seek to extend recent advancements in FOA generative AI models to enable the generation of 3D scenes with dynamic sound sources. Our proposed end-to-end model, SonicMotion, comes in two variations, which differ in their user input and level of precision in sound source localization. In addition to our model, we also present a new dataset of simulated spatial audio-caption pairs. Evaluations of our models demonstrate that they match the semantic alignment and audio quality of state-of-the-art models while capturing the desired spatial attributes.
https://arxiv.org/abs/2507.07318
Building state-of-the-art Vision-Language Models (VLMs) with strong captioning capabilities typically necessitates training on billions of high-quality image-text pairs, requiring millions of GPU hours. This paper introduces the Vision-Language-Vision (VLV) auto-encoder framework, which strategically leverages key pretrained components: a vision encoder, the decoder of a Text-to-Image (T2I) diffusion model, and subsequently, a Large Language Model (LLM). Specifically, we establish an information bottleneck by regularizing the language representation space, achieved through freezing the pretrained T2I diffusion decoder. Our VLV pipeline effectively distills knowledge from the text-conditioned diffusion model using continuous embeddings, demonstrating comprehensive semantic understanding via high-quality reconstructions. Furthermore, by fine-tuning a pretrained LLM to decode the intermediate language representations into detailed descriptions, we construct a state-of-the-art (SoTA) captioner comparable to leading models like GPT-4o and Gemini 2.0 Flash. Our method demonstrates exceptional cost-efficiency and significantly reduces data requirements; by primarily utilizing single-modal images for training and maximizing the utility of existing pretrained models (image encoder, T2I diffusion model, and LLM), it circumvents the need for massive paired image-text datasets, keeping the total training expenditure under $1,000 USD.
https://arxiv.org/abs/2507.07104
Decoding visual experience from brain signals offers exciting possibilities for neuroscience and interpretable AI. While EEG is accessible and temporally precise, its limitations in spatial detail hinder image reconstruction. Our model bypasses direct EEG-to-image generation by aligning EEG signals with multilevel semantic captions -- ranging from object-level to abstract themes -- generated by a large language model. A transformer-based EEG encoder maps brain activity to these captions through contrastive learning. During inference, caption embeddings retrieved via projection heads condition a pretrained latent diffusion model for image generation. This text-mediated framework yields state-of-the-art visual decoding on the EEGCVPR dataset, with interpretable alignment to known neurocognitive pathways. Dominant EEG-caption associations reflected the importance of different semantic levels extracted from perceived images. Saliency maps and t-SNE projections reveal semantic topography across the scalp. Our model demonstrates how structured semantic mediation enables cognitively aligned visual decoding from EEG.
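The alignment step can be pictured with the standard symmetric InfoNCE objective sketched below, matching EEG embeddings to caption embeddings within a batch; the encoders and the temperature value are placeholders rather than the paper's settings.

```python
# Symmetric InfoNCE sketch for aligning EEG embeddings with caption embeddings.
# The temperature and batch construction are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_style_loss(eeg_emb, cap_emb, temperature=0.07):
    eeg_emb = F.normalize(eeg_emb, dim=-1)
    cap_emb = F.normalize(cap_emb, dim=-1)
    logits = eeg_emb @ cap_emb.T / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```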
https://arxiv.org/abs/2507.07157
Microscopic assessment of histopathology images is vital for accurate cancer diagnosis and treatment. Whole Slide Image (WSI) classification and captioning have become crucial tasks in computer-aided pathology. However, microscopic WSIs face challenges such as redundant patches and unknown patch positions due to subjective pathologist captures. Moreover, generating automatic pathology captions remains a significant challenge. To address these issues, we introduce GNN-ViTCap, a novel framework for classification and caption generation from histopathological microscopic images. First, a visual feature extractor generates patch embeddings. Redundant patches are then removed by dynamically clustering these embeddings using deep embedded clustering and selecting representative patches via a scalar dot-attention mechanism. We build a graph by connecting each node to its nearest neighbors in the similarity matrix and apply a graph neural network to capture both local and global context. The aggregated image embeddings are projected into the language model's input space through a linear layer and combined with caption tokens to fine-tune a large language model. We validate our method on the BreakHis and PatchGastric datasets. GNN-ViTCap achieves an F1 score of 0.934 and an AUC of 0.963 for classification, along with a BLEU-4 score of 0.811 and a METEOR score of 0.569 for captioning. Experimental results demonstrate that GNN-ViTCap outperforms state-of-the-art approaches, offering a reliable and efficient solution for microscopy-based patient diagnosis.
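The graph-building step can be sketched as below: each selected patch embedding is linked to its k nearest neighbours in the cosine-similarity matrix and neighbour features are aggregated (a single mean-aggregation layer is shown); the value of k and the aggregation rule are illustrative.

```python
# Sketch of kNN-graph construction over patch embeddings plus one
# mean-aggregation message-passing step; k and the residual rule are assumed.
import torch
import torch.nn.functional as F

def knn_graph_aggregate(patch_emb: torch.Tensor, k: int = 8):
    """patch_emb: (N, D) selected patch embeddings."""
    normed = F.normalize(patch_emb, dim=-1)
    sim = normed @ normed.T                        # cosine-similarity matrix
    sim.fill_diagonal_(-1.0)                       # exclude self-loops
    neighbours = sim.topk(k, dim=-1).indices       # (N, k) nearest neighbours
    aggregated = patch_emb[neighbours].mean(dim=1) # mean over neighbours, (N, D)
    return patch_emb + aggregated                  # residual-style update
```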
https://arxiv.org/abs/2507.07006
Animation colorization is a crucial part of real animation industry production. Long animation colorization has high labor costs, so automated long animation colorization based on video generation models has significant research value. Existing studies are limited to short-term colorization and adopt a local paradigm, fusing overlapping features to achieve smooth transitions between local segments. However, the local paradigm neglects global information and fails to maintain long-term color consistency. In this study, we argue that ideal long-term color consistency can be achieved through a dynamic global-local paradigm, i.e., dynamically extracting global color-consistent features relevant to the current generation. Specifically, we propose LongAnimation, a novel framework, which mainly includes a SketchDiT, a Dynamic Global-Local Memory (DGLM), and a Color Consistency Reward. The SketchDiT captures hybrid reference features to support the DGLM module. The DGLM module employs a long-video understanding model to dynamically compress global historical features and adaptively fuse them with the current generation features. To refine the color consistency, we introduce a Color Consistency Reward. During inference, we propose a color consistency fusion to smooth video segment transitions. Extensive experiments on both short-term (14 frames) and long-term (average 500 frames) animations show the effectiveness of LongAnimation in maintaining short-term and long-term color consistency for the open-domain animation colorization task. The code can be found at this https URL.
https://arxiv.org/abs/2507.01945
Pretrained vision-language models (VLMs) such as CLIP excel in multimodal understanding but struggle with contextually relevant fine-grained visual features, making it difficult to distinguish visually similar yet culturally distinct concepts. This limitation stems from the scarcity of high-quality culture-specific datasets, the lack of integrated contextual knowledge, and the absence of hard negatives highlighting subtle distinctions. To address these challenges, we first design a data curation pipeline that leverages open-sourced VLMs and text-to-image diffusion models to construct CulTwin, a synthetic cultural dataset. This dataset consists of paired concept-caption-image triplets, where concepts visually resemble each other but represent different cultural contexts. Then, we fine-tune CLIP on CulTwin to create CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images through customized contrastive learning, enabling finer cultural differentiation while preserving generalization capabilities. Experiments on culturally relevant benchmarks show that CultureCLIP outperforms the base CLIP, achieving up to a notable 5.49% improvement in fine-grained concept recognition on certain tasks, while preserving CLIP's original generalization ability, validating the effectiveness of our data synthesis and VLM backbone training paradigm in capturing subtle cultural distinctions.
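One plausible form of the customized contrastive objective is sketched below, using each concept's culturally distinct "twin" image as an explicit hard negative for its contextual caption; the margin formulation is an assumption, not the paper's exact loss.

```python
# Hedged sketch of a twin-aware contrastive term: the caption should be
# closer to its own synthetic image than to the visually similar cultural twin.
import torch
import torch.nn.functional as F

def twin_contrastive_loss(img, txt, twin_img, margin=0.2):
    img, txt, twin_img = (F.normalize(x, dim=-1) for x in (img, txt, twin_img))
    pos = (img * txt).sum(-1)          # similarity to the matching caption
    neg = (twin_img * txt).sum(-1)     # similarity of the twin image to it
    return F.relu(margin - pos + neg).mean()
```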
https://arxiv.org/abs/2507.06210
Notable breakthroughs in unified understanding and generation modeling have led to remarkable advancements in image understanding, reasoning, production, and editing, yet current foundational models predominantly focus on processing images, creating a gap in the development of unified models for video understanding and generation. This report presents Omni-Video, an efficient and effective unified framework for video understanding, generation, and instruction-based editing. Our key insight is to teach existing multimodal large language models (MLLMs) to produce continuous visual clues that are used as the input of diffusion decoders, which produce high-quality videos conditioned on these visual clues. To fully unlock the potential of our system for unified video modeling, we integrate several technical improvements: 1) a lightweight architectural design that attaches a vision head on top of the MLLM and an adapter before the input of the diffusion decoder, where the vision head produces visual tokens and the adapter maps them into the conditional space of the diffusion decoder; and 2) an efficient multi-stage training scheme that facilitates a fast connection between the MLLM and the diffusion decoder with limited data and computational resources. We empirically demonstrate that our model exhibits satisfactory generalization abilities across video generation, editing, and understanding tasks.
https://arxiv.org/abs/2507.06119
Accurate driving behavior recognition and reasoning are critical for autonomous driving video understanding. However, existing methods often capture only shallow causal relations, fail to address spurious correlations across modalities, and ignore ego-vehicle-level causality modeling. To overcome these limitations, we propose a novel Multimodal Causal Analysis Model (MCAM) that constructs latent causal structures between the visual and language modalities. First, we design a multi-level feature extractor to capture long-range dependencies. Second, we design a causal analysis module that dynamically models driving scenarios using a directed acyclic graph (DAG) of driving states. Third, we utilize a vision-language transformer to align critical visual features with their corresponding linguistic expressions. Extensive experiments on the BDD-X and CoVLA datasets demonstrate that MCAM achieves SOTA performance in visual-language causal relationship learning. Furthermore, the model exhibits superior capability in capturing causal characteristics within video sequences, showcasing its effectiveness for autonomous driving applications. The code is available at this https URL.
https://arxiv.org/abs/2507.06072
Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by strategically leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms. The Expanding Depth Channel (EDC) method concatenates the depth map as an additional input channel to the segmentation network; Sharing ViT (SV) designs a uniform ViT backbone shared between the depth estimation and segmentation branches; Depth Supervision (DS) uses depth prediction as an auxiliary training guide for feature learning. Though DS exhibits limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. With a Swin-L backbone, our EDC method achieves 56.2 AP, setting a new state-of-the-art result on the OVIS benchmark. This work conclusively establishes depth cues as critical enablers for robust video understanding.
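The EDC variant is simple enough to sketch directly: the first convolution of the segmentation backbone is inflated from 3 to 4 input channels so RGB and depth can be fed jointly; zero-initializing the new depth channel is a common practice assumed here, not necessarily the paper's choice.

```python
# Sketch of the Expanding Depth Channel (EDC) idea: inflate the first conv
# from 3 to 4 input channels. Zero-init of the depth channel is an assumption.
import torch
import torch.nn as nn

def expand_first_conv(conv: nn.Conv2d) -> nn.Conv2d:
    new_conv = nn.Conv2d(4, conv.out_channels, conv.kernel_size,
                         conv.stride, conv.padding, bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight[:, :3] = conv.weight          # copy RGB filters
        new_conv.weight[:, 3:] = 0.0                  # depth channel starts at zero
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias)
    return new_conv

# Usage: x = torch.cat([rgb, depth], dim=1)  # (B, 4, H, W) input to the network
```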
https://arxiv.org/abs/2507.05948