2D Gaussian Splatting (2DGS) is an emerging explicit scene representation method with significant potential for image compression due to its high fidelity and high compression ratios. However, existing low-light enhancement algorithms operate predominantly in the pixel domain, so processing 2DGS-compressed images requires a cumbersome decompression-enhancement-recompression pipeline, which compromises efficiency and introduces secondary degradation. To address these limitations, we propose LL-GaussianImage, the first zero-shot unsupervised framework for low-light enhancement directly within the 2DGS compressed representation domain. The framework offers three primary advantages. First, a semantic-guided Mixture-of-Experts enhancement framework applies dynamic adaptive transformations to the sparse attribute space of 2DGS, using rendered images as guidance, enabling compression-as-enhancement without full decompression to a pixel grid. Second, a multi-objective collaborative loss function system strictly constrains smoothness and fidelity during enhancement, suppressing artifacts while improving visual quality. Third, a two-stage optimization process achieves reconstruction-as-enhancement: single-scale reconstruction ensures the accuracy of the base representation while also improving network robustness. The framework achieves high-quality enhancement of low-light images while maintaining high compression ratios. Experimental results validate the feasibility and superiority of this paradigm of direct processing within the compressed representation domain.
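The attribute-space enhancement idea can be pictured with a toy transform: a gain/gamma curve applied directly to per-Gaussian color attributes, with no decompression to a pixel grid. This is only a hand-written sketch; in LL-GaussianImage the transform is predicted dynamically by the semantic-guided Mixture-of-Experts, and the function and parameter names below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def enhance_gaussian_colors(colors, gain=1.2, gamma=0.6):
    """Brighten per-Gaussian RGB attributes with a fixed gain/gamma curve.

    colors: (N, 3) array of 2DGS color attributes in [0, 1].
    gamma < 1 lifts dark values; gain scales the result; output is re-clipped.
    """
    curved = np.clip(colors, 0.0, 1.0) ** gamma   # lift shadows
    return np.clip(gain * curved, 0.0, 1.0)
```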
https://arxiv.org/abs/2601.15772
This paper presents a family of advanced vision encoders, named OpenVision 3, that learns a single, unified visual representation that can serve both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably to a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
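The joint objective described above can be sketched in a few lines: a reconstruction term on the VAE latents plus a symmetric InfoNCE contrastive term on the paired embeddings. This is a minimal NumPy illustration of the two training signals, not the OpenVision 3 code; the names and the simple additive weighting are assumptions.

```python
import numpy as np

def info_nce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    labels = np.arange(len(logits))

    def xent(l):
        # cross-entropy with the matching pair on the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[labels, labels].mean()

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def joint_loss(latents, recon, img_emb, txt_emb, alpha=1.0):
    """Reconstruction-driven plus semantics-driven signal, naively summed."""
    recon_loss = np.mean((latents - recon) ** 2)
    return recon_loss + alpha * info_nce(img_emb, txt_emb)
```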
https://arxiv.org/abs/2601.15369
Most existing time series classification methods adopt a discriminative paradigm that maps input sequences directly to one-hot encoded class labels. While effective, this paradigm struggles to incorporate contextual features and fails to capture semantic relationships among classes. To address these limitations, we propose InstructTime, a novel framework that reformulates time series classification as a multimodal generative task. Specifically, continuous numerical sequences, contextual textual features, and task instructions are treated as multimodal inputs, while class labels are generated as textual outputs by tuned language models. To bridge the modality gap, InstructTime introduces a time series discretization module that converts continuous sequences into discrete temporal tokens, together with an alignment projection layer and a generative self-supervised pre-training strategy to enhance cross-modal representation alignment. Building upon this framework, we further propose InstructTime++, which extends InstructTime by incorporating implicit feature modeling to compensate for the limited inductive bias of language models. InstructTime++ leverages specialized toolkits to mine informative implicit patterns from raw time series and contextual inputs, including statistical feature extraction and vision-language-based image captioning, and translates them into textual descriptions for seamless integration. Extensive experiments on multiple benchmark datasets demonstrate the superior performance of InstructTime++.
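The discretization step can be illustrated with the simplest possible tokenizer: uniform binning of a min-max-scaled series. InstructTime's actual module may use a learned codebook, so treat this as a sketch of the idea only; the function name and defaults are assumptions.

```python
import numpy as np

def discretize_series(x, num_tokens=256):
    """Map a continuous series to integer temporal tokens by uniform binning.

    A simplified stand-in for InstructTime's discretization module.
    """
    lo, hi = x.min(), x.max()
    if hi == lo:                       # constant series -> a single token
        return np.zeros(len(x), dtype=int)
    # scale to [0, 1], then quantize into num_tokens equal-width bins
    scaled = (x - lo) / (hi - lo)
    return np.minimum((scaled * num_tokens).astype(int), num_tokens - 1)
```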
https://arxiv.org/abs/2601.14968
Models for image representation learning are typically designed for either recognition or generation. Various forms of contrastive learning help models learn to convert images to embeddings that are useful for classification, detection, and segmentation. On the other hand, models can be trained to reconstruct images with pixel-wise, perceptual, and adversarial losses in order to learn a latent space that is useful for image generation. We seek to unify these two directions with a first-of-its-kind model that learns representations which are simultaneously useful for recognition and generation. We train our model as a hyper-network for implicit neural representation, which learns to map images to model weights for fast, accurate reconstruction. We further integrate our INR hyper-network with knowledge distillation to improve its generalization and performance. Beyond the novel training design, the model also learns an unprecedented compressed embedding space with outstanding performance for various visual tasks. The complete model competes with state-of-the-art results for image representation learning, while also enabling generative capabilities with its high-quality tiny embeddings. The code is available at this https URL.
https://arxiv.org/abs/2601.14256
Few-shot learning aims to identify novel categories from only a handful of labeled samples, where prototypes estimated from scarce data are often biased and generalize poorly. Semantic-based methods alleviate this by introducing coarse class-level information, but they are mostly applied on the support side, leaving query representations unchanged. In this paper, we present PMCE, a Probabilistic few-shot framework that leverages Multi-granularity semantics with Caption-guided Enhancement. PMCE constructs a nonparametric knowledge bank that stores visual statistics for each category as well as CLIP-encoded class name embeddings of the base classes. At meta-test time, the most relevant base classes are retrieved based on the similarities of class name embeddings for each novel category. These statistics are then aggregated into category-specific prior information and fused with the support set prototypes via a simple MAP update. Simultaneously, a frozen BLIP captioner provides label-free instance-level image descriptions, and a lightweight enhancer trained on base classes optimizes both support prototypes and query features under an inductive protocol with a consistency regularization to stabilize noisy captions. Experiments on four benchmarks show that PMCE consistently improves over strong baselines, achieving up to 7.71% absolute gain over the strongest semantic competitor on MiniImageNet in the 1-shot setting. Our code is available at this https URL
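The "simple MAP update" that fuses retrieved priors with support prototypes can be sketched as a conjugate-Gaussian posterior mean: a shot-count-weighted blend of the support mean and the prior mean. The abstract does not give PMCE's exact form, so the code below is an illustrative assumption.

```python
import numpy as np

def map_prototype(support_feats, prior_mean, prior_strength=10.0):
    """Fuse the support-set mean with a retrieved class prior (MAP-style).

    Posterior mean = convex blend of support mean and prior mean, weighted
    by the shot count n against a pseudo-count prior_strength.
    """
    n = len(support_feats)
    support_mean = support_feats.mean(axis=0)
    w = n / (n + prior_strength)        # more shots -> trust the support more
    return w * support_mean + (1.0 - w) * prior_mean
```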
https://arxiv.org/abs/2601.14111
Mooney images are high-contrast, two-tone visual stimuli, created by thresholding photographic images. They allow researchers to separate image content from image understanding, making them valuable for studying visual perception. An ideal Mooney image for this purpose achieves a specific balance: it initially appears unrecognizable but becomes fully interpretable to the observer after seeing the original template. Researchers traditionally created these stimuli manually using subjective criteria, which is labor-intensive and can introduce inconsistencies across studies. Automated generation techniques now offer an alternative to this manual approach. Here, we present MooneyMaker, an open-source Python package that automates the generation of ambiguous Mooney images using several complementary approaches. Users can choose between various generation techniques that range from approaches based on image statistics to deep learning models. These models strategically alter edge information to increase initial ambiguity. The package lets users create two-tone images with multiple methods and directly compare the results visually. In an experiment, we validate MooneyMaker by generating Mooney images using different techniques and assess their recognizability for human observers before and after disambiguating them by presenting the template images. Our results reveal that techniques with lower initial recognizability are associated with higher post-template recognition (i.e. a larger disambiguation effect). To help vision scientists build effective databases of Mooney stimuli, we provide practical guidelines for technique selection. By standardizing the generation process, MooneyMaker supports more consistent and reproducible visual perception research.
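The classic statistics-based recipe the abstract describes (smooth, then threshold into two tones) fits in a few lines of NumPy. This is a minimal stand-in for MooneyMaker's simplest technique, not the package's API; its deep-learning-based generators are more involved.

```python
import numpy as np

def mooney(image, blur_radius=2, threshold=None):
    """Two-tone a grayscale image: box-blur, then threshold at the median.

    image: 2-D array of intensities. Returns a {0, 1} array of the same shape.
    """
    img = image.astype(float)
    k = 2 * blur_radius + 1
    pad = np.pad(img, blur_radius, mode="edge")
    # k x k mean filter as a crude smoothing stand-in for a Gaussian blur
    win = np.lib.stride_tricks.sliding_window_view(pad, (k, k))
    smooth = win.mean(axis=(-1, -2))
    t = np.median(smooth) if threshold is None else threshold
    return (smooth >= t).astype(np.uint8)   # 1 = white, 0 = black
```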
https://arxiv.org/abs/2601.14077
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications, allowing users to retrieve a target image by providing a reference image and a relative caption describing the desired modifications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. They primarily rely on either transforming the multimodal query into a single text using image-to-text models or employing large language models for target image description generation, approaches that often fail to capture complementary visual information and complete semantic context. To address these limitations, we propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration (CVSI). Specifically, CVSI leverages three key components: (1) Visual Information Extraction, which not only extracts global image features but also uses a pre-trained mapping network to convert the image into a pseudo token, combining it with the modification text and the objects most likely to be added. (2) Semantic Information Extraction, which involves using a pre-trained captioning model to generate multiple captions for the reference image, followed by leveraging an LLM to generate the modified captions and the objects most likely to be added. (3) Complementary Information Retrieval, which integrates information extracted from both the query and database images to retrieve the target image, enabling the system to efficiently handle retrieval queries in a variety of situations. Extensive experiments on three public datasets (CIRR, CIRCO, and FashionIQ) demonstrate that CVSI significantly outperforms existing state-of-the-art methods. Our code is available at this https URL.
https://arxiv.org/abs/2601.14060
Modern vision backbones for 3D medical imaging typically process dense voxel grids through parameter-heavy encoder-decoder structures, a design that allocates a significant portion of its parameters to spatial reconstruction rather than feature learning. Our approach introduces SVGFormer, a decoder-free pipeline built upon a content-aware grouping stage that partitions the volume into a semantic graph of supervoxels. Its hierarchical encoder learns rich node representations by combining a patch-level Transformer with a supervoxel-level Graph Attention Network, jointly modeling fine-grained intra-region features and broader inter-regional dependencies. This design concentrates all learnable capacity on feature encoding and provides inherent, dual-scale explainability from the patch to the region level. To validate the framework's flexibility, we trained two specialized models on the BraTS dataset: one for node-level classification and one for tumor proportion regression. Both models achieved strong performance, with the classification model achieving a F1-score of 0.875 and the regression model a MAE of 0.028, confirming the encoder's ability to learn discriminative and localized features. Our results establish that a graph-based, encoder-only paradigm offers an accurate and inherently interpretable alternative for 3D medical image representation.
https://arxiv.org/abs/2601.14055
Recent advancements in Large Vision-Language Models (LVLMs) have pushed them closer to becoming general-purpose assistants. Despite their strong performance, LVLMs still struggle with vision-centric tasks such as image classification, underperforming compared to their base vision encoders, which are often CLIP-based models. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a novel, model-agnostic framework which introduces vision-integration layers and a context-aware ensemble strategy to identify when to prioritize image representations or rely on the reasoning capabilities of the language model. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations, leading to consistent improvements in generalization across classification and vision-language benchmarks. Extensive experiments demonstrate that CARPE not only improves performance on image classification benchmarks but also enhances results across various vision-language benchmarks. Finally, CARPE is designed to be effectively integrated with most open-source LVLMs that consist of a vision encoder and a language model, ensuring its adaptability across diverse architectures.
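The context-aware ensemble can be pictured as a gated blend of vision-derived and language-model-derived class logits. The sketch below is a hypothetical rendering: in CARPE the gate would be produced by the context-aware strategy itself, whereas here it is just a scalar input.

```python
import numpy as np

def gated_ensemble(vision_logits, text_logits, gate_score):
    """Blend vision-based and LM-based logits with a sigmoid gate.

    gate_score >> 0 prioritizes the image representation; gate_score << 0
    falls back on the language model's reasoning.
    """
    w = 1.0 / (1.0 + np.exp(-gate_score))   # sigmoid -> weight in (0, 1)
    return w * vision_logits + (1.0 - w) * text_logits
```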
https://arxiv.org/abs/2601.13622
Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.
https://arxiv.org/abs/2601.12964
Current region feature-based image captioning methods have progressed rapidly and achieved remarkable performance. However, they are still prone to generating irrelevant descriptions due to the lack of contextual information and the over-reliance on generated partial descriptions for predicting the remaining words. In this paper, we propose a Dual-Stream Collaborative Transformer (DSCT) to address this issue by introducing the segmentation feature. The proposed DSCT consolidates and then fuses the region and segmentation features to guide the generation of caption sentences. It contains multiple Pattern-Specific Mutual Attention Encoders (PSMAEs) and Dynamic Nomination Decoders (DNDs). The PSMAE effectively highlights and consolidates the private information of two representations by querying each other. The DND dynamically searches for the most relevant learning blocks to the input textual representations and exploits the homogeneous features between the consolidated region and segmentation features to generate more accurate and descriptive caption sentences. To the best of our knowledge, this is the first study to explore how to fuse different pattern-specific features in a dynamic way to bypass their semantic inconsistencies and spatial misalignment issues for image captioning. The experimental results from popular benchmark datasets demonstrate that our DSCT outperforms the state-of-the-art image captioning models in the literature.
https://arxiv.org/abs/2601.12926
Deep learning has achieved remarkable success in image recognition, yet its inherent opacity poses challenges for deployment in critical domains. Concept-based interpretations aim to address this by explaining model reasoning through human-understandable concepts. However, existing post-hoc methods and ante-hoc concept bottleneck models (CBMs) suffer from limitations such as unreliable concept relevance, non-visual or labor-intensive concept definitions, and model- or data-agnostic assumptions. This paper introduces the Post-hoc Concept Bottleneck Model via Representation Decomposition (PCBM-ReD), a novel pipeline that retrofits interpretability onto pretrained opaque models. PCBM-ReD automatically extracts visual concepts from a pre-trained encoder, employs multimodal large language models (MLLMs) to label and filter concepts based on visual identifiability and task relevance, and selects an independent subset via reconstruction-guided optimization. Leveraging CLIP's visual-text alignment, it decomposes image representations into a linear combination of concept embeddings to fit the CBM abstraction. Extensive experiments across 11 image classification tasks show PCBM-ReD achieves state-of-the-art accuracy, narrows the performance gap with end-to-end models, and exhibits better interpretability.
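The representation-decomposition step has a natural linear-algebra reading: solve a least-squares problem that expresses the image embedding in the span of the concept embeddings, and use the coefficients as concept activations. The sketch below illustrates that reading; the paper's actual fitting procedure may differ.

```python
import numpy as np

def decompose_to_concepts(img_emb, concept_embs):
    """Express an image embedding as a linear combination of concept embeddings.

    Solves min_w ||C^T w - v||^2; the weights w act as concept activations
    feeding a CBM-style interpretable head.
    """
    # concept_embs: (num_concepts, d); img_emb: (d,)
    w, *_ = np.linalg.lstsq(concept_embs.T, img_emb, rcond=None)
    recon = concept_embs.T @ w          # reconstruction in embedding space
    return w, recon
```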
https://arxiv.org/abs/2601.12303
Despite recent progress, medical foundation models still struggle to unify visual understanding and generation, as these tasks have inherently conflicting goals: semantic abstraction versus pixel-level reconstruction. Existing approaches, typically based on parameter-shared autoregressive architectures, frequently lead to compromised performance in one or both tasks. To address this, we present UniX, a next-generation unified medical foundation model for chest X-ray understanding and generation. UniX decouples the two tasks into an autoregressive branch for understanding and a diffusion branch for high-fidelity generation. Crucially, a cross-modal self-attention mechanism is introduced to dynamically guide the generation process with understanding features. Coupled with a rigorous data cleaning pipeline and a multi-stage training strategy, this architecture enables synergistic collaboration between tasks while leveraging the strengths of diffusion models for superior generation. On two representative benchmarks, UniX achieves a 46.1% improvement in understanding performance (Micro-F1) and a 24.2% gain in generation quality (FD-RadDino), using only a quarter of the parameters of LLM-CXR. By achieving performance on par with task-specific models, our work establishes a scalable paradigm for synergistic medical image understanding and generation. Codes and models are available at this https URL.
https://arxiv.org/abs/2601.11522
Vision-Language Pre-training (VLP) models demonstrate strong performance across various downstream tasks by learning from large-scale image-text pairs through contrastive pretraining. The release of extensive English image-text datasets (e.g., COYO-700M and LAION-400M) has enabled widespread adoption of models such as CLIP and SigLIP in tasks including cross-modal retrieval and image captioning. However, the advancement of Chinese vision-language pretraining has substantially lagged behind due to the scarcity of high-quality Chinese image-text data. To address this gap, we develop a comprehensive pipeline for constructing a high-quality Chinese cross-modal dataset. As a result, we propose DanQing, which contains 100 million image-text pairs collected from Common Crawl. Different from existing datasets, DanQing is curated through a more rigorous selection process, yielding superior data quality. Moreover, DanQing is primarily built from 2024-2025 web data, enabling models to better capture evolving semantic trends and thus offering greater practical utility. We compare DanQing with existing datasets by continual pre-training of the SigLIP2 model. Experimental results show that DanQing consistently achieves superior performance across a range of Chinese downstream tasks, including zero-shot classification, cross-modal retrieval, and LMM-based evaluations. To facilitate further research in Chinese vision-language pre-training, we will open-source the DanQing dataset under the Creative Commons CC-BY 4.0 license.
https://arxiv.org/abs/2601.10305
In the information and communications technology (ICT) industry, training a domain-specific large language model (LLM) or constructing a retrieval-augmented generation system requires a substantial amount of high-value domain knowledge. However, this knowledge is hidden not only in the textual modality but also in the image modality. Traditional methods can parse text from domain documents but lack image-captioning capability. Multi-modal LLMs (MLLMs) can understand images, but they do not have sufficient domain knowledge. To address the above issues, this paper proposes a multi-stage progressive training strategy to train a Domain-specific Image Captioning Model (DICModel) for ICT, and constructs a standard evaluation system to validate the performance of DICModel. Specifically, this work first synthesizes about 7K image-text pairs by combining the Mermaid tool and LLMs, which are used for the first-stage supervised fine-tuning (SFT) of DICModel. Then, ICT-domain experts manually annotate about 2K image-text pairs for the second-stage SFT of DICModel. Finally, experts and LLMs jointly synthesize about 1.5K visual question answering examples for the instruction-based SFT. Experimental results indicate that our DICModel, with only 7B parameters, performs better than other state-of-the-art models with 32B parameters. Compared to the SOTA models with 7B and 32B parameters, our DICModel increases the BLEU metric by approximately 56.8% and 20.8%, respectively. On the objective questions constructed by ICT domain experts, our DICModel outperforms Qwen2.5-VL 32B by 1% in accuracy. In summary, this work can efficiently and accurately extract the logical text from images, which is expected to promote the development of multimodal models in the ICT domain.
https://arxiv.org/abs/2601.09298
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque reasoning process provides no reliable basis for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce M3CoTBench, a new benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at this https URL.
https://arxiv.org/abs/2601.08758
We introduce the Ministral 3 series, a family of parameter-efficient dense language models designed for compute- and memory-constrained applications, available in three model sizes: 3B, 8B, and 14B parameters. For each model size, we release three variants: a pretrained base model for general-purpose use, an instruction-finetuned model, and a reasoning model for complex problem-solving. In addition, we present our recipe for deriving the Ministral 3 models through Cascade Distillation, a technique that combines iterative pruning with continued training under distillation. Each model comes with image understanding capabilities, all under the Apache 2.0 license.
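The abstract names the two ingredients of Cascade Distillation: iterative pruning and continued training with distillation. A minimal sketch of each ingredient (magnitude pruning and a temperature-softened KL distillation loss) is given below; the actual recipe, schedules, and losses are not specified in the abstract, so treat this as illustrative.

```python
import numpy as np

def magnitude_prune(weights, keep_ratio):
    """Zero the smallest-magnitude weights, keeping a fraction keep_ratio."""
    flat = np.abs(weights).ravel()
    k = int(len(flat) * keep_ratio)
    thresh = np.sort(flat)[-k] if k > 0 else np.inf
    return np.where(np.abs(weights) >= thresh, weights, 0.0)

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    def softmax(z):
        z = z / temperature
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p, q = softmax(teacher_logits), softmax(student_logits)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```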
https://arxiv.org/abs/2601.08584
With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing via natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by the VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation in both the latent and RGB spaces, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute UM-DATA-200K, a large-scale visual text image dataset covering diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.
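The regional consistency loss is described only at a high level. As a rough illustration, a region-masked reconstruction penalty could look like the sketch below; the mask semantics and the L1 form are assumptions for exposition, not the paper's exact definition.

```python
def regional_l1_loss(pred, target, mask):
    """Mean absolute error restricted to the glyph region.

    pred, target: 2D grids of pixel (or latent) values as nested lists.
    mask: 2D grid of 0/1 flags marking the text region to supervise.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += abs(p - t)
                count += 1
    return total / count if count else 0.0

pred   = [[0.0, 0.5], [1.0, 0.2]]
target = [[0.0, 1.0], [1.0, 0.0]]
mask   = [[0,   1  ], [0,   1  ]]
print(regional_l1_loss(pred, target, mask))  # 0.35 — only masked cells count
```

Applying the same masked penalty in both the latent and RGB spaces, as the abstract describes, would simply mean evaluating this loss on both representations and summing.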
https://arxiv.org/abs/2601.08321
Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +23.6 percentage points on ScienceQA and +8.1 percentage points on EgoSchema.
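At its simplest, the aggregation step resembles self-consistency voting over sampled trajectories, with a verification filter applied first. The sketch below uses a dummy verifier and invented trajectory records; it is a schematic of the idea, not the CASHEW pipeline.

```python
from collections import Counter

def aggregate(trajectories, verify):
    """Filter candidate reasoning trajectories with a (visual)
    verifier, then majority-vote over the surviving final answers."""
    survivors = [t for t in trajectories if verify(t)]
    if not survivors:
        return None
    votes = Counter(t["answer"] for t in survivors)
    return votes.most_common(1)[0][0]

# Toy trajectories: `grounded` stands in for visual verification.
trajs = [
    {"answer": "B", "grounded": True},
    {"answer": "A", "grounded": True},
    {"answer": "B", "grounded": True},
    {"answer": "C", "grounded": False},  # hallucinated step: filtered out
]
print(aggregate(trajs, lambda t: t["grounded"]))  # B
```

CASHEW goes further than plain voting — it iteratively merges trajectories into higher-quality traces — and CASHEW-RL trains the model to perform this aggregation internally, but the filter-then-combine structure is the common core.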
https://arxiv.org/abs/2601.08010
6D object pose estimation plays a crucial role in scene understanding for applications such as robotics and augmented reality. To support the ever-changing object sets in such contexts, modern zero-shot object pose estimators were developed that require no object-specific training and rely only on CAD models. Such models are hard to obtain once a system is deployed, and a continuously changing and growing set of objects makes it harder to reliably identify the instance model of interest. To address this challenge, we introduce Open-Set CAD Retrieval from a Language Prompt and a Single Image (OSCAR), a novel training-free method that retrieves a matching object model from an unlabeled 3D object database. During onboarding, OSCAR generates multi-view renderings of database models and annotates them with descriptive captions using an image captioning model. At inference, GroundedSAM detects the queried object in the input image, and multi-modal embeddings are computed for both the region of interest and the database captions. OSCAR employs a two-stage retrieval: text-based filtering using CLIP identifies candidate models, followed by image-based refinement using DINOv2 to select the most visually similar object. In our experiments, we demonstrate that OSCAR outperforms all state-of-the-art methods on the cross-domain 3D model retrieval benchmark MI3DOR. Furthermore, we demonstrate OSCAR's direct applicability to automating object model sourcing for 6D object pose estimation. We propose using the most similar object model for pose estimation when the exact instance is not available and show that OSCAR achieves an average precision of 90.48% in object retrieval on the YCB-V object dataset. Moreover, we demonstrate that the most similar object model can be used for pose estimation with Megapose, achieving better results than a reconstruction-based approach.
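The two-stage retrieval can be sketched with cosine similarity over placeholder embeddings. The vectors, database entries, and top-k cutoff below are invented for illustration; in OSCAR the text stage uses CLIP embeddings and the image stage uses DINOv2 features.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_text_emb, query_image_emb, database, k=2):
    """Stage 1: keep the k models whose caption embedding is closest to
    the text query. Stage 2: among those, return the model whose image
    embedding best matches the detected region of interest."""
    by_text = sorted(database,
                     key=lambda m: cosine(query_text_emb, m["caption_emb"]),
                     reverse=True)[:k]
    best = max(by_text, key=lambda m: cosine(query_image_emb, m["image_emb"]))
    return best["name"]

# Toy 2-D "embeddings" for three database models:
db = [
    {"name": "mug",     "caption_emb": [0.9, 0.1], "image_emb": [0.2, 0.8]},
    {"name": "pitcher", "caption_emb": [0.8, 0.3], "image_emb": [0.9, 0.1]},
    {"name": "drill",   "caption_emb": [0.1, 0.9], "image_emb": [0.5, 0.5]},
]
print(retrieve([1.0, 0.2], [1.0, 0.0], db))  # pitcher
```

The text stage cheaply prunes the open-set database to plausible candidates; the image stage then resolves the fine-grained visual match, which is why the two embeddings come from different encoders.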
https://arxiv.org/abs/2601.07333