We address the problem of video question answering (video QA) with temporal grounding in a weakly supervised setup, without any temporal annotations. Given a video and a question, we generate an open-ended answer grounded with a start and end time. For this task, we propose TOGA: a vision-language model for Temporally Grounded Open-Ended Video QA with Weak Supervision. We instruct-tune TOGA to jointly generate the answer and its temporal grounding. Since temporal grounding annotations are not available in this weakly supervised setup, we generate pseudo labels for temporal grounding and validate them by imposing a consistency constraint between a grounded response and the response generated for a question referring to the same temporal segment. We find that jointly generating answers and grounding improves performance on both question answering and grounding. We evaluate TOGA on grounded QA and open-ended QA tasks. For grounded QA, we consider the NExT-GQA benchmark, which is designed to evaluate weakly supervised grounded question answering. For open-ended QA, we consider the MSVD-QA and ActivityNet-QA benchmarks. We achieve state-of-the-art performance on both tasks across these benchmarks.
https://arxiv.org/abs/2506.09445
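The consistency idea above lends itself to a simple filtering step. Below is a minimal sketch, assuming a hypothetical `qa_model(video, question, segment)` callable and a naive string-match agreement test (neither is TOGA's actual interface), of keeping a pseudo temporal label only when re-asking the question over the candidate segment reproduces the original answer.

```python
from typing import Callable, List, Tuple

def filter_pseudo_segments(
    qa_model: Callable[[str, str, Tuple[float, float]], str],
    video: str,
    question: str,
    answer: str,
    candidates: List[Tuple[float, float]],
) -> List[Tuple[float, float]]:
    """Keep a candidate (start, end) pseudo label only if answering the same
    question restricted to that segment reproduces the original answer."""
    kept = []
    for seg in candidates:
        if qa_model(video, question, seg).strip().lower() == answer.strip().lower():
            kept.append(seg)
    return kept

# Toy usage with a stub QA model that always answers "a dog".
stub = lambda video, q, seg: "a dog"
print(filter_pseudo_segments(stub, "vid.mp4", "What jumps over the fence?",
                             "A dog", [(3.0, 7.5), (10.0, 12.0)]))
```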
Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology (CPath) for generating clinically meaningful slide-level embeddings from gigapixel tissue images. However, MIL often struggles with small, weakly supervised clinical datasets. In contrast to fields such as NLP and conventional computer vision, where transfer learning is widely used to address data scarcity, the transferability of MIL models remains poorly understood. In this study, we systematically evaluate the transfer learning capabilities of pretrained MIL models by assessing 11 models across 21 pretraining tasks for morphological and molecular subtype prediction. Our results show that pretrained MIL models, even when trained on organs different from the target task, consistently outperform models trained from scratch. Moreover, pretraining on pancancer datasets enables strong generalization across organs and tasks, outperforming slide foundation models while using substantially less pretraining data. These findings highlight the robust adaptability of MIL models and demonstrate the benefits of leveraging transfer learning to boost performance in CPath. Lastly, we provide a resource that standardizes the implementation of MIL models and collects pretrained model weights for popular CPath tasks, available at this https URL.
https://arxiv.org/abs/2506.09022
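For readers unfamiliar with MIL heads, the sketch below shows a standard gated-attention MIL pooling module in PyTorch and how pretrained weights might be reused for a new task; the layer sizes and the (commented-out) checkpoint name are illustrative assumptions, not the paper's released models.

```python
import torch
import torch.nn as nn

class GatedAttentionMIL(nn.Module):
    """Gated attention-based MIL: patch embeddings -> slide embedding -> logits."""
    def __init__(self, in_dim=1024, hid_dim=256, n_classes=2):
        super().__init__()
        self.attn_v = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hid_dim, 1)
        self.classifier = nn.Linear(in_dim, n_classes)

    def forward(self, bag):                                    # bag: (n_patches, in_dim)
        a = torch.softmax(self.attn_w(self.attn_v(bag) * self.attn_u(bag)), dim=0)
        slide_emb = (a * bag).sum(dim=0)                       # (in_dim,)
        return self.classifier(slide_emb), a

# Transfer: start from a (hypothetical) pancancer-pretrained checkpoint, then
# replace only the classifier head for the new target task.
model = GatedAttentionMIL()
# model.load_state_dict(torch.load("pancancer_pretrained.pt"))  # hypothetical file
model.classifier = nn.Linear(1024, 4)                           # e.g., 4 target subtypes
logits, attn = model(torch.randn(500, 1024))
```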
Accurate lesion segmentation in histopathology images is essential for diagnostic interpretation and quantitative analysis, yet it remains challenging due to the limited availability of costly pixel-level annotations. To address this, we propose FMaMIL, a novel two-stage framework for weakly supervised lesion segmentation based solely on image-level labels. In the first stage, a lightweight Mamba-based encoder is introduced to capture long-range dependencies across image patches under the MIL paradigm. To enhance spatial sensitivity and structural awareness, we design a learnable frequency-domain encoding module that supplements spatial-domain features with spectrum-based information. Class activation maps (CAMs) generated in this stage are used to guide segmentation training. In the second stage, we refine the initial pseudo labels via CAM-guided soft-label supervision and a self-correction mechanism, enabling robust training even under label noise. Extensive experiments on both public and private histopathology datasets demonstrate that FMaMIL outperforms state-of-the-art weakly supervised methods without relying on pixel-level annotations, validating its effectiveness and potential for digital pathology applications.
https://arxiv.org/abs/2506.07652
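A minimal sketch of CAM-guided soft-label supervision with a crude self-correction term, assuming the CAM is normalized to [0, 1]; the confidence threshold and down-weighting rule are illustrative, not FMaMIL's actual formulation.

```python
import torch

def cam_soft_label_loss(logits, cam, conf_thresh=0.8):
    """logits: (B, 2, H, W) segmentation scores; cam: (B, H, W) in [0, 1].

    The CAM acts as a soft foreground label; pixels where the model is already
    confidently foreground while the CAM says background are down-weighted,
    a crude stand-in for self-correction under label noise."""
    log_p = torch.log_softmax(logits, dim=1)
    soft_target = torch.stack([1.0 - cam, cam], dim=1)          # (B, 2, H, W)
    pixel_ce = -(soft_target * log_p).sum(dim=1)                # (B, H, W)
    fg = torch.softmax(logits, dim=1)[:, 1].detach()
    weight = torch.ones_like(fg)
    weight[(fg > conf_thresh) & (cam < 0.5)] = 0.1              # trust the model here
    return (weight * pixel_ce).mean()

loss = cam_soft_label_loss(torch.randn(2, 2, 64, 64), torch.rand(2, 64, 64))
```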
A central challenge in modern language models (LMs) is intrinsic hallucination: the generation of information that is plausible but unsubstantiated relative to input context. To study this problem, we propose Precise Information Control (PIC), a new task formulation that requires models to generate long-form outputs grounded in a provided set of short self-contained statements, known as verifiable claims, without adding any unsupported ones. For comprehensiveness, PIC includes a full setting that tests a model's ability to include exactly all input claims, and a partial setting that requires the model to selectively incorporate only relevant claims. We present PIC-Bench, a benchmark of eight long-form generation tasks (e.g., summarization, biography generation) adapted to the PIC setting, where LMs are supplied with well-formed, verifiable input claims. Our evaluation of a range of open and proprietary LMs on PIC-Bench reveals that, surprisingly, state-of-the-art LMs still intrinsically hallucinate in over 70% of outputs. To alleviate this lack of faithfulness, we introduce a post-training framework, using a weakly supervised preference data construction method, to train an 8B PIC-LM with stronger PIC ability--improving from 69.1% to 91.0% F1 in the full PIC setting. When integrated into end-to-end factual generation pipelines, PIC-LM improves exact match recall by 17.1% on ambiguous QA with retrieval, and factual precision by 30.5% on a birthplace verification task, underscoring the potential of precisely grounded generation.
https://arxiv.org/abs/2506.06589
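A toy version of a claim-level F1 in the spirit of the full PIC setting: precision counts output claims grounded in the input set, recall counts input claims that are covered. The `supports` judge (in practice an NLI model or LLM judge) is an assumed component, and the exact metric used in PIC-Bench may differ.

```python
def pic_f1(input_claims, output_claims, supports):
    """supports(o, i) -> True if output claim o is grounded in input claim i.
    Precision: fraction of output claims grounded in some input claim.
    Recall:    fraction of input claims covered by some output claim."""
    grounded = [o for o in output_claims if any(supports(o, i) for i in input_claims)]
    covered = [i for i in input_claims if any(supports(o, i) for o in output_claims)]
    p = len(grounded) / max(len(output_claims), 1)
    r = len(covered) / max(len(input_claims), 1)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Toy judge: exact string match.
print(pic_f1(["born in 1972", "won award X"], ["born in 1972"], lambda o, i: o == i))
```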
To what extent do vision-and-language foundation models possess a realistic world model (observation $\times$ action $\rightarrow$ observation) and a dynamics model (observation $\times$ observation $\rightarrow$ action), when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference time verification. Firstly, the dynamics model can annotate actions for unlabelled pairs of video frame observations to expand the training data. We further propose a new objective, where image tokens in observation pairs are weighted by their importance, as predicted by a recognition model. Secondly, the dynamics models can assign rewards to multiple samples of the world model to score them, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on Aurora-Bench. Our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin of $15\%$ on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.
https://arxiv.org/abs/2506.06006
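The inference-time verification strategy amounts to best-of-N reranking. A minimal sketch, with the dynamics model abstracted as a `dynamics_score(source, candidate, instruction)` callable (an assumption about the interface, not the paper's API):

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")

def verify_best_of_n(
    candidates: List[Image],                  # N samples from the world model
    source: Image,
    instruction: str,
    dynamics_score: Callable[[Image, Image, str], float],
) -> Image:
    """Pick the world-model sample whose (source, candidate) pair the dynamics
    model scores as most consistent with the language instruction."""
    scores = [dynamics_score(source, c, instruction) for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]

# Toy usage with integer "images" and a dummy scorer.
best = verify_best_of_n([1, 2, 3], 0, "add three", lambda s, c, t: -abs((c - s) - 3))
print(best)  # 3
```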
3D human generation is an important problem with a wide range of applications in computer vision and graphics. Despite recent progress in generative AI such as diffusion models, and in rendering methods like Neural Radiance Fields and Gaussian Splatting, controlling the generation of accurate 3D humans from text prompts remains an open challenge. Current methods struggle with fine detail, accurate rendering of hands and faces, human realism, and controllability over appearance. The lack of diversity, realism, and annotation in human image data also remains a challenge, hindering the development of a foundational 3D human model. We present a weakly supervised pipeline that addresses these challenges. In the first step, we generate a photorealistic human image dataset with controllable attributes such as appearance, race, and gender using a state-of-the-art image diffusion model. Next, we propose an efficient mapping approach from image features to 3D point clouds using a transformer-based architecture. Finally, we close the loop by training a point-cloud diffusion model conditioned on the same text prompts used to generate the original samples. We demonstrate orders-of-magnitude speed-ups in 3D human generation compared to state-of-the-art approaches, along with significantly improved text-prompt alignment, realism, and rendering quality. We will make the code and dataset available.
https://arxiv.org/abs/2506.04351
Recent advancements in Large Language Models (LLMs) have demonstrated that Process Reward Models (PRMs) play a crucial role in enhancing model performance. However, training PRMs typically requires step-level labels, either manually annotated or automatically generated, which can be costly and difficult to obtain at scale. To address this challenge, we introduce FreePRM, a weakly supervised framework for training PRMs without access to ground-truth step-level labels. FreePRM first generates pseudo step-level labels based on the correctness of the final outcome, and then employs Buffer Probability to eliminate the impact of the noise inherent in pseudo labeling. Experimental results show that FreePRM achieves an average F1 score of 53.0% on ProcessBench, outperforming a fully supervised PRM trained on Math-Shepherd by +24.1%. Compared to other open-source PRMs, FreePRM outperforms RLHFlow-PRM-Mistral-8B (28.4%) by +24.6%, EurusPRM (31.3%) by +21.7%, and Skywork-PRM-7B (42.1%) by +10.9%. This work introduces a new paradigm in PRM training, significantly reducing reliance on costly step-level annotations while maintaining strong performance.
https://arxiv.org/abs/2506.03570
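A sketch of one plausible reading of outcome-derived pseudo step labels with a buffer slot: every step inherits the final outcome's label, while a buffer probability absorbs mass to soften the inevitable noise. The three-way target and the value of `buffer_p` are assumptions, not FreePRM's exact construction.

```python
import torch

def pseudo_step_targets(n_steps: int, outcome_correct: bool, buffer_p: float = 0.2):
    """Soft targets over {wrong, correct, buffer} for every step of a solution:
    steps inherit the final-outcome label, and the buffer slot absorbs
    probability mass to account for the noise of this heuristic."""
    if outcome_correct:
        step = torch.tensor([0.0, 1.0 - buffer_p, buffer_p])
    else:
        step = torch.tensor([1.0 - buffer_p, 0.0, buffer_p])
    return step.repeat(n_steps, 1)                               # (n_steps, 3)

def soft_cross_entropy(logits, targets):
    return -(targets * torch.log_softmax(logits, dim=-1)).sum(-1).mean()

targets = pseudo_step_targets(n_steps=4, outcome_correct=True)
loss = soft_cross_entropy(torch.randn(4, 3), targets)            # step-level PRM logits
```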
In this work, we investigate the Meta PL unsupervised domain adaptation framework for Automatic Speech Recognition (ASR). We introduce a Multi-Stage Domain Adaptation pipeline (MSDA), a sample-efficient, two-stage adaptation approach that integrates self-supervised learning with semi-supervised techniques. MSDA is designed to enhance the robustness and generalization of ASR models, making them more adaptable to diverse conditions. It is particularly effective for low-resource languages like Greek and in weakly supervised scenarios where labeled data is scarce or noisy. Through extensive experiments, we demonstrate that Meta PL can be applied effectively to ASR tasks, achieving state-of-the-art results that significantly outperform existing methods and providing more robust solutions for unsupervised domain adaptation in ASR. Our ablations highlight the necessity of using a cascading approach when combining self-supervision with self-training.
https://arxiv.org/abs/2505.24656
In this work, we focus on the task of weakly supervised affordance grounding, where a model is trained to identify affordance regions on objects using human-object interaction images and egocentric object images without dense labels. Previous works are mostly built upon class activation maps, which are effective for semantic segmentation but may not be suitable for locating actions and functions. Leveraging recent advanced foundation models, we develop a supervised training pipeline based on pseudo labels. The pseudo labels are generated by an off-the-shelf part segmentation model, guided by a mapping from affordances to part names. Furthermore, we introduce three key enhancements to the baseline model: a label refining stage, a fine-grained feature alignment process, and a lightweight reasoning module. These techniques harness the semantic knowledge of static objects embedded in off-the-shelf foundation models to improve affordance learning, effectively bridging the gap between objects and actions. Extensive experiments demonstrate that the proposed model achieves a substantial improvement over existing methods. Our code is available at this https URL.
https://arxiv.org/abs/2505.24103
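A toy illustration of affordance-to-part pseudo labeling: map an affordance query to part names, then take the union of the corresponding masks from an off-the-shelf part segmenter. The mapping dictionary and the mask format are hypothetical, not the paper's actual assets.

```python
import numpy as np

# Hypothetical affordance -> part-name mapping used to query a part segmenter.
AFFORDANCE_TO_PARTS = {
    "hold": ["handle", "grip"],
    "cut": ["blade"],
    "sit": ["seat"],
}

def affordance_pseudo_mask(part_masks: dict, affordance: str) -> np.ndarray:
    """part_masks: {part_name: HxW binary mask} from an off-the-shelf model.
    The pseudo label for an affordance is the union of its mapped parts."""
    parts = AFFORDANCE_TO_PARTS.get(affordance, [])
    masks = [part_masks[p] for p in parts if p in part_masks]
    if not masks:
        return np.zeros(next(iter(part_masks.values())).shape, dtype=bool)
    return np.logical_or.reduce(masks)

parts = {"handle": np.eye(4, dtype=bool), "blade": ~np.eye(4, dtype=bool)}
print(affordance_pseudo_mask(parts, "hold").astype(int))
```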
Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.
https://arxiv.org/abs/2505.23524
Whole-slide images (WSIs) are critical for cancer diagnosis due to their ultra-high resolution and rich semantic content. However, their massive size and the limited availability of fine-grained annotations pose substantial challenges for conventional supervised learning. We propose DSAGL (Dual-Stream Attention-Guided Learning), a novel weakly supervised classification framework that combines a teacher-student architecture with a dual-stream design. DSAGL explicitly addresses instance-level ambiguity and bag-level semantic consistency by generating multi-scale attention-based pseudo labels and guiding instance-level learning. A shared lightweight encoder (VSSMamba) enables efficient long-range dependency modeling, while a fusion-attentive module (FASA) enhances focus on sparse but diagnostically relevant regions. We further introduce a hybrid loss to enforce mutual consistency between the two streams. Experiments on CIFAR-10, NCT-CRC, and TCGA-Lung datasets demonstrate that DSAGL consistently outperforms state-of-the-art MIL baselines, achieving superior discriminative performance and robustness under weak supervision.
https://arxiv.org/abs/2505.23341
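A minimal sketch of a hybrid loss that couples bag-level supervision with cross-stream consistency, here a symmetric KL between the two streams' instance-attention distributions; the weighting and exact terms are assumptions rather than DSAGL's published objective.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(bag_logits, attn_student, attn_teacher, bag_label, alpha=0.5):
    """Bag-level cross-entropy on the student stream plus a symmetric KL that
    keeps the two streams' instance-attention distributions consistent."""
    ce = F.cross_entropy(bag_logits, bag_label)
    p_s = torch.softmax(attn_student, dim=-1)
    p_t = torch.softmax(attn_teacher, dim=-1)
    consistency = 0.5 * (F.kl_div(p_s.log(), p_t, reduction="batchmean")
                         + F.kl_div(p_t.log(), p_s, reduction="batchmean"))
    return ce + alpha * consistency

loss = hybrid_loss(torch.randn(1, 2),                 # student bag logits
                   torch.randn(1, 100),               # student instance attention
                   torch.randn(1, 100),               # teacher instance attention
                   torch.tensor([1]))
```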
Weakly supervised semantic segmentation (WSSS) in medical imaging struggles to make effective use of sparse annotations. One promising direction for WSSS leverages gaze annotations, captured via eye trackers that record regions of interest during diagnostic procedures. However, existing gaze-based methods, such as GazeMedSeg, do not fully exploit the rich information embedded in gaze data. In this paper, we propose GradTrack, a framework that utilizes physicians' gaze tracks, including fixation points, durations, and temporal order, to enhance WSSS performance. GradTrack comprises two key components: Gaze Track Map Generation and Track Attention, which collaboratively enable progressive feature refinement through multi-level gaze supervision during the decoding process. Experiments on the Kvasir-SEG and NCI-ISBI datasets demonstrate that GradTrack consistently outperforms existing gaze-based methods, achieving Dice score improvements of 3.21% and 2.61%, respectively. Moreover, GradTrack significantly narrows the performance gap with fully supervised models such as nnUNet.
https://arxiv.org/abs/2505.22230
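A toy gaze-track map: each fixation contributes a Gaussian bump weighted by its duration and an order-dependent decay. The kernel width and decay schedule are illustrative assumptions; GradTrack's actual map generation may differ.

```python
import numpy as np

def gaze_track_map(fixations, durations, shape, sigma=10.0, order_decay=0.9):
    """fixations: [(y, x), ...] in temporal order; durations: seconds per fixation.
    Later fixations are decayed here purely for illustration."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    heat = np.zeros(shape, dtype=np.float32)
    for k, ((fy, fx), dur) in enumerate(zip(fixations, durations)):
        bump = np.exp(-((yy - fy) ** 2 + (xx - fx) ** 2) / (2 * sigma ** 2))
        heat += (order_decay ** k) * dur * bump
    return heat / (heat.max() + 1e-8)

m = gaze_track_map([(32, 32), (50, 10)], [0.8, 0.3], shape=(64, 64))
print(m.shape, float(m.max()))
```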
Despite remarkable achievements, automatic speech recognition (ASR) in low-resource scenarios still faces two challenges: high-quality data scarcity and high computational demands. This paper proposes EThai-ASR, the first work to apply large language models (LLMs) to Thai ASR and build an efficient LLM-based ASR system. EThai-ASR comprises a speech encoder, a connection module, and a Thai LLM decoder. To address the data scarcity, EThai-ASR introduces a self-evolving data refinement strategy that refines weak labels, yielding an enhanced speech encoder. Moreover, we propose a pluggable sequence compression module, used in the connection module, with three modes designed to reduce the sequence length, thus decreasing computational demands while maintaining decent performance. Extensive experiments demonstrate that EThai-ASR achieves state-of-the-art accuracy on multiple datasets. We release our refined text transcripts to promote further research.
https://arxiv.org/abs/2505.22063
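The abstract does not spell out the three compression modes, so the sketch below shows just one plausible pluggable mode: strided average pooling that shortens the speech-encoder output before it reaches the LLM.

```python
import torch
import torch.nn as nn

class StridePoolCompressor(nn.Module):
    """One plausible 'mode' for a pluggable sequence compressor: average
    consecutive speech-encoder frames to shorten the sequence fed to the LLM."""
    def __init__(self, stride: int = 4):
        super().__init__()
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)

    def forward(self, feats):                 # feats: (B, T, D)
        return self.pool(feats.transpose(1, 2)).transpose(1, 2)  # (B, T // stride, D)

x = torch.randn(2, 400, 768)                    # 400 encoder frames
print(StridePoolCompressor(stride=4)(x).shape)  # torch.Size([2, 100, 768])
```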
We present OvSGTR, a novel transformer-based framework for fully open-vocabulary scene graph generation that overcomes the limitations of traditional closed-set models. Conventional methods restrict both object and relationship recognition to a fixed vocabulary, hindering their applicability to real-world scenarios where novel concepts frequently emerge. In contrast, our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories. OvSGTR leverages a DETR-like architecture featuring a frozen image backbone and text encoder to extract high-quality visual and semantic features, which are then fused via a transformer decoder for end-to-end scene graph prediction. To enrich the model's understanding of complex visual relations, we propose a relation-aware pre-training strategy that synthesizes scene graph annotations in a weakly supervised manner. Specifically, we investigate three pipelines--scene parser-based, LLM-based, and multimodal LLM-based--to generate transferable supervision signals with minimal manual annotation. Furthermore, we address the common issue of catastrophic forgetting in open-vocabulary settings by incorporating a visual-concept retention mechanism coupled with a knowledge distillation strategy, ensuring that the model retains rich semantic cues during fine-tuning. Extensive experiments on the VG150 benchmark demonstrate that OvSGTR achieves state-of-the-art performance across multiple settings, including closed-set, open-vocabulary object detection-based, relation-based, and fully open-vocabulary scenarios. Our results highlight the promise of large-scale relation-aware pre-training and transformer architectures for advancing scene graph generation towards more generalized and reliable visual understanding.
https://arxiv.org/abs/2505.20106
Modality fusion is a cornerstone of multimodal learning, enabling information integration from diverse data sources. However, vanilla fusion methods are limited by (1) an inability to account for heterogeneous interactions between modalities and (2) a lack of interpretability in uncovering the multimodal interactions inherent in the data. To this end, we propose I2MoE (Interpretable Multimodal Interaction-aware Mixture of Experts), an end-to-end MoE framework designed to enhance modality fusion by explicitly modeling diverse multimodal interactions, as well as providing interpretation at a local and global level. First, I2MoE utilizes different interaction experts with weakly supervised interaction losses to learn multimodal interactions in a data-driven way. Second, I2MoE deploys a reweighting model that assigns importance scores to the output of each interaction expert, which offers sample-level and dataset-level interpretation. Extensive evaluation on medical and general multimodal datasets shows that I2MoE is flexible enough to be combined with different fusion techniques, consistently improves task performance, and provides interpretation across various real-world scenarios. Code is available at this https URL.
https://arxiv.org/abs/2505.19190
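A minimal sketch of the expert-plus-reweighting pattern: several interaction experts score the fused input, and a gating head produces per-sample importance weights that double as an interpretation signal. Dimensions, expert definitions, and the gating form are assumptions, not I2MoE's actual architecture.

```python
import torch
import torch.nn as nn

class TinyInteractionMoE(nn.Module):
    """Each 'interaction expert' sees the concatenated modalities; a reweighting
    head turns expert outputs into importance scores that double as a
    sample-level explanation of which interaction mattered."""
    def __init__(self, dim_a=32, dim_b=32, hidden=64, n_experts=3, n_classes=2):
        super().__init__()
        d = dim_a + dim_b
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
             for _ in range(n_experts)])
        self.reweight = nn.Linear(d, n_experts)

    def forward(self, xa, xb):
        x = torch.cat([xa, xb], dim=-1)
        weights = torch.softmax(self.reweight(x), dim=-1)          # (B, n_experts)
        expert_out = torch.stack([e(x) for e in self.experts], 1)  # (B, n_experts, C)
        logits = (weights.unsqueeze(-1) * expert_out).sum(dim=1)
        return logits, weights                                     # weights = interpretation

model = TinyInteractionMoE()
logits, weights = model(torch.randn(4, 32), torch.randn(4, 32))
```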
Weakly supervised referring expression comprehension (WREC) and segmentation (WRES) aim to learn object grounding based on a given expression using weak supervision signals like image-text pairs. While these tasks have traditionally been modeled separately, we argue that they can benefit from joint learning in a multi-task framework. To this end, we propose WeakMCN, a novel multi-task collaborative network that effectively combines WREC and WRES with a dual-branch architecture. Specifically, the WREC branch is formulated as anchor-based contrastive learning, which also acts as a teacher to supervise the WRES branch. In WeakMCN, we propose two innovative designs to facilitate multi-task collaboration, namely Dynamic Visual Feature Enhancement (DVFE) and Collaborative Consistency Module (CCM). DVFE dynamically combines various pre-trained visual knowledge to meet different task requirements, while CCM promotes cross-task consistency from the perspective of optimization. Extensive experimental results on three popular REC and RES benchmarks, i.e., RefCOCO, RefCOCO+, and RefCOCOg, consistently demonstrate performance gains of WeakMCN over state-of-the-art single-task alternatives, e.g., up to 3.91% and 13.11% on RefCOCO for WREC and WRES tasks, respectively. Furthermore, experiments also validate the strong generalization ability of WeakMCN in both semi-supervised REC and RES settings against existing methods, e.g., +8.94% for semi-REC and +7.71% for semi-RES on 1% RefCOCO. The code is publicly available at this https URL.
https://arxiv.org/abs/2505.18686
Vision-language models (VLMs) have recently been integrated into multiple instance learning (MIL) frameworks to address the challenge of few-shot, weakly supervised classification of whole slide images (WSIs). A key trend involves leveraging multi-scale information to better represent hierarchical tissue structures. However, existing methods often face two key limitations: (1) insufficient modeling of interactions within the same modalities across scales (e.g., 5x and 20x) and (2) inadequate alignment between visual and textual modalities on the same scale. To address these gaps, we propose HiVE-MIL, a hierarchical vision-language framework that constructs a unified graph consisting of (1) parent-child links between coarse (5x) and fine (20x) visual/textual nodes to capture hierarchical relationships, and (2) heterogeneous intra-scale edges linking visual and textual nodes on the same scale. To further enhance semantic consistency, HiVE-MIL incorporates a two-stage, text-guided dynamic filtering mechanism that removes weakly correlated patch-text pairs, and introduces a hierarchical contrastive loss to align textual semantics across scales. Extensive experiments on TCGA breast, lung, and kidney cancer datasets demonstrate that HiVE-MIL consistently outperforms both traditional MIL and recent VLM-based MIL approaches, achieving gains of up to 4.1% in macro F1 under 16-shot settings. Our results demonstrate the value of jointly modeling hierarchical structure and multimodal alignment for efficient and scalable learning from limited pathology data. The code is available at this https URL
https://arxiv.org/abs/2505.17982
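A toy construction of the parent-child links between coarse (5x) and fine (20x) patches from grid coordinates alone; the patch size, scale factor, and coordinate convention are assumptions, not the paper's data format.

```python
import numpy as np

def parent_child_edges(coords_5x, coords_20x, scale=4, patch=256):
    """coords_*: (N, 2) top-left pixel coordinates at their own magnification.
    A 20x patch is linked to the 5x patch whose footprint contains its
    projected top-left corner in the 5x reference frame."""
    edges = []
    for j, (x20, y20) in enumerate(coords_20x):
        px, py = x20 // scale, y20 // scale          # project into the 5x frame
        for i, (x5, y5) in enumerate(coords_5x):
            if x5 <= px < x5 + patch and y5 <= py < y5 + patch:
                edges.append((i, j))                 # (parent_5x, child_20x)
                break
    return np.array(edges)

print(parent_child_edges(np.array([[0, 0], [256, 0]]),
                         np.array([[100, 100], [1100, 50]])))
```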
In real-world scenarios, pixel-level labeling is not always available. Sometimes we need a semantic segmentation network, or even a visual encoder, that is broadly compatible and can be trained with various types of feedback beyond traditional labels, such as feedback indicating the quality of the parsing results. To tackle this issue, we propose RSS (Reward in Semantic Segmentation), the first practical application of reward-based reinforcement learning to pure semantic segmentation, offered at two levels of granularity (pixel-level and image-level). RSS incorporates several novel techniques, such as progressive scale rewards (PSR) and pair-wise spatial difference (PSD), to ensure that the reward facilitates the convergence of the semantic segmentation network, especially under image-level rewards. Experiments and visualizations on benchmark datasets demonstrate that the proposed RSS can successfully ensure the convergence of the semantic segmentation network at both levels of reward. Additionally, RSS with the image-level reward outperforms existing weakly supervised methods that also rely solely on image-level signals during training.
https://arxiv.org/abs/2505.17905
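A generic REINFORCE-style update driven by an image-level reward, as a rough stand-in for the idea of training segmentation from reward feedback; RSS's progressive scale rewards and pair-wise spatial difference are not reproduced here.

```python
import torch

def image_level_reward_step(logits, reward_fn, optimizer):
    """logits: (B, C, H, W) from the segmentation network.
    Sample a discrete label map per image, score it with an image-level
    reward, and apply a REINFORCE-style policy-gradient update."""
    probs = torch.softmax(logits, dim=1)
    dist = torch.distributions.Categorical(probs.permute(0, 2, 3, 1))
    sample = dist.sample()                               # (B, H, W)
    log_prob = dist.log_prob(sample).flatten(1).mean(1)  # (B,)
    rewards = torch.tensor([reward_fn(s) for s in sample], dtype=torch.float32)
    loss = -(rewards * log_prob).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

net = torch.nn.Conv2d(3, 2, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.01)
x = torch.randn(2, 3, 32, 32)
# Toy image-level reward: fraction of pixels predicted as class 1.
image_level_reward_step(net(x), lambda s: s.float().mean().item(), opt)
```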
Bounding box supervision has gained considerable attention in weakly supervised 3D instance segmentation. While this approach alleviates the need for extensive point-level annotations, obtaining accurate bounding boxes in practical applications remains challenging. To this end, we explore the inaccurate bounding box, named the sketchy bounding box, which is imitated by perturbing the ground-truth bounding box with scaling, translation, and rotation. In this paper, we propose Sketchy-3DIS, a novel weakly supervised 3D instance segmentation framework that jointly learns a pseudo labeler and a segmentator to improve performance under sketchy bounding-box supervision. Specifically, we first propose an adaptive box-to-point pseudo labeler that adaptively learns to assign points located in the overlapping parts of two sketchy bounding boxes to the correct instance, resulting in compact and pure pseudo instance labels. Then, we present a coarse-to-fine instance segmentator that first predicts coarse instances from the entire point cloud and then learns fine instances based on the regions of the coarse instances. Finally, by using the pseudo instance labels to supervise the instance segmentator, we can gradually generate high-quality instances through joint training. Extensive experiments show that our method achieves state-of-the-art performance on both the ScanNetV2 and S3DIS benchmarks, and even outperforms several fully supervised methods while using only sketchy bounding boxes. Code is available at this https URL.
https://arxiv.org/abs/2505.16399
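The sketchy boxes themselves are easy to imitate. A small sketch, with illustrative jitter ranges, of perturbing a ground-truth 3D box by random scaling, translation, and rotation:

```python
import numpy as np

def make_sketchy_box(center, size, yaw, scale_jit=0.1, trans_jit=0.1,
                     rot_jit=np.pi / 18, rng=None):
    """Perturb a 3D box (center xyz, size whl, yaw) with random scaling,
    translation (relative to box size), and rotation to imitate an
    inaccurate, human-drawn annotation."""
    if rng is None:
        rng = np.random.default_rng(0)
    center = np.asarray(center, dtype=float)
    size = np.asarray(size, dtype=float)
    new_size = size * rng.uniform(1 - scale_jit, 1 + scale_jit, 3)
    new_center = center + size * rng.uniform(-trans_jit, trans_jit, 3)
    new_yaw = yaw + rng.uniform(-rot_jit, rot_jit)
    return new_center, new_size, new_yaw

print(make_sketchy_box(center=[1.0, 2.0, 0.5], size=[0.8, 0.6, 1.2], yaw=0.3))
```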
In recent years, weakly supervised object detection (WSOD) has attracted much attention due to its low labeling cost. The success of recent WSOD models is often ascribed to two multi-class classification (MCC) tasks arranged in two stages, i.e., multiple instance learning and online classification refinement. Despite achieving non-trivial progress, these methods overlook potential classification ambiguities between the two MCC tasks and fail to leverage their unique strengths. In this work, we introduce a novel WSOD framework to ameliorate these two issues. On the one hand, we propose a self-classification enhancement module that integrates intra-class binary classification (ICBC) to bridge the gap between the two distinct MCC tasks. The ICBC task enhances the network's discrimination between positive and mis-located samples in a class-wise manner and forges a mutually reinforcing relationship with the MCC task. On the other hand, we propose a self-classification correction algorithm used during inference, which combines the results of both MCC tasks to effectively reduce mis-classified predictions. Extensive experiments on the prevalent VOC 2007 and 2012 datasets demonstrate the superior performance of our framework.
https://arxiv.org/abs/2505.16294
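One simple way to combine the outputs of the two MCC tasks at inference is sketched below (geometric-mean fusion plus a penalty when the branches disagree); this is an illustrative guess, not the paper's actual correction algorithm.

```python
import numpy as np

def corrected_scores(mil_scores, refine_scores, disagree_penalty=0.5):
    """mil_scores, refine_scores: (n_proposals, n_classes) class probabilities
    from the MIL branch and the online refinement branch.

    Fuse by geometric mean, and down-weight proposals whose two branches
    disagree on the argmax class."""
    fused = np.sqrt(mil_scores * refine_scores)
    disagree = mil_scores.argmax(1) != refine_scores.argmax(1)
    fused[disagree] *= disagree_penalty
    return fused

mil = np.array([[0.7, 0.3], [0.2, 0.8]])
ref = np.array([[0.6, 0.4], [0.9, 0.1]])
print(corrected_scores(mil, ref))
```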