This paper presents IMAGGarment-1, a fine-grained garment generation (FGG) framework that enables high-fidelity garment synthesis with precise control over silhouette, color, and logo placement. Unlike existing methods that are limited to single-condition inputs, IMAGGarment-1 addresses the challenges of multi-conditional controllability in personalized fashion design and digital apparel applications. Specifically, IMAGGarment-1 employs a two-stage training strategy to separately model global appearance and local details, while enabling unified and controllable generation through end-to-end inference. In the first stage, we propose a global appearance model that jointly encodes silhouette and color using a mixed attention module and a color adapter. In the second stage, we present a local enhancement model with an adaptive appearance-aware module to inject user-defined logos and spatial constraints, enabling accurate placement and visual consistency. To support this task, we release GarmentBench, a large-scale dataset comprising over 180K garment samples paired with multi-level design conditions, including sketches, color references, logo placements, and textual prompts. Extensive experiments demonstrate that our method outperforms existing baselines, achieving superior structural stability, color fidelity, and local controllability. The code and model are available at this https URL.
https://arxiv.org/abs/2504.13176
Reward models (RMs) are essential for aligning Large Language Models (LLMs) with human preferences. However, they often struggle with capturing complex human preferences and generalizing to unseen data. To address these challenges, we introduce the Energy-Based Reward Model (EBRM), a lightweight post-hoc refinement framework that enhances RM robustness and generalization. EBRM models the reward distribution explicitly, capturing uncertainty in human preferences and mitigating the impact of noisy or misaligned annotations. It achieves this through conflict-aware data filtering, label-noise-aware contrastive training, and hybrid initialization. Notably, EBRM enhances RMs without retraining, making it computationally efficient and adaptable across different models and tasks. Empirical evaluations on RM benchmarks demonstrate significant improvements in both robustness and generalization, achieving up to a 5.97% improvement in safety-critical alignment tasks compared to standard RMs. Furthermore, reinforcement learning experiments confirm that our refined rewards enhance alignment quality, effectively delaying reward hacking. These results establish our approach as a scalable and effective enhancement for existing RMs and alignment pipelines. The code is available at EBRM.
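The abstract does not specify how the post-hoc refinement is implemented; the sketch below is a hypothetical minimal version, assuming the base RM exposes a hidden representation h and a scalar reward, and that an energy head E(h, r) is trained separately so that refinement becomes a grid search for the lowest-energy reward. The names EnergyHead and refine_reward are illustrative, not the authors' API.

```python
import torch
import torch.nn as nn

class EnergyHead(nn.Module):
    """Scores (hidden_state, candidate_reward) pairs; lower energy = more plausible."""
    def __init__(self, hidden_dim: int, width: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim + 1, width), nn.GELU(),
            nn.Linear(width, width), nn.GELU(),
            nn.Linear(width, 1),
        )

    def forward(self, h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        # h: [B, hidden_dim], r: [B, 1] candidate reward values
        return self.net(torch.cat([h, r], dim=-1)).squeeze(-1)  # [B]

@torch.no_grad()
def refine_reward(energy: EnergyHead, h: torch.Tensor, base_reward: torch.Tensor,
                  half_width: float = 2.0, steps: int = 41) -> torch.Tensor:
    """Post-hoc refinement: scan a grid of candidate rewards around the base RM
    output and return the candidate with the lowest energy (no RM retraining)."""
    offsets = torch.linspace(-half_width, half_width, steps, device=h.device)
    candidates = base_reward.unsqueeze(1) + offsets.unsqueeze(0)          # [B, steps]
    h_rep = h.unsqueeze(1).expand(-1, steps, -1).reshape(-1, h.size(-1))  # [B*steps, D]
    e = energy(h_rep, candidates.reshape(-1, 1)).view(-1, steps)          # [B, steps]
    best = e.argmin(dim=1)
    return candidates[torch.arange(candidates.size(0)), best]

# Usage with dummy tensors standing in for a real RM's outputs.
head = EnergyHead(hidden_dim=768)
h = torch.randn(4, 768)
base = torch.randn(4)
print(refine_reward(head, h, base))
```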
https://arxiv.org/abs/2504.13134
This paper presents a review of the NTIRE 2025 Challenge on Short-form UGC Video Quality Assessment and Enhancement. The challenge comprises two tracks: (i) Efficient Video Quality Assessment (KVQ), and (ii) Diffusion-based Image Super-Resolution (KwaiSR). Track 1 aims to advance the development of lightweight and efficient video quality assessment (VQA) models, with an emphasis on eliminating the reliance on model ensembles, redundant weights, and other computationally expensive components seen in previous IQA/VQA competitions. Track 2 introduces a new short-form UGC dataset tailored for single image super-resolution, i.e., the KwaiSR dataset. It consists of 1,800 synthetically generated S-UGC image pairs and 1,900 real-world S-UGC images, which are split into training, validation, and test sets using a ratio of 8:1:1. The primary objective of the challenge is to drive research that benefits the user experience of short-form UGC platforms such as Kwai and TikTok. This challenge attracted 266 participants and received 18 valid final submissions with corresponding fact sheets, significantly contributing to the progress of short-form UGC VQA and image super-resolution. The project is publicly available at this https URL ChallengeCVPR-NTIRE2025.
https://arxiv.org/abs/2504.13131
We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge, which is achieved by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on SciScore, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to human level, demonstrating a 5% improvement similar to evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.
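SciScore itself augments a pre-trained CLIP model; the snippet below only sketches the generic starting point, scoring an image-prompt pair with Hugging Face's stock CLIP and using it to pick the more plausible image of an adversarial pair. The checkpoint name and helper functions are assumptions, not the released SciScore model.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(prompt: str, image: Image.Image) -> float:
    """Image-text matching score from CLIP (higher = better match)."""
    inputs = processor(text=[prompt], images=[image], return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image.item()

def pick_realistic(prompt: str, image_a: Image.Image, image_b: Image.Image) -> str:
    """Given an adversarial pair, return which image the scorer prefers."""
    return "A" if clip_score(prompt, image_a) >= clip_score(prompt, image_b) else "B"
```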
https://arxiv.org/abs/2504.13129
In recent years, the field of vision-language model pre-training has experienced rapid advancements, driven primarily by the continuous enhancement of textual capabilities in large language models. However, existing training paradigms for multimodal large language models heavily rely on high-quality image-text pairs. As models and data scales grow exponentially, such meticulously curated data has become increasingly scarce and saturated, thereby severely limiting further advancements in this domain. This study investigates scalable caption generation techniques for vision-language model pre-training and demonstrates that large-scale, low-hallucination synthetic captions can serve dual purposes: 1) acting as a viable alternative to real-world data for pre-training paradigms and 2) achieving superior performance enhancement when integrated into vision-language models, as confirmed through empirical validation. This paper presents three key contributions: 1) a novel pipeline for generating high-quality, low-hallucination, and knowledge-rich synthetic captions. Our continuous DPO methodology yields remarkable results in reducing hallucinations. Specifically, the non-hallucination caption rate on a held-out test set increases from 48.2% to 77.9% for a 7B-size model. 2) Comprehensive empirical validation reveals that our synthetic captions confer superior pre-training advantages over their counterparts. Across 35 vision-language tasks, the model trained with our data achieves a significant performance gain of at least 6.2% compared to alt-text pairs and other previous work. Meanwhile, it also offers considerable support in the text-to-image domain. With our dataset, the FID score is reduced by 17.1 on a real-world validation benchmark and by 13.3 on the MSCOCO validation benchmark. 3) We will release Hunyuan-Recap100M, a low-hallucination and knowledge-intensive synthetic caption dataset.
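The paper's "continuous DPO" variant is not described in the abstract; as a reference point, a standard DPO objective over preferred (low-hallucination) and rejected (hallucinated) captions could look like the sketch below, taking summed caption log-probabilities as inputs.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard Direct Preference Optimization loss over summed caption log-probs.

    logp_*     : log-prob of the (low-/high-hallucination) caption under the policy captioner
    ref_logp_* : the same quantities under the frozen reference captioner
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy example: a batch of three preference pairs.
loss = dpo_loss(torch.tensor([-10.0, -12.0, -9.0]),
                torch.tensor([-11.0, -11.5, -13.0]),
                torch.tensor([-10.5, -12.2, -9.5]),
                torch.tensor([-10.8, -11.4, -12.5]))
print(loss)
```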
https://arxiv.org/abs/2504.13123
Recent advances in video generation have been driven by diffusion models and autoregressive frameworks, yet critical challenges persist in harmonizing prompt adherence, visual quality, motion dynamics, and duration: compromises in motion dynamics to enhance temporal visual quality, constrained video duration (5-10 seconds) to prioritize resolution, and inadequate shot-aware generation stemming from general-purpose MLLMs' inability to interpret cinematic grammar, such as shot composition, actor expressions, and camera motions. These intertwined limitations hinder realistic long-form synthesis and professional film-style generation. To address these limitations, we propose SkyReels-V2, an Infinite-length Film Generative Model, that synergizes Multi-modal Large Language Model (MLLM), Multi-stage Pretraining, Reinforcement Learning, and Diffusion Forcing Framework. Firstly, we design a comprehensive structural representation of video that combines the general descriptions by the Multi-modal LLM and the detailed shot language by sub-expert models. Aided with human annotation, we then train a unified Video Captioner, named SkyCaptioner-V1, to efficiently label the video data. Secondly, we establish progressive-resolution pretraining for the fundamental video generation, followed by a four-stage post-training enhancement: Initial concept-balanced Supervised Fine-Tuning (SFT) improves baseline quality; Motion-specific Reinforcement Learning (RL) training with human-annotated and synthetic distortion data addresses dynamic artifacts; Our diffusion forcing framework with non-decreasing noise schedules enables long-video synthesis in an efficient search space; Final high-quality SFT refines visual fidelity. All the code and models are available at this https URL.
https://arxiv.org/abs/2504.13074
Remote Sensing Image Super-Resolution (RSISR) reconstructs high-resolution (HR) remote sensing images from low-resolution inputs to support fine-grained ground object interpretation. Existing methods face three key challenges: (1) Difficulty in extracting multi-scale features from spatially heterogeneous RS scenes, (2) Limited prior information causing semantic inconsistency in reconstructions, and (3) Trade-off imbalance between geometric accuracy and visual quality. To address these issues, we propose the Texture Transfer Residual Denoising Dual Diffusion Model (TTRD3) with three innovations: First, a Multi-scale Feature Aggregation Block (MFAB) employing parallel heterogeneous convolutional kernels for multi-scale feature extraction. Second, a Sparse Texture Transfer Guidance (STTG) module that transfers HR texture priors from reference images of similar scenes. Third, a Residual Denoising Dual Diffusion Model (RDDM) framework combining residual diffusion for deterministic reconstruction and noise diffusion for diverse generation. Experiments on multi-source RS datasets demonstrate TTRD3's superiority over state-of-the-art methods, achieving 1.43% LPIPS improvement and 3.67% FID enhancement compared to best-performing baselines. Code/model: this https URL.
https://arxiv.org/abs/2504.13026
Chinese-Vicuna is an open-source, resource-efficient language model designed to bridge the gap in Chinese instruction-following capabilities by fine-tuning Meta's LLaMA architecture using Low-Rank Adaptation (LoRA). Targeting low-resource environments, it enables cost-effective deployment on consumer GPUs (e.g., RTX-2080Ti for 7B models) and supports domain-specific adaptation in fields like healthcare and law. By integrating hybrid datasets (BELLE and Guanaco) and 4-bit quantization (QLoRA), the model achieves competitive performance in tasks such as translation, code generation, and domain-specific Q&A. The project provides a comprehensive toolkit for model conversion, CPU inference, and multi-turn dialogue interfaces, emphasizing accessibility for researchers and developers. Evaluations indicate competitive performance across medical tasks, multi-turn dialogue coherence, and real-time legal updates. Chinese-Vicuna's modular design, open-source ecosystem, and community-driven enhancements position it as a versatile foundation for Chinese LLM applications.
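A hedged sketch of the LoRA plus 4-bit quantization (QLoRA) recipe the abstract refers to, using the Hugging Face transformers/peft/bitsandbytes stack; the base checkpoint, target modules, and hyperparameters here are placeholders rather than the project's actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "huggyllama/llama-7b"  # placeholder checkpoint; the project uses its own LLaMA weights

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb_cfg, device_map="auto")

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # typical LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```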
https://arxiv.org/abs/2504.12737
Deep neural networks (DNNs) have recently become the leading method for low-light image enhancement (LLIE). However, despite significant progress, their outputs may still exhibit issues such as amplified noise, incorrect white balance, or unnatural enhancements when deployed in real-world applications. A key challenge is the lack of diverse, large-scale training data that captures the complexities of low-light conditions and imaging pipelines. In this paper, we propose a novel image signal processing (ISP) driven data synthesis pipeline that addresses these challenges by generating unlimited paired training data. Specifically, our pipeline begins with easily collected high-quality normal-light images, which are first unprocessed into the RAW format using a reverse ISP. We then synthesize low-light degradations directly in the RAW domain. The resulting data is subsequently processed through a series of ISP stages, including white balance adjustment, color space conversion, tone mapping, and gamma correction, with controlled variations introduced at each stage. This broadens the degradation space and enhances the diversity of the training data, enabling the generated data to capture a wide range of degradations and the complexities inherent in the ISP pipeline. To demonstrate the effectiveness of our synthetic pipeline, we conduct extensive experiments using a vanilla UNet model consisting solely of convolutional layers, group normalization, GeLU activation, and convolutional block attention modules (CBAMs). Extensive testing across multiple datasets reveals that the vanilla UNet model trained with our data synthesis pipeline delivers high-fidelity, visually appealing enhancement results, surpassing state-of-the-art (SOTA) methods both quantitatively and qualitatively.
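A toy illustration of the unprocess-degrade-reprocess idea, assuming a crude inverse ISP and hand-picked noise and jitter parameters; the paper's calibrated reverse ISP and full stage list are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

def unprocess(srgb):
    """Crude reverse ISP: undo gamma and white balance to get pseudo-RAW (sRGB in [0, 1])."""
    linear = np.clip(srgb, 0, 1) ** 2.2            # inverse gamma
    wb = np.array([2.0, 1.0, 1.8])                 # assumed per-channel gains
    return linear / wb

def degrade_raw(raw, exposure=0.1, shot=0.01, read=0.0005):
    """Simulate low light in RAW: reduce exposure, add signal-dependent noise."""
    dark = raw * exposure
    noisy = rng.normal(dark, np.sqrt(shot * dark + read))
    return np.clip(noisy, 0, 1)

def reprocess(raw, wb_jitter=0.1, gamma_jitter=0.2):
    """Forward ISP with controlled variation: white balance, tone curve, gamma."""
    wb = np.array([2.0, 1.0, 1.8]) * (1 + rng.uniform(-wb_jitter, wb_jitter, 3))
    img = np.clip(raw * wb, 0, 1)
    img = img / (img + 0.25)                       # simple tone mapping
    gamma = 2.2 * (1 + rng.uniform(-gamma_jitter, gamma_jitter))
    return np.clip(img, 0, 1) ** (1.0 / gamma)

normal_light = rng.uniform(0, 1, size=(64, 64, 3))     # stand-in for a real photo
low_light = reprocess(degrade_raw(unprocess(normal_light)))
```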
https://arxiv.org/abs/2504.12204
Low-light conditions pose significant challenges for both human and machine annotation. This in turn has led to a lack of research into machine understanding of low-light images and (in particular) videos. A common approach is to apply annotations obtained from high-quality datasets to synthetically created low-light versions. In addition, these approaches are often limited by the use of unrealistic noise models. In this paper, we propose a new Degradation Estimation Network (DEN), which synthetically generates realistic standard RGB (sRGB) noise without the requirement for camera metadata. This is achieved by estimating the parameters of physics-informed noise distributions, trained in a self-supervised manner. This zero-shot approach allows our method to generate synthetic noisy content with a diverse range of realistic noise characteristics, unlike other methods which focus on recreating the noise characteristics of the training data. We evaluate our proposed synthetic pipeline using various methods trained on its synthetic data for typical low-light tasks including synthetic noise replication, video enhancement, and object detection, showing improvements of up to 24% KLD, 21% LPIPS, and 62% AP50-95, respectively.
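The abstract does not give the noise parameterization; a common physics-informed choice is heteroscedastic shot-plus-read noise, and the sketch below shows only how such estimated parameters would be used to sample sRGB noise (the estimation network itself is not reproduced, and the parameter values are invented).

```python
import numpy as np

def add_physics_informed_noise(clean, shot_gain, read_sigma, rng=None):
    """Sample signal-dependent (shot) plus signal-independent (read) noise.

    clean      : clean sRGB image in [0, 1]
    shot_gain  : scales the Poisson-like variance proportional to intensity
    read_sigma : standard deviation of the Gaussian read-noise floor
    """
    rng = rng or np.random.default_rng()
    variance = shot_gain * clean + read_sigma ** 2   # per-pixel variance
    noisy = clean + rng.normal(0.0, np.sqrt(variance))
    return np.clip(noisy, 0.0, 1.0)

# In DEN these parameters would come from the estimation network per image;
# here they are just plausible hand-picked values.
img = np.random.default_rng(1).uniform(0, 1, size=(32, 32, 3))
noisy = add_physics_informed_noise(img, shot_gain=0.02, read_sigma=0.03)
```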
https://arxiv.org/abs/2504.12169
We present a novel framework that bridges the gap between the interpretability of decision trees and the advanced reasoning capabilities of large language models (LLMs) to predict startup success. Our approach leverages chain-of-thought prompting to generate detailed reasoning logs, which are subsequently distilled into structured, human-understandable logical rules. The pipeline integrates multiple enhancements - efficient data ingestion, a two-step refinement process, ensemble candidate sampling, simulated reinforcement learning scoring, and persistent memory - to ensure both stable decision-making and transparent output. Experimental evaluations on curated startup datasets demonstrate that our combined pipeline improves precision by 54% from 0.225 to 0.346 and accuracy by 50% from 0.46 to 0.70 compared to a standalone OpenAI o3 model. Notably, our model achieves over 2x the precision of a random classifier (16%). By combining state-of-the-art AI reasoning with explicit rule-based explanations, our method not only augments traditional decision-making processes but also facilitates expert intervention and continuous policy refinement. This work lays the foundation for the implementation of interpretable LLM-powered decision frameworks in high-stakes investment environments and other domains that require transparent and data-driven insights.
https://arxiv.org/abs/2504.12090
We present a Bayesian dynamic borrowing (BDB) approach to enhance the quantitative identification of adverse events (AEs) in spontaneous reporting systems (SRSs). The method embeds a robust meta-analytic predictive (MAP) prior within a Bayesian hierarchical model and incorporates semantic similarity measures (SSMs) to enable weighted information sharing from MedDRA Preferred Terms (PTs) that are clinically similar to the target PT. This continuous similarity-based borrowing addresses the limitation of rigid hierarchical grouping in current disproportionality analysis (DPA). Using data from the FDA Adverse Event Reporting System (FAERS) between 2015 and 2019, we evaluate this approach - termed IC SSM - against standard Information Component (IC) analysis and IC with borrowing at the MedDRA high-level group term (HLGT) level. A novel reference set (PVLens), derived from FDA product label updates, enabled prospective evaluation of method performance in identifying AEs prior to official labeling. The IC SSM approach demonstrated improved sensitivity compared to both traditional IC and HLGT-based borrowing, with minor trade-offs in F1 scores and Youden's index. IC SSM consistently identified more true positives and detected signals over 5 months sooner than traditional IC. Despite a marginally lower aggregate Youden's index, IC SSM showed higher performance in the early post-marketing period, providing more stable and relevant estimates than HLGT-based borrowing and traditional IC. These findings support the use of SSM-informed Bayesian borrowing as a scalable and context-aware enhancement to traditional DPA methods. Future research should validate this approach across other datasets and explore additional similarity metrics and Bayesian inference strategies using case-level data.
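The full MAP-prior hierarchical model is beyond an abstract-level sketch, but the flavor of similarity-weighted borrowing can be illustrated as below: counts from clinically similar PTs are pooled into the target PT's observed and expected counts, weighted by semantic similarity, before computing the shrunk Information Component. The thresholding rule and the numbers are assumptions, not the paper's specification.

```python
import numpy as np

def information_component(n_obs, n_exp):
    """Shrunk Information Component as commonly used in disproportionality analysis."""
    return np.log2((n_obs + 0.5) / (n_exp + 0.5))

def ssm_borrowed_ic(target_obs, target_exp, neighbor_obs, neighbor_exp, similarities, tau=0.8):
    """Borrow observed/expected counts from clinically similar PTs, weighted by
    semantic similarity (only neighbors above a similarity threshold contribute)."""
    w = np.where(np.asarray(similarities) >= tau, similarities, 0.0)
    obs = target_obs + np.sum(w * neighbor_obs)
    exp = target_exp + np.sum(w * neighbor_exp)
    return information_component(obs, exp)

# A target PT with weak evidence borrows from three similar PTs.
print(ssm_borrowed_ic(target_obs=4, target_exp=2.5,
                      neighbor_obs=np.array([12, 7, 3]),
                      neighbor_exp=np.array([6.0, 5.0, 2.0]),
                      similarities=np.array([0.93, 0.85, 0.40])))
```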
https://arxiv.org/abs/2504.12052
The YOLO (You Only Look Once) series has been a leading framework in real-time object detection, consistently improving the balance between speed and accuracy. However, integrating attention mechanisms into YOLO has been challenging due to their high computational overhead. YOLOv12 introduces a novel approach that successfully incorporates attention-based enhancements while preserving real-time performance. This paper provides a comprehensive review of YOLOv12's architectural innovations, including Area Attention for computationally efficient self-attention, Residual Efficient Layer Aggregation Networks for improved feature aggregation, and FlashAttention for optimized memory access. Additionally, we benchmark YOLOv12 against prior YOLO versions and competing object detectors, analyzing its improvements in accuracy, inference speed, and computational efficiency. Through this analysis, we demonstrate how YOLOv12 advances real-time object detection by refining the latency-accuracy trade-off and optimizing computational resources.
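A minimal sketch of the area-attention idea reviewed here - partition the flattened feature map into a few contiguous areas and run self-attention within each - not the official YOLOv12 implementation.

```python
import torch
import torch.nn as nn

class AreaAttention(nn.Module):
    """Self-attention restricted to contiguous 'areas' of the flattened feature map,
    so the cost scales with (tokens/areas)^2 per area instead of tokens^2 globally."""
    def __init__(self, dim: int, num_heads: int = 4, num_areas: int = 4):
        super().__init__()
        self.num_areas = num_areas
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, N, C] flattened spatial tokens; N must be divisible by num_areas here.
        b, n, c = x.shape
        a = self.num_areas
        x = x.reshape(b * a, n // a, c)            # split tokens into areas
        out, _ = self.attn(x, x, x)                # attention within each area only
        return out.reshape(b, n, c)

feat = torch.randn(2, 64 * 64, 128)                # e.g., a 64x64 feature map with 128 channels
print(AreaAttention(dim=128)(feat).shape)          # torch.Size([2, 4096, 128])
```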
https://arxiv.org/abs/2504.11995
Mesh reconstruction from multi-view images is a fundamental problem in computer vision, but its performance degrades significantly under sparse-view conditions, especially in unseen regions where no ground-truth observations are available. While recent advances in diffusion models have demonstrated strong capabilities in synthesizing novel views from limited inputs, their outputs often suffer from visual artifacts and lack 3D consistency, posing challenges for reliable mesh optimization. In this paper, we propose a novel framework that leverages diffusion models to enhance sparse-view mesh reconstruction in a principled and reliable manner. To address the instability of diffusion outputs, we propose a Consensus Diffusion Module that filters unreliable generations via interquartile range (IQR) analysis and performs variance-aware image fusion to produce robust pseudo-supervision. Building on this, we design an online reinforcement learning strategy based on the Upper Confidence Bound (UCB) to adaptively select the most informative viewpoints for enhancement, guided by diffusion loss. Finally, the fused images are used to jointly supervise a NeRF-based model alongside sparse-view ground truth, ensuring consistency across both geometry and appearance. Extensive experiments demonstrate that our method achieves significant improvements in both geometric quality and rendering quality.
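A simplified sketch of the two mechanisms named in the abstract: per-pixel IQR filtering with averaging of the surviving diffusion samples (a stand-in for the variance-aware fusion), and a UCB score for choosing the next viewpoint to enhance. Shapes, constants, and the confidence proxy are assumptions.

```python
import numpy as np

def consensus_fuse(samples):
    """samples: [S, H, W, 3] diffusion outputs for one viewpoint.
    Keep per-pixel values inside the IQR fence, then average the survivors."""
    q1, q3 = np.percentile(samples, [25, 75], axis=0)
    iqr = q3 - q1
    keep = (samples >= q1 - 1.5 * iqr) & (samples <= q3 + 1.5 * iqr)
    weights = keep.astype(np.float64)
    fused = (samples * weights).sum(axis=0) / np.maximum(weights.sum(axis=0), 1e-8)
    variance = np.var(samples, axis=0).mean()       # scalar confidence proxy
    return fused, variance

def ucb_select(mean_gain, counts, t, c=1.0):
    """Pick the viewpoint with the highest UCB score (exploit + explore)."""
    counts = np.maximum(counts, 1e-8)
    return int(np.argmax(mean_gain + c * np.sqrt(np.log(t + 1) / counts)))

samples = np.random.default_rng(0).uniform(0, 1, size=(8, 32, 32, 3))
fused, var = consensus_fuse(samples)
print(fused.shape, var, ucb_select(np.array([0.2, 0.5, 0.1]), np.array([3, 1, 2]), t=6))
```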
https://arxiv.org/abs/2504.11946
Image decomposition offers deep insights into the imaging factors of visual data and significantly enhances various advanced computer vision tasks. In this work, we introduce a novel approach to low-light image enhancement based on decomposed physics-informed priors. Existing methods that directly map low-light to normal-light images in the sRGB color space suffer from inconsistent color predictions and high sensitivity to spectral power distribution (SPD) variations, resulting in unstable performance under diverse lighting conditions. To address these challenges, we introduce a Physics-informed Color-aware Transform (PiCat), a learning-based framework that converts low-light images from the sRGB color space into deep illumination-invariant descriptors via our proposed Color-aware Transform (CAT). This transformation enables robust handling of complex lighting and SPD variations. Complementing this, we propose the Content-Noise Decomposition Network (CNDN), which refines the descriptor distributions to better align with well-lit conditions by mitigating noise and other distortions, thereby effectively restoring content representations to low-light images. The CAT and the CNDN collectively act as a physical prior, guiding the transformation process from low-light to normal-light domains. Our proposed PiCat framework demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets.
https://arxiv.org/abs/2504.11896
Person re-identification (Re-ID) aims to match the same pedestrian in a large gallery across different cameras and views. Enhancing the robustness of the extracted feature representations is a main challenge in Re-ID. Existing methods usually improve feature representation by refining the model architecture, but most methods ignore the potential contextual information, which limits the effectiveness of feature representation and retrieval performance. Neighborhood information, especially the latent information of multi-order neighborhoods, can effectively enrich feature expression and improve retrieval accuracy, but this has not been fully explored in existing research. Therefore, we propose a novel model, DMON-ARO, that leverages latent neighborhood information to enhance both feature representation and index performance. Our approach is built on two complementary modules: Dynamic Multi-Order Neighbor Modeling (DMON) and Asymmetric Relationship Optimization (ARO). The DMON module dynamically aggregates multi-order neighbor relationships, allowing it to capture richer contextual information and enhance feature representation through adaptive neighborhood modeling. Meanwhile, ARO refines the distance matrix by optimizing query-to-gallery relationships, improving the index accuracy. Extensive experiments on three benchmark datasets demonstrate that our approach achieves performance improvements over baseline models, which illustrates the effectiveness of our model. Specifically, our model demonstrates improvements in Rank-1 accuracy and mAP. Moreover, this method can also be directly extended to other re-identification tasks.
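A rough sketch of multi-order neighbor aggregation over a k-NN graph of gallery features; the dynamic weighting in DMON and the asymmetric optimization in ARO are not reproduced, and the fixed decay factor here is an assumption.

```python
import numpy as np

def knn_adjacency(features, k=5):
    """Row-normalized k-NN adjacency over L2-normalized gallery features [N, D]."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)
    idx = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros_like(sim)
    rows = np.arange(sim.shape[0])[:, None]
    adj[rows, idx] = 1.0 / k
    return adj

def multi_order_aggregate(features, orders=3, alpha=0.5, k=5):
    """Mix each feature with its 1st..Nth-order neighbor averages, then re-normalize."""
    adj = knn_adjacency(features, k)
    out, hop = features.copy(), features.copy()
    for order in range(1, orders + 1):
        hop = adj @ hop                      # neighbors one hop further out
        out = out + (alpha ** order) * hop
    return out / np.linalg.norm(out, axis=1, keepdims=True)

gallery = np.random.default_rng(0).normal(size=(100, 256))
refined = multi_order_aggregate(gallery)
print(refined.shape)
```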
https://arxiv.org/abs/2504.11798
Medical Visual Question Answering (MVQA) systems can interpret medical images in response to natural language queries. However, linguistic variability in question phrasing often undermines the consistency of these systems. To address this challenge, we propose a Semantically Equivalent Question Augmentation (SEQA) framework, which leverages large language models (LLMs) to generate diverse yet semantically equivalent rephrasings of questions. Specifically, this approach enriches linguistic diversity while preserving semantic meaning. We further introduce an evaluation metric, Total Agreement Rate with Semantically Equivalent Input and Correct Answer (TAR-SC), which assesses a model's capability to generate consistent and correct responses to semantically equivalent linguistic variations. In addition, we propose three other diversity metrics - average number of QA items per image (ANQI), average number of questions per image with the same answer (ANQA), and average number of open-ended questions per image with the same semantics (ANQS). Using the SEQA framework, we augmented the benchmark MVQA public datasets SLAKE, VQA-RAD, and PathVQA. As a result, all three datasets achieved significant improvements by incorporating more semantically equivalent questions: ANQI increased by an average of 86.1, ANQA by 85.1, and ANQS by 46. Subsequent experiments evaluate three MVQA models (M2I2, MUMC, and BiomedGPT) under both zero-shot and fine-tuning settings on the enhanced datasets. Experimental results on the MVQA datasets show that fine-tuned models achieve an average accuracy improvement of 19.35%, while our proposed TAR-SC metric shows an average improvement of 11.61%, indicating a substantial enhancement in model consistency.
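TAR-SC is defined only verbally above; one straightforward reading is sketched below: a question group counts toward the metric only if every semantically equivalent rephrasing receives the same answer and that answer is correct. This is an illustrative interpretation, not the authors' released evaluation code.

```python
def tar_sc(groups):
    """groups: list of (predictions, gold) where `predictions` are the model's answers
    to all semantically equivalent rephrasings of one question and `gold` is the
    reference answer. Returns the fraction of groups that are both consistent and correct."""
    hits = 0
    for predictions, gold in groups:
        if len(set(predictions)) == 1 and predictions[0] == gold:
            hits += 1
    return hits / len(groups) if groups else 0.0

example = [
    (["pneumonia", "pneumonia", "pneumonia"], "pneumonia"),  # consistent and correct
    (["yes", "no", "yes"], "yes"),                           # sometimes correct, not consistent
    (["left lung", "left lung"], "right lung"),              # consistent but wrong
]
print(tar_sc(example))  # 1/3
```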
https://arxiv.org/abs/2504.11777
Recently, learning-based robotic navigation systems have gained extensive research attention and made significant progress. However, the diversity of open-world scenarios poses a major challenge for the generalization of such systems to practical scenarios. Specifically, learned systems for scene measurement and state estimation tend to degrade when the application scenarios deviate from the training data, resulting in unreliable depth and pose estimation. To address this problem, this work aims to develop a visual odometry system that can rapidly adapt to diverse novel environments in an online manner. To this end, we construct a self-supervised online adaptation framework for monocular visual odometry aided by an online-updated depth estimation module. Firstly, we design a monocular depth estimation network with lightweight refiner modules, which enables efficient online adaptation. Then, we construct an objective for self-supervised learning of the depth estimation module based on the output of the visual odometry system and the contextual semantic information of the scene. Specifically, a sparse depth densification module and a dynamic consistency enhancement module are proposed to leverage camera poses and contextual semantics to generate pseudo-depths and valid masks for the online adaptation. Finally, we demonstrate the robustness and generalization capability of the proposed method in comparison with state-of-the-art learning-based approaches on urban and in-house datasets and on a robot platform. Code is publicly available at: this https URL.
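One small piece of the described adaptation objective, sketched under assumptions: the densified pseudo-depths from the VO back-end supervise the depth network only where a validity mask (e.g., after the dynamic-consistency check) marks pixels as trustworthy. The loss form and shapes are illustrative, not the paper's exact formulation.

```python
import torch

def masked_pseudo_depth_loss(pred_depth, pseudo_depth, valid_mask):
    """L1 loss between predicted and pseudo depth, restricted to valid pixels.

    pred_depth, pseudo_depth : [B, 1, H, W]
    valid_mask               : [B, 1, H, W] boolean, True where the densified
                               pseudo-depth is trusted (static, well-triangulated)
    """
    diff = (pred_depth - pseudo_depth).abs()
    valid = valid_mask.float()
    return (diff * valid).sum() / valid.sum().clamp(min=1.0)

pred = torch.rand(2, 1, 48, 64)
pseudo = torch.rand(2, 1, 48, 64)
mask = torch.rand(2, 1, 48, 64) > 0.7       # pretend ~30% of pixels are trusted
print(masked_pseudo_depth_loss(pred, pseudo, mask))
```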
https://arxiv.org/abs/2504.11698
Multimodal Aspect-Based Sentiment Analysis (MABSA) seeks to extract fine-grained information from image-text pairs to identify aspect terms and determine their sentiment polarity. However, existing approaches often fall short in simultaneously addressing three core challenges: Sentiment Cue Perception (SCP), Multimodal Information Misalignment (MIM), and Semantic Noise Elimination (SNE). To overcome these limitations, we propose DASCO (Dependency Structure Augmented Scoping Framework), a fine-grained scope-oriented framework that enhances aspect-level sentiment reasoning by leveraging dependency parsing trees. First, we design a multi-task pretraining strategy for MABSA on our base model, combining aspect-oriented enhancement, image-text matching, and aspect-level sentiment-sensitive cognition. This improves the model's perception of aspect terms and sentiment cues while achieving effective image-text alignment, addressing key challenges such as SCP and MIM. Furthermore, we incorporate dependency trees as a syntactic branch alongside the semantic branch, guiding the model to selectively attend to critical contextual elements within a target-specific scope while effectively filtering out irrelevant noise, thereby addressing the SNE problem. Extensive experiments on two benchmark datasets across three subtasks demonstrate that DASCO achieves state-of-the-art performance in MABSA, with notable gains in JMASA (+3.1% F1 and +5.4% precision on Twitter2015).
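DASCO's syntactic branch is a learned module; the heuristic below merely illustrates how a dependency tree can delimit a target-specific scope around an aspect term, using spaCy (the hop count and matching rule are assumptions).

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def aspect_scope(sentence: str, aspect: str, hops: int = 1):
    """Return tokens within `hops` dependency steps of the aspect term,
    a rough stand-in for a target-specific syntactic scope."""
    doc = nlp(sentence)
    seeds = {t.i for t in doc if t.text.lower() == aspect.lower()}
    scope = set(seeds)
    for _ in range(hops):
        grown = set(scope)
        for tok in doc:
            if tok.i in scope:
                grown.add(tok.head.i)                        # parent in the tree
                grown.update(child.i for child in tok.children)  # direct dependents
        scope = grown
    return [doc[i].text for i in sorted(scope)]

print(aspect_scope("The battery life is amazing but the screen scratches easily.", "battery"))
```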
https://arxiv.org/abs/2504.11331
Neural networks (NNs) trained through supervised learning struggle to manage the edge-case scenarios common in real-world driving, because exhaustive datasets covering all edge cases are intractable to collect. Knowledge-driven approaches, akin to how humans intuitively detect unexpected driving behavior, are therefore a suitable complement to data-driven methods. This work proposes a hybrid architecture combining a low-level Model Predictive Controller (MPC) with locally deployed Large Language Models (LLMs) to enhance decision-making and Human Machine Interaction (HMI). The DecisionxLLM module evaluates robotic state information against natural language instructions to ensure adherence to desired driving behavior. The MPCxLLM module then adjusts MPC parameters based on LLM-generated insights, achieving control adaptability while preserving the safety and constraint guarantees of traditional MPC systems. Further, to enable efficient on-board deployment and to eliminate dependency on cloud connectivity, we shift processing to the on-board computing platform: we propose an approach that exploits Retrieval-Augmented Generation (RAG), Low-Rank Adaptation (LoRA) fine-tuning, and quantization. Experimental results demonstrate that these enhancements yield significant improvements in reasoning accuracy by up to 10.45%, control adaptability by as much as 52.2%, and up to a 10.5x increase in computational efficiency (tokens/s), validating the proposed framework's practicality for real-time deployment even on down-scaled robotic platforms. This work bridges high-level decision-making with low-level control adaptability, offering a synergistic framework for knowledge-driven and adaptive Autonomous Driving Systems (ADS).
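The abstract says MPCxLLM adjusts MPC parameters from LLM-generated insights without giving the interface; the sketch below shows one hedged way to write the glue logic - parse a structured reply and clamp requested changes to hard safety bounds so the controller's guarantees are never overridden. Parameter names and bounds are invented for illustration.

```python
import json

# Hypothetical tunable MPC parameters with hard safety bounds (min, max).
SAFE_BOUNDS = {
    "max_speed_mps":       (0.5, 8.0),
    "lateral_accel_limit": (0.5, 4.0),
    "tracking_weight":     (0.1, 10.0),
}

def apply_llm_update(mpc_params: dict, llm_reply: str) -> dict:
    """Parse a JSON object from the LLM and apply only known, in-bounds parameters,
    so the MPC's constraint guarantees are never overridden by the language model."""
    try:
        requested = json.loads(llm_reply)
    except json.JSONDecodeError:
        return mpc_params                      # ignore malformed replies
    updated = dict(mpc_params)
    for name, value in requested.items():
        if name in SAFE_BOUNDS and isinstance(value, (int, float)):
            lo, hi = SAFE_BOUNDS[name]
            updated[name] = min(max(float(value), lo), hi)
    return updated

params = {"max_speed_mps": 3.0, "lateral_accel_limit": 2.0, "tracking_weight": 1.0}
reply = '{"max_speed_mps": 12.0, "tracking_weight": 2.5}'   # 12.0 gets clamped to 8.0
print(apply_llm_update(params, reply))
```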
https://arxiv.org/abs/2504.11514