Paper Reading AI Learner

FinLMM-R1: Enhancing Financial Reasoning in LMM through Scalable Data and Reward Design

2025-06-16 03:19:31
Kai Lan, Jiayong Zhu, Jiangtong Li, Dawei Cheng, Guang Chen, Changjun Jiang

Abstract

Large Multimodal Models (LMMs) demonstrate significant cross-modal reasoning capabilities. However, financial applications face challenges due to the lack of high-quality multimodal reasoning datasets and the inefficiency of existing training paradigms for reasoning enhancement. To address these issues, we propose an integrated framework, FinLMM-R1, combining an automated and scalable pipeline for data construction with enhanced training strategies to improve the multimodal reasoning of LMMs. The Automated and Scalable Pipeline (ASP) resolves textual-visual misalignment in financial reports through a separate paradigm of question-answer generation and image-question alignment, ensuring data integrity and extraction efficiency. Through ASP, we collect 89,378 aligned image-question pairs from 23,397 financial reports, covering tasks such as arithmetic reasoning, statistical reasoning, financial explanation, and financial knowledge. Moreover, we introduce Thinking with Adversarial Reward in LMM (TAR-LMM), which extends the prior two-stage training framework [1] with additional reward mechanisms. In the first stage, we focus on text-only tasks with format and accuracy rewards to guide the model toward generating well-structured thinking content. In the second stage, we construct multi-image contrastive samples with additional reward components, including image selection, thinking content length, and adversarial reward, to jointly optimize the LMM across visual perception, reasoning efficiency, and logical coherence. Extensive experiments on 7 benchmarks show that the ASP-derived dataset and the training framework significantly improve answer accuracy and reasoning depth over existing reasoning LMMs in both general and financial multimodal contexts.
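For illustration, the following is a minimal Python sketch of how the stage-two composite reward described above might be assembled, assuming a <think>/<answer> response format, exact-match answer scoring, and an external judge score for the adversarial term. The function names, tag conventions, and weights are illustrative assumptions, not the authors' released implementation.

```python
import re

# Hypothetical sketch of the stage-two composite reward (format, accuracy,
# image selection, thinking-content length, adversarial terms). All names,
# weights, and tag conventions are assumptions for illustration only.

def format_reward(response: str) -> float:
    """1.0 if the response wraps its reasoning and answer in <think>/<answer> tags."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), flags=re.DOTALL) else 0.0

def accuracy_reward(predicted: str, gold: str) -> float:
    """1.0 if the extracted answer matches the reference answer (exact match here)."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def image_selection_reward(chosen_index: int, gold_index: int) -> float:
    """1.0 if the model points to the ground-truth image within the contrastive set."""
    return 1.0 if chosen_index == gold_index else 0.0

def length_reward(think_tokens: int, target: int = 256, tolerance: int = 128) -> float:
    """Penalize reasoning traces far shorter or longer than a target token budget."""
    return max(0.0, 1.0 - abs(think_tokens - target) / tolerance)

def adversarial_reward(judge_score: float) -> float:
    """Score from a judge/discriminator rating logical coherence, clipped to [0, 1]."""
    return min(max(judge_score, 0.0), 1.0)

def stage_two_reward(response: str, predicted: str, gold: str,
                     chosen_img: int, gold_img: int,
                     think_tokens: int, judge_score: float,
                     weights=(1.0, 1.0, 0.5, 0.25, 0.5)) -> float:
    """Weighted sum of the five reward components (weights are illustrative)."""
    parts = (
        format_reward(response),
        accuracy_reward(predicted, gold),
        image_selection_reward(chosen_img, gold_img),
        length_reward(think_tokens),
        adversarial_reward(judge_score),
    )
    return sum(w * p for w, p in zip(weights, parts))

if __name__ == "__main__":
    resp = "<think>Revenue grew from 10 to 12, i.e. 20%.</think><answer>20%</answer>"
    print(stage_two_reward(resp, "20%", "20%", chosen_img=1, gold_img=1,
                           think_tokens=240, judge_score=0.8))
```

In such a setup, the format and accuracy terms mirror stage one, while the image selection, length, and adversarial terms are the stage-two additions that target visual perception, reasoning efficiency, and logical coherence, respectively.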

Abstract (translated)

Large Multimodal Models (LMMs) demonstrate significant cross-modal reasoning capabilities. However, financial applications face challenges due to the lack of high-quality multimodal reasoning datasets and the inefficiency of existing training paradigms for reasoning enhancement. To address these issues, we propose an integrated framework, FinLMM-R1, which combines an automated and scalable data-construction pipeline with enhanced training strategies to improve the multimodal reasoning capabilities of LMMs. The Automated and Scalable Pipeline (ASP) resolves textual-visual misalignment in financial reports through a separate paradigm of question-answer generation and image-question alignment, ensuring data integrity and extraction efficiency. Through ASP, we collect 89,378 aligned image-question pairs from 23,397 financial reports, covering tasks such as arithmetic reasoning, statistical reasoning, financial explanation, and financial knowledge. In addition, we propose Thinking with Adversarial Reward in LMM (TAR-LMM), which adds further reward mechanisms on top of the prior two-stage training framework. The first stage focuses on text-only tasks, using format and accuracy rewards to guide the model in generating well-structured thinking content; the second stage constructs multi-image contrastive samples and introduces additional reward components, including image selection, thinking-content length, and adversarial reward, to jointly optimize visual perception, reasoning efficiency, and logical consistency. Extensive experiments on 7 benchmarks show that the ASP-derived dataset and training framework significantly improve the answer accuracy and reasoning depth of existing reasoning LMMs in both general and financial multimodal settings.

URL

https://arxiv.org/abs/2506.13066

PDF

https://arxiv.org/pdf/2506.13066.pdf

