Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection

2026-02-10 18:10:08
Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu

Abstract

Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model's ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.
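
The abstract does not spell out how HGRPO scores responses, so the following is only a minimal sketch of the generic group-relative advantage normalization used in GRPO-style methods, not the authors' algorithm. The reward function `toy_reward`, its `length_penalty` parameter, and the sample group are hypothetical, chosen to illustrate how a cost on reasoning could push a policy toward answering directly when an image is easy.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled response's reward against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_reward(is_correct, used_reasoning, length_penalty=0.2):
    """Hypothetical reward: correctness minus a fixed cost for a reasoning trace."""
    base = 1.0 if is_correct else 0.0
    return base - (length_penalty if used_reasoning else 0.0)


# One group of sampled responses for a single image: (correct?, used reasoning?)
group = [(True, False), (True, True), (False, False), (True, True)]
rewards = [toy_reward(correct, reasoned) for correct, reasoned in group]
advantages = group_relative_advantages(rewards)

for (correct, reasoned), adv in zip(group, advantages):
    print(f"correct={correct!s:5} reasoning={reasoned!s:5} advantage={adv:+.3f}")
```

Under these assumed rewards, a correct answer given without reasoning receives the largest advantage in its group, which is one plausible way an adaptive reasoning mode could be learned implicitly.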

URL

https://arxiv.org/abs/2602.10042

PDF

https://arxiv.org/pdf/2602.10042.pdf
