Transition Matching: Scalable and Flexible Generative Modeling

2025-06-30 07:51:58
Neta Shaul, Uriel Singer, Itai Gat, Yaron Lipman

Abstract

Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants. (i) Difference Transition Matching (DTM) generalizes flow matching to discrete time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on the text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.
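
The idea of generating through a short chain of stochastic Markov transitions can be illustrated with a toy one-dimensional sketch. The abstract does not state the update rule, so everything below is an assumption made for illustration: a linear supervision path X_t = (1 - t/T) X_0 + (t/T) X_T, a "difference" Y = X_T - X_0, a step X_{t+1} = X_t + Y/T, a standard normal source, and a two-point target {-1, +1}. The function names `sample_difference` and `sample_chain` are hypothetical. In this toy case the posterior of Y given X_t is available in closed form and stands in for what a learned model would approximate.

```python
import math
import random

# Assumed toy target distribution: equal mass on -1 and +1.
TARGETS = (-1.0, 1.0)


def sample_difference(x_t, s, rng):
    """Draw Y = X_T - X_0 from its exact posterior given X_t (toy case only).

    s = t / T is the fraction of the chain already completed, with 0 <= s < 1.
    """
    # Under the assumed linear path, X_t | X_T = a is N(s * a, (1 - s)^2).
    log_w = [-((x_t - s * a) ** 2) / (2.0 * (1.0 - s) ** 2) for a in TARGETS]
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(v - m) for v in log_w]
    # Stochastic choice of the endpoint: this is what makes the kernel
    # non-deterministic.
    x_T = TARGETS[1] if rng.random() < w[1] / (w[0] + w[1]) else TARGETS[0]
    # Recover the matching source point from the linear path, then the difference.
    x_0 = (x_t - s * x_T) / (1.0 - s)
    return x_T - x_0


def sample_chain(num_steps, rng):
    """Run T Markov transitions from a standard normal source sample."""
    x = rng.gauss(0.0, 1.0)
    for t in range(num_steps):
        s = t / num_steps
        x += sample_difference(x, s, rng) / num_steps
    return x


rng = random.Random(0)
samples = [sample_chain(num_steps=8, rng=rng) for _ in range(2000)]
```

Each transition depends only on the current state, and a fresh endpoint is drawn at every step, so the kernel is a distribution, not a single deterministic velocity. A real model would replace the closed-form posterior with a learned conditional sampler.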

URL

https://arxiv.org/abs/2506.23589

PDF

https://arxiv.org/pdf/2506.23589.pdf

