Paper Reading AI Learner

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models

2024-04-18 17:59:48
Aitor Ormazabal, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, Zhihui Xie

Abstract

We introduce Reka Core, Flash, and Edge, a series of powerful multimodal language models trained from scratch by Reka. Reka models are able to process and reason with text, images, video, and audio inputs. This technical report discusses details of training some of these models and provides comprehensive evaluation results. We show that Reka Edge and Reka Flash are not only state-of-the-art but also outperform many much larger models, delivering outsized values for their respective compute class. Meanwhile, our most capable and largest model, Reka Core, approaches the best frontier models on both automatic evaluations and blind human evaluations. On image question answering benchmarks (e.g. MMMU, VQAv2), Core performs competitively to GPT4-V. Meanwhile, on multimodal chat, Core ranks as the second most preferred model under a blind third-party human evaluation setup, outperforming other models such as Claude 3 Opus. On text benchmarks, Core not only performs competitively to other frontier models on a set of well-established benchmarks (e.g. MMLU, GSM8K) but also outperforms GPT4-0613 on human evaluation. On video question answering (Perception-Test), Core outperforms Gemini Ultra. Models are shipped in production at this http URL . A showcase of non cherry picked qualitative examples can also be found at this http URL .

Abstract (translated)

我们介绍了由Reka开发的 Reka Core、Flash 和 Edge 一系列强大的多模态语言模型。这些模型是由 Reka 从头训练的,能够处理和推理文本、图像、视频和音频输入。本技术报告讨论了训练这些模型的细节,并提供了全面评估结果。我们发现,Reka Edge 和 Reka Flash 不仅是最先进的,而且在各自的计算类别中还表现出色,为它们各自的计算类别带来了超额价值。与此同时,我们最强大和最大的模型 Reka Core,在自动评估和盲人评估方面都接近最佳前沿模型。在图像问答基准(如MMMU,VQAv2)上,Core与GPT4-V竞争相当。在多模态聊天中,Core在盲第三方人类评估设置下排名第二,超过了像Claude 3 Opus这样的其他模型。在文本基准上,Core不仅在一批经过良好检验的基准(如MMLU,GSM8K)上表现竞争力,而且也在人类评估中超过了GPT4-0613。在视频问答(Perception-Test)中,Core超过了Gemini Ultra。模型现在在生产环境中通过这个链接发送:http://www.reka.ai/。也可以在这个链接找到展示非 cherry-picked 高质量示例的示例:http://www.reka.ai/quality-examples/。

URL

https://arxiv.org/abs/2404.12387

PDF

https://arxiv.org/pdf/2404.12387.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot