Paper Reading AI Learner

Composer's Assistant: Interactive Transformers for Multi-Track MIDI Infilling

2023-01-29 19:45:10
Martin E. Malandro

Abstract

We consider the task of multi-track MIDI infilling when arbitrary (track, measure) pairs of information have been deleted from a contiguous slice of measures from a MIDI file. We train two T5-like models to solve this task, one using a basic MIDI-like event vocabulary and one using a joined word-like version of this vocabulary. We introduce a new test set, created from the Lakh MIDI dataset, consisting of 9 multi-track MIDI infilling tasks. We evaluate our models on these tasks and find that one model works better on some tasks while the other works better on others. Our results have implications for the training of neural networks in other small-vocabulary domains, such as byte sequence modeling and protein sequence modeling. We release our source code, and we demonstrate that our models are capable of enabling real-time human-computer interactive composition in the REAPER digital audio workstation.

Abstract (translated)

当从一个MIDI文件中删除任意(轨道,测量)对信息时,我们考虑解决多轨MIDI填充任务。我们训练了两个T5-like模型来解决该任务,一个使用基本的MIDI事件词汇表,另一个使用这个词汇表的连接词版本。我们介绍了从一千份MIDI数据集中创建的新测试集,其中包括9个多轨MIDI填充任务。我们对这些任务进行评估,发现一个模型在某些任务上表现更好,另一个模型在另一些任务上表现更好。我们的结果对其他小型词汇 domains,例如字节序列建模和蛋白质序列建模,产生了影响。我们发布了我们的源代码,并展示了我们的模型能够在REAPER数字音频工作站中实现实时人类计算机互动创作。

URL

https://arxiv.org/abs/2301.12525

PDF

https://arxiv.org/pdf/2301.12525.pdf


Tags
3D Action Action_Localization Action_Recognition Activity Adversarial Agent Attention Autonomous Bert Boundary_Detection Caption Chat Classification CNN Compressive_Sensing Contour Contrastive_Learning Deep_Learning Denoising Detection Dialog Diffusion Drone Dynamic_Memory_Network Edge_Detection Embedding Embodied Emotion Enhancement Face Face_Detection Face_Recognition Facial_Landmark Few-Shot Gait_Recognition GAN Gaze_Estimation Gesture Gradient_Descent Handwriting Human_Parsing Image_Caption Image_Classification Image_Compression Image_Enhancement Image_Generation Image_Matting Image_Retrieval Inference Inpainting Intelligent_Chip Knowledge Knowledge_Graph Language_Model LLM Matching Medical Memory_Networks Multi_Modal Multi_Task NAS NMT Object_Detection Object_Tracking OCR Ontology Optical_Character Optical_Flow Optimization Person_Re-identification Point_Cloud Portrait_Generation Pose Pose_Estimation Prediction QA Quantitative Quantitative_Finance Quantization Re-identification Recognition Recommendation Reconstruction Regularization Reinforcement_Learning Relation Relation_Extraction Represenation Represenation_Learning Restoration Review RNN Robot Salient Scene_Classification Scene_Generation Scene_Parsing Scene_Text Segmentation Self-Supervised Semantic_Instance_Segmentation Semantic_Segmentation Semi_Global Semi_Supervised Sence_graph Sentiment Sentiment_Classification Sketch SLAM Sparse Speech Speech_Recognition Style_Transfer Summarization Super_Resolution Surveillance Survey Text_Classification Text_Generation Tracking Transfer_Learning Transformer Unsupervised Video_Caption Video_Classification Video_Indexing Video_Prediction Video_Retrieval Visual_Relation VQA Weakly_Supervised Zero-Shot