Introduction

Deep learning has been a trending topic in recent years. As deep learning algorithms have developed, they have been applied in many domains. The two most common areas are computer vision (CV) and natural language processing (NLP), but researchers have also worked in other areas; audio is one of them. Sageev Oore and his team [1] proposed a long short-term memory (LSTM) based recurrent neural network to generate music: the algorithm learns from an input music representation, studies the patterns of the music, and then composes a new musical expression. For a deep learning approach, the size of the input is very important; if the input is too large, training time and memory capacity become issues. To deal with this problem, MIDI files are used as the raw data. A MIDI file is not an ordinary audio file like an MP3; it is better described as a music score, and a synthesizer generates the audio according to what is inside the MIDI file. Another reason for using MIDI is that music generation usually aims either to create music scores or to directly interpret them, but Oore et al. [1] argue that jointly predicting the notes together with their expressive timing and dynamics is in fact more valuable. This will be discussed in a later section.
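As an informal illustration of why MIDI is a compact representation, the following sketch (assuming the third-party pretty_midi library and a hypothetical input file performance.mid) prints the symbolic note events stored in a MIDI file; it is these score-like events, not raw audio samples, that a generation model consumes.

import pretty_midi

# Load the MIDI file; this parses symbolic note events, not audio samples.
# "performance.mid" is a placeholder file name for this illustration.
midi = pretty_midi.PrettyMIDI("performance.mid")

# Each note stores pitch, velocity (dynamics), and onset/offset timing:
# the same information a score with expressive markings would carry.
for instrument in midi.instruments:
    for note in instrument.notes[:10]:  # first ten notes, for brevity
        print(f"pitch={note.pitch} velocity={note.velocity} "
              f"start={note.start:.3f}s end={note.end:.3f}s")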
The history of music generation can be traced back to the 17th century, when people played a "musical dice game" [5]. It was an ordinary dice game except that each roll selected one of several pre-composed musical fragments, and after several rounds the selected fragments were put together into a complete piece. The first automatic music generation was thus a very simple game. Now, through the development of machine learning and deep learning algorithms, music generation can be carried out in different ways. For our research, we focus on training a machine-learning system, Performance RNN, to generate music. Its authors had great success in generating music with both timing and feeling, noting that "given the current state of the art in music generation systems, it is effective to generate the expressive timing and dynamics information concurrently with the music." So the approach is to directly generate improvised performances rather than creating or interpreting scores.
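To make "directly generating performances" concrete, the sketch below reconstructs the kind of event vocabulary described for Performance RNN in [1]: NOTE_ON and NOTE_OFF events for the 128 MIDI pitches, TIME_SHIFT events in 10 ms steps up to one second for expressive timing, and 32 VELOCITY bins for dynamics. The code is an illustrative sketch based on the published description, not the authors' implementation.

# Sketch of a Performance RNN style event vocabulary, following the
# description in [1]; the model emits one event at a time, so notes,
# timing, and dynamics are generated jointly by a single sequence model.
NOTE_ON = [f"NOTE_ON_{p}" for p in range(128)]    # press MIDI pitch p
NOTE_OFF = [f"NOTE_OFF_{p}" for p in range(128)]  # release MIDI pitch p
TIME_SHIFT = [f"TIME_SHIFT_{ms}ms"                # advance time 10 ms .. 1 s
              for ms in range(10, 1001, 10)]
VELOCITY = [f"VELOCITY_{v}" for v in range(32)]   # quantized loudness bins

VOCAB = NOTE_ON + NOTE_OFF + TIME_SHIFT + VELOCITY
print(len(VOCAB))  # 388 events in total

# Example: one C4 note (pitch 60) held for half a second at a chosen loudness.
example_performance = ["VELOCITY_20", "NOTE_ON_60", "TIME_SHIFT_500ms", "NOTE_OFF_60"]

Because timing and velocity are events in the same vocabulary as the notes themselves, a single recurrent model trained on such sequences predicts expressive timing and dynamics jointly with the music, which is exactly the property argued for in [1].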