Performance RNN

Figure 5 shows how we use Performance RNN to generate music. The first stage shown in this figure is score composition, that is, generating a music score. The music score then goes through a synthesizer and is transformed into an audio format that can be heard by a human listener.
Fig. 5 The process of music generation. The music starts with the composition of a score; that score gets turned into a performance (shown as a MIDI piano roll); that MIDI roll, in turn, gets rendered into sound using a synthesizer; and finally the resulting audio gets perceived as music by a human listener.
In this music generation process, Performance RNN acts as a composer and a performer combined: it generates the score and the expressive performance together, as a single stream of MIDI-like events.

Input representation

A MIDI file is represented as a sequence of events drawn from a vocabulary of 413 different event types.
The raw MIDI file is first converted to a NoteSequence and then to SequenceExamples. Each event in a SequenceExample is encoded as a 413-dimensional one-hot vector, which serves as both the model's input and its target output.
Fig. 6 Example of the representation used for Performance RNN. The progression illustrates how a MIDI sequence (shown as a MIDI roll consisting of a short note followed by a longer note) is converted into a sequence of commands (on the right-hand side) in our event vocabulary. Note that an arbitrary number of events can, in principle, occur between two time shifts.
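To make the representation concrete, the sketch below is a simplified, hypothetical encoding (not Magenta's exact implementation) that turns a short list of notes into NOTE_ON / NOTE_OFF / TIME_SHIFT / VELOCITY events and maps each event to an index in a one-hot vocabulary. The ranges used here (128 pitches, 100 time-shift steps, 32 velocity bins) follow the standard Performance RNN event set and give 388 indices; our 413-event vocabulary additionally covers events such as pedal.

    # Simplified sketch of a Performance-RNN-style event encoding (hypothetical,
    # not the exact Magenta vocabulary). Event kinds: NOTE_ON (128 pitches),
    # NOTE_OFF (128 pitches), TIME_SHIFT (100 steps of 10 ms), VELOCITY (32 bins).

    NOTE_ON, NOTE_OFF, TIME_SHIFT, VELOCITY = range(4)

    def encode_notes(notes, step_ms=10, max_shift=100):
        """notes: list of (pitch, velocity, start_ms, end_ms)."""
        boundaries = []  # (time_ms, kind, value)
        for pitch, vel, start, end in notes:
            boundaries.append((start, VELOCITY, vel * 32 // 128))
            boundaries.append((start, NOTE_ON, pitch))
            boundaries.append((end, NOTE_OFF, pitch))
        boundaries.sort(key=lambda b: b[0])

        events, clock = [], 0
        for time, kind, value in boundaries:
            # Insert TIME_SHIFT events to advance the clock to this boundary.
            shift = (time - clock) // step_ms
            while shift > 0:
                chunk = min(shift, max_shift)
                events.append((TIME_SHIFT, chunk - 1))
                shift -= chunk
            clock = time
            events.append((kind, value))
        return events

    def event_to_index(event):
        """Map an event to a single index in the one-hot vocabulary."""
        kind, value = event
        offsets = {NOTE_ON: 0, NOTE_OFF: 128, TIME_SHIFT: 256, VELOCITY: 356}
        return offsets[kind] + value

    # A short note followed by a longer note, as in Fig. 6.
    example = [(60, 80, 0, 200), (64, 90, 200, 800)]
    indices = [event_to_index(e) for e in encode_notes(example)]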

Training process

The first step is to convert a collection of MIDI or MusicXML files into NoteSequences. In this project we use the MAESTRO dataset. NoteSequences are protocol buffers, a fast and efficient data format that is easier to work with than raw MIDI files. See Building your Dataset for instructions on generating a TFRecord file of NoteSequences.
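A minimal sketch of this conversion step is shown below. It assumes the note_seq and tensorflow packages are available; the file paths are placeholders, and the actual project uses Magenta's dataset-building pipeline rather than this hand-rolled loop.

    # Sketch: convert a directory of MIDI files into a TFRecord of NoteSequence
    # protocol buffers. Paths are placeholders.
    import glob
    import note_seq
    import tensorflow as tf

    midi_paths = glob.glob("maestro/**/*.midi", recursive=True)

    with tf.io.TFRecordWriter("notesequences.tfrecord") as writer:
        for path in midi_paths:
            sequence = note_seq.midi_file_to_note_sequence(path)  # parse MIDI
            writer.write(sequence.SerializeToString())            # serialize proto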
We transform the raw data to NoteSequences and then to SequenceExamples. The first transformation serializes the dataset: you define how you want your data to be structured once, and then you can use specially generated source code to easily write and read your structured data to and from a variety of data streams, using a variety of languages. The second transformation generates the input of the model.
SequenceExamples are fed into the model during training and evaluation; each SequenceExample contains a sequence of inputs and a sequence of labels that together represent a performance.
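As a rough illustration of what one training example looks like, the sketch below builds a tf.train.SequenceExample whose inputs are one-hot event vectors and whose labels are the same event sequence shifted by one step (the next-event prediction target). It mirrors the idea, not Magenta's exact pipeline code.

    # Sketch: build a SequenceExample from a list of event indices. Inputs are
    # one-hot vectors; labels are the next event at each step.
    import tensorflow as tf

    VOCAB_SIZE = 413

    def make_sequence_example(event_indices):
        inputs, labels = event_indices[:-1], event_indices[1:]
        input_features = [
            tf.train.Feature(float_list=tf.train.FloatList(
                value=[1.0 if j == i else 0.0 for j in range(VOCAB_SIZE)]))
            for i in inputs
        ]
        label_features = [
            tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
            for label in labels
        ]
        feature_lists = tf.train.FeatureLists(feature_list={
            "inputs": tf.train.FeatureList(feature=input_features),
            "labels": tf.train.FeatureList(feature=label_features),
        })
        return tf.train.SequenceExample(feature_lists=feature_lists)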
Fig. 7 The basic RNN architecture consists of three hidden layers of LSTMs, each layer with 512 cells. The input is a 413-dimensional one-hot vector, as is the target, and the model outputs a categorical distribution over the same dimensionality as well. For generation, the output is sampled stochastically with beam search, while teacher forcing is used for training.
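A Keras sketch of an architecture matching this description is given below (three 512-unit LSTM layers, 413-dimensional one-hot input, categorical output over the same vocabulary). Everything not stated in Fig. 7, such as the optimizer and loss configuration, is an assumption.

    # Sketch of the RNN described in Fig. 7: three LSTM layers of 512 cells,
    # 413-dimensional one-hot input, and a categorical output distribution over
    # the same 413 events. Optimizer and loss settings here are assumptions.
    import tensorflow as tf

    VOCAB_SIZE = 413

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, VOCAB_SIZE)),   # sequence of one-hot events
        tf.keras.layers.LSTM(512, return_sequences=True),
        tf.keras.layers.LSTM(512, return_sequences=True),
        tf.keras.layers.LSTM(512, return_sequences=True),
        tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),  # next-event distribution
    ])

    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # labels are event indices
    )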

Predicting Pedal

Pedal is commonly used in piano performance: while the sustain pedal is pressed, notes continue to sound after the keys are released. In the RNN model, we experimented with predicting the sustain pedal. We applied Pedal On by directly extending the lengths of the notes: for any notes on during or after a Pedal On signal, we delay their corresponding Note Off events until the next Pedal Off signal. This made it much easier for the system to accurately predict a whole set of Note Off events all at once, as well as to predict the corresponding delay preceding them. Doing so may also have freed up resources to focus on better prediction of other events. Finally, as one might expect, including pedal made a significant subjective improvement in the quality of the resulting output.
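The sketch below illustrates this preprocessing step under simple assumptions: notes are (pitch, velocity, start, end) tuples and pedal presses are (pedal_on, pedal_off) time intervals. Any note whose Note Off falls while the pedal is held is extended to the next Pedal Off.

    def apply_sustain(notes, pedal_intervals):
        """Extend note ends according to the sustain pedal.

        notes: list of (pitch, velocity, start, end) tuples (times in seconds).
        pedal_intervals: list of (pedal_on, pedal_off) tuples, sorted by time.
        """
        extended = []
        for pitch, velocity, start, end in notes:
            for pedal_on, pedal_off in pedal_intervals:
                # The note is sounding during this pedal press and would end
                # before the release, so delay its Note Off to the Pedal Off.
                if pedal_on < end and start < pedal_off and end < pedal_off:
                    end = pedal_off
            extended.append((pitch, velocity, start, end))
        return extended

    # Example: a note released at 1.2 s while the pedal is held until 2.0 s
    # is extended to end at 2.0 s.
    print(apply_sustain([(60, 80, 0.5, 1.2)], [(1.0, 2.0)]))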

Synthesizer

A synthesizer is, in essence, a sample library containing thousands of musical instrument samples. A MIDI file contains only instructions, such as note-on and note-off events and their intensities, describing how to reproduce the performance; the synthesizer takes those instructions and uses the sound samples in its library to render the audio, turning the performance into sound that we can hear.
In the original paper, the authors used FluidSynth as the synthesizer. FluidSynth is a real-time software synthesizer based on the SoundFont 2 specification. FluidSynth itself does not have a graphical user interface, but thanks to its powerful API several applications utilize it, and it has even found its way onto embedded systems and into some mobile apps. In our project, in order to have better visualization and editability, we use FL Studio to visualize the MIDI files and also use it as the synthesizer.
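For completeness, the sketch below shows one minimal way to render a generated MIDI file to audio with FluidSynth from Python, assuming the pretty_midi and soundfile packages and a General MIDI SoundFont are available; the file paths are placeholders, and our project renders with FL Studio instead.

    # Sketch: render a generated MIDI file to audio using FluidSynth via the
    # pretty_midi package. Requires a SoundFont (.sf2); all paths are placeholders.
    import pretty_midi
    import soundfile

    midi = pretty_midi.PrettyMIDI("performance.mid")
    audio = midi.fluidsynth(fs=44100, sf2_path="soundfont.sf2")  # float waveform
    soundfile.write("performance.wav", audio, 44100)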