Performance RNN
Figure 5 shows how we use Performance RNN to generate music. The first
stage shown in this figure is score composition, which produces the
music score. The music score then goes through a synthesizer and is
transformed into audio that can be heard by a human listener.
Fig. 5 The process of music generation. The music starts with the
composition of a score; that score gets turned into a performance (shown
as a MIDI piano roll); that MIDI roll, in turn, gets rendered into sound
using a synthesizer; and finally the resulting audio gets perceived as
music by a human listener.
In this music generation process, Performance RNN acts as a composer and
a performer combined: it generates both the music score and its
performance directly.
Input representation
A MIDI file is represented as a sequence of events drawn from a
vocabulary of 413 different event types:
- 128 Note-On events: represent MIDI pitches. Each one starts a new
note.
- 128 Note-Off events: represent MIDI pitches. Each one releases a note.
- 125 Time-Shift events: each one moves the current time forward, in
increments of 8 ms, by up to 1 s.
- 32 Velocity events: each one changes the velocity applied to all
subsequent notes until the next velocity event.
The raw MIDI file is first converted to a NoteSequence and then to
SequenceExamples. Both the inputs and the targets in a SequenceExample
are encoded as 413-dimensional one-hot vectors.
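
To illustrate how these events become model inputs, here is a minimal sketch of the event-to-index mapping and the 413-dimensional one-hot encoding. The index layout and the helper names below are assumptions chosen for illustration; Magenta's own encoder may order the vocabulary differently.

    import numpy as np

    # Assumed index layout for the 413-event vocabulary (illustrative ordering):
    #   0..127    Note-On  (MIDI pitch 0..127)
    #   128..255  Note-Off (MIDI pitch 0..127)
    #   256..380  Time-Shift (1..125 steps of 8 ms, i.e. 8 ms..1 s)
    #   381..412  Velocity (32 quantized velocity bins)
    VOCAB_SIZE = 413

    def event_to_index(kind, value):
        """Map a performance event to its index in the 413-event vocabulary."""
        if kind == "note_on":      # value: MIDI pitch 0..127
            return value
        if kind == "note_off":     # value: MIDI pitch 0..127
            return 128 + value
        if kind == "time_shift":   # value: number of 8 ms steps, 1..125
            return 256 + (value - 1)
        if kind == "velocity":     # value: quantized velocity bin 0..31
            return 381 + value
        raise ValueError(f"unknown event kind: {kind}")

    def one_hot(index, size=VOCAB_SIZE):
        vec = np.zeros(size, dtype=np.float32)
        vec[index] = 1.0
        return vec

    # A short note: set velocity, start pitch 60, wait ~250 ms (31 steps), release.
    events = [("velocity", 20), ("note_on", 60), ("time_shift", 31), ("note_off", 60)]
    sequence = np.stack([one_hot(event_to_index(k, v)) for k, v in events])
    print(sequence.shape)  # (4, 413)
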
Fig. 6 Example of the representation used for Performance RNN. The
progression illustrates how a MIDI sequence (shown, e.g., as a MIDI roll
consisting of a short note followed by a longer note) is converted into
a sequence of commands (on the right-hand side) in our event vocabulary.
Note that an arbitrary number of events can in principle occur between
two time shifts.
Training process
The first step is to convert a collection of MIDI or MusicXML files
into NoteSequences. In this project we use the MAESTRO dataset.
NoteSequences are protocol buffers, a fast and efficient data format
that is easier to work with than raw MIDI files. See Building your
Dataset for instructions on generating a TFRecord file of NoteSequences.
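
As a concrete example of this step, the sketch below converts a directory of MIDI files into a TFRecord of NoteSequence protos, assuming the note_seq library and TensorFlow are available. Magenta also ships a dedicated dataset-building script for this, so treat this only as an illustration of what the step does.

    import glob
    import note_seq
    import tensorflow as tf

    def midi_dir_to_tfrecord(midi_dir, output_path):
        """Convert every MIDI file in a directory to a TFRecord of NoteSequences."""
        with tf.io.TFRecordWriter(output_path) as writer:
            for midi_path in sorted(glob.glob(f"{midi_dir}/*.mid*")):
                # Parse the MIDI file into a NoteSequence protocol buffer.
                sequence = note_seq.midi_file_to_note_sequence(midi_path)
                # Serialize the proto and append it to the TFRecord file.
                writer.write(sequence.SerializeToString())

    midi_dir_to_tfrecord("maestro_midi", "notesequences.tfrecord")
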
We transform the raw data to NoteSequences and then to SequenceExamples.
The first transformation serializes the dataset: you define how you
want your data to be structured once, and then you can use specially
generated source code to easily write and read your structured data to
and from a variety of data streams, using a variety of languages.
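
To make the serialization idea concrete, the following sketch builds a tiny NoteSequence, serializes it to bytes (the form written to the TFRecord), and parses it back, assuming the note_seq protobuf definitions.

    from note_seq.protobuf import music_pb2

    # Build a tiny NoteSequence: one half-second middle C at velocity 80.
    sequence = music_pb2.NoteSequence()
    sequence.notes.add(pitch=60, velocity=80, start_time=0.0, end_time=0.5)
    sequence.total_time = 0.5

    # Serialize to bytes and parse it back, as any protobuf-aware reader could.
    data = sequence.SerializeToString()
    restored = music_pb2.NoteSequence.FromString(data)
    assert restored.notes[0].pitch == 60
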
The second transformation generates the input of the model.
SequenceExamples are fed into the model during training and evaluation.
Each SequenceExample contains a sequence of inputs and a sequence of
labels that together represent a performance.
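
A simplified view of how one performance becomes such an input/label pair is sketched below, reusing the illustrative index layout from the earlier encoding sketch: the labels are the event indices shifted by one step, so the model learns to predict the next event (teacher forcing, as noted in the caption of Fig. 7). The exact SequenceExample layout produced by Magenta's pipeline may differ.

    import numpy as np

    def make_training_pair(event_indices, vocab_size=413):
        """Split one performance into model inputs and next-event labels.

        inputs[t] is the one-hot encoding of event t, and labels[t] is the
        index of event t+1, so the model is trained to predict the next event.
        """
        event_indices = np.asarray(event_indices)
        inputs = np.eye(vocab_size, dtype=np.float32)[event_indices[:-1]]
        labels = event_indices[1:]
        return inputs, labels

    # velocity, note-on 60, time-shift, note-off 60 (indices per the layout above)
    inputs, labels = make_training_pair([395, 60, 286, 188])
    print(inputs.shape, labels.shape)  # (3, 413) (3,)
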
Fig. 7 The basic RNN architecture consists of three hidden layers of
LSTMs, each with 512 cells. The input is a 413-dimensional one-hot
vector, as is the target, and the model outputs a categorical
distribution over the same 413 event classes. For generation, the
output is sampled stochastically with beam search, while teacher forcing
is used for training.
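
As a rough illustration of the architecture in Fig. 7, the tf.keras sketch below stacks three 512-cell LSTM layers over 413-dimensional one-hot inputs and outputs a categorical distribution over the same vocabulary. The optimizer and loss settings are assumptions, not the exact configuration used by Magenta.

    import tensorflow as tf

    VOCAB_SIZE = 413  # 128 note-on + 128 note-off + 125 time-shift + 32 velocity

    def build_performance_rnn():
        """Three stacked LSTM layers of 512 cells, per the Fig. 7 architecture."""
        model = tf.keras.Sequential([
            tf.keras.layers.LSTM(512, return_sequences=True,
                                 input_shape=(None, VOCAB_SIZE)),
            tf.keras.layers.LSTM(512, return_sequences=True),
            tf.keras.layers.LSTM(512, return_sequences=True),
            # Categorical distribution over the 413 event classes at every step.
            tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
        ])
        # Assumed training settings; labels are integer event indices.
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        return model

    model = build_performance_rnn()
    model.summary()
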
Predicting Pedal
The sustain pedal is commonly used in piano performance: it extends a
note for as long as the pedal is pressed. In the RNN model, we
experimented with predicting the sustain pedal. We applied Pedal On by
directly extending the lengths of the notes: for any notes that are on
during or after a Pedal On signal, we delay their corresponding Note Off
events until the next Pedal Off signal. This made it much easier for the
system to accurately predict a whole set of Note Off events all at once,
as well as the corresponding delay preceding them. Doing so may also
have freed up resources to focus on better prediction of other events.
Finally, as one might expect, including pedal made a significant
subjective improvement in the quality of the resulting output.
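
A small sketch of this note-extension rule is given below; the tuple-based note and pedal-interval representation is an assumption chosen for clarity, not the actual data structures of the pipeline.

    def apply_sustain_pedal(notes, pedal_intervals):
        """Extend note releases to the next pedal-off, per the rule above.

        notes:           list of (onset, offset, pitch) tuples, times in seconds
        pedal_intervals: sorted, non-overlapping (pedal_on, pedal_off) tuples
        """
        extended = []
        for onset, offset, pitch in notes:
            new_offset = offset
            for pedal_on, pedal_off in pedal_intervals:
                # If the note is still sounding while this pedal press is active,
                # delay its release until the pedal is lifted.
                if onset < pedal_off and new_offset > pedal_on:
                    new_offset = max(new_offset, pedal_off)
            extended.append((onset, new_offset, pitch))
        return extended

    notes = [(0.0, 0.5, 60), (1.0, 1.2, 64)]
    pedal = [(0.4, 2.0)]
    print(apply_sustain_pedal(notes, pedal))
    # [(0.0, 2.0, 60), (1.0, 2.0, 64)] -- both releases delayed to pedal-off
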
Synthesizer
A synthesizer is built around a sample library containing thousands of
instrument samples. A MIDI file contains only instructions, such as
note-on and note-off events and their intensities, describing how to
reproduce the performance; the synthesizer takes those instructions and
uses the sound samples in its library to reproduce the audio, turning
the performance into sound that we can hear.
In the original paper, the authors used FluidSynth as the synthesizer.
FluidSynth is a real-time software synthesizer based on the SoundFont 2
specification. FluidSynth itself does not have a graphical user
interface, but thanks to its powerful API it is used by several
applications and has even found its way onto embedded systems and into
some mobile apps. In our project, in order to have better visualization
and editability, we use FL Studio both to visualize the MIDI files and
as the synthesizer.
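
For completeness, rendering a generated MIDI file to audio can also be scripted, for example by shelling out to the FluidSynth command-line tool as sketched below; the SoundFont path is a placeholder, and the flags may need adjusting for a particular installation.

    import subprocess

    def render_midi_to_wav(midi_path, wav_path, soundfont="FluidR3_GM.sf2",
                           sample_rate=44100):
        """Render a MIDI file to a WAV file with the FluidSynth CLI."""
        subprocess.run(
            [
                "fluidsynth",
                "-ni",              # -n: no MIDI input driver, -i: non-interactive
                soundfont,          # SoundFont 2 sample library
                midi_path,          # MIDI performance to render
                "-F", wav_path,     # write the rendered audio to this file
                "-r", str(sample_rate),
            ],
            check=True,
        )

    render_midi_to_wav("performance.mid", "performance.wav")
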