Data
In the training process of Performance RNN, we want to obtain a model that
can generate expressive performances: a model that can generate scores with
dynamics and more natural timing. To achieve that goal, we need proper
training data. In this project, we use the MAESTRO (MIDI and Audio Edited
for Synchronous TRacks and Organization) dataset [4], which includes over
200 hours of virtuosic piano performances captured with fine alignment
(~3 ms) between note labels and audio waveforms.
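As a concrete illustration (not part of the original pipeline), the sketch below uses the pretty_midi library to open one MAESTRO performance and print the expressive parameters the model learns from; the file path is a hypothetical example.

import pretty_midi

pm = pretty_midi.PrettyMIDI("maestro-v2.0.0/2017/example_performance.midi")  # hypothetical path
piano = pm.instruments[0]  # MAESTRO performances are solo piano: a single instrument track
for note in piano.notes[:10]:
    # Each note carries the expressive parameters we want to model:
    # onset/offset times in seconds and a MIDI velocity (dynamics).
    print(f"pitch={note.pitch:3d} velocity={note.velocity:3d} "
          f"start={note.start:7.3f}s end={note.end:7.3f}s")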
Homogeneous
The MAESTRO dataset is homogeneous in a few different ways:
First, the MIDI files in the dataset are all classical music. This helps
keep the outputs consistent.
Second, it is all solo instrumental music. The Performance RNN model we
want to train should generate solo piano performances, so it makes sense
that all the training data is solo piano performance. If the training data
included pieces written for two or more instruments, it would no longer fit
this training purpose (solo piano performance): in parts of such pieces,
the instruments depend on each other to sound harmonious, so even if we
used only the piano part of the data, it still would not fit the training
process well.
Third, the solo instrument is consistently piano. Different instruments
have different styles of performing; therefore, classical composers
generally write in a way that is very specific to whichever instrument
they are writing for. Classical music scores are closely related to the
timbre of that instrument, due to the different characteristics of the
instrument.
Fourth, the piano performances were all done by humans. All the MIDI
files in the dataset were recorded from human performers during piano
competitions. To mimic the expressive timing that human performers have,
it is much better for the system to learn from performances done by human
pianists.
Fifth, all the performances were done by experts. If we wish the system
to learn about human performance, that human performance must match the
listener’s concept of what “human performance” sounds like, which is
usually performances by experts. The casual evaluator might find
themselves slightly underwhelmed were they to listen to a system that
has learned to play like a beginning pianist, even if the system has
done so with remarkable fidelity to the dynamic and velocity patterns
that occur in that situation.
Realizable
Using a dataset in which the solo instrument is piano has other benefits.
Synthesizing audio from MIDI can be a challenging problem for some
instruments. For example, having the velocities, note durations, and
timing of violin music would not immediately lead to good-sounding
violin audio at all. The problems are even more evident if one considers
synthesizing vocals from MIDI. Here, the fact that the piano is a percussive
instrument buys us an important benefit: synthesizing piano music from
MIDI can sound quite realistic (compared to synthesizing instruments
that allow continuous timbral control). Thus, when we generate data, we
can properly realize it in audio space and therefore have a good point
of comparison. Conversely, capturing the MIDI data of piano playing
provides us with a sufficiently rich set of parameters that we can later
learn enough in order to be able to render audio. Note that with violin
or voice, for example, we would need to capture many more parameters
than those typically available in the MIDI protocol in order to get a
sufficiently meaningful set of parameters for expressive performance.
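As a rough illustration of this point, the sketch below renders a piano MIDI file to audio with pretty_midi and FluidSynth; the file names and the soundfont path are placeholders, and it assumes FluidSynth and a piano soundfont are installed.

import pretty_midi
import soundfile as sf

pm = pretty_midi.PrettyMIDI("generated_performance.midi")  # hypothetical generated sample
audio = pm.fluidsynth(fs=44100, sf2_path="piano.sf2")      # render with a piano soundfont
sf.write("generated_performance.wav", audio, 44100)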
Evaluation metrics
Evaluating musical expression is not quite like evaluating basic static
data. The two common types of machine learning problems, classification
and regression, produce outputs that are simply single values, which may
represent a house price, a temperature, or a category. Musical expression,
however, is not a single value that can represent the result. For this
reason, the evaluation metric used is the log-loss.
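For concreteness, a minimal sketch of computing the log-loss (average negative log-likelihood) over a sequence of predicted event distributions is shown below; the array shapes and values are toy assumptions, not the actual model outputs.

import numpy as np

def sequence_log_loss(predicted_probs, true_events):
    # predicted_probs: (num_steps, num_classes) probabilities over event types
    # true_events:     (num_steps,) indices of the events that actually occurred
    p_true = predicted_probs[np.arange(len(true_events)), true_events]
    return float(-np.mean(np.log(p_true + 1e-12)))  # lower is better

# Toy example with 3 time steps and 4 event classes.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.1, 0.7]])
print(sequence_log_loss(probs, np.array([0, 1, 3])))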
In addition, Sageev Oore [1] and his team used another way to evaluate
the generated samples. They compared hand-selected statistical features
of the generated samples with the same features computed from human music
performances. They noted that, for the relationship between MIDI pitch
number and MIDI velocity, the best-fit polynomials up to third order are
remarkably similar, which could indicate that the generated performances
are good. For our research, we only share the generated performances with
our class and let the audience decide whether each is a good performance.
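As a sketch of the kind of feature comparison described above, the code below fits a low-order polynomial to the velocity-versus-pitch relationship of a human performance and of a generated one and prints the coefficients; the file paths are hypothetical, and the use of pretty_midi and NumPy is an assumption for illustration.

import numpy as np
import pretty_midi

def velocity_pitch_fit(midi_path, order=3):
    pm = pretty_midi.PrettyMIDI(midi_path)
    notes = pm.instruments[0].notes
    pitches = np.array([n.pitch for n in notes], dtype=float)
    velocities = np.array([n.velocity for n in notes], dtype=float)
    # Coefficients of the best-fit polynomial velocity ~ f(pitch).
    return np.polyfit(pitches, velocities, order)

human_fit = velocity_pitch_fit("human_performance.midi")      # hypothetical paths
model_fit = velocity_pitch_fit("generated_performance.midi")
print("human fit coefficients:    ", human_fit)
print("generated fit coefficients:", model_fit)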