Data

In training Performance RNN, we want a model that can generate expressive performances: output with dynamics and natural, human-like timing rather than rigid score timing. Achieving that goal requires appropriate training data. In this project we use the MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset [4], which includes over 200 hours of virtuosic piano performances captured with fine alignment (~3 ms) between the note labels and the audio waveforms.
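As a concrete illustration, the sketch below loads one MAESTRO performance and prints the note-level information (onsets, offsets, pitches, velocities) that the model learns from. It assumes the dataset has been downloaded and that the pretty_midi library is installed; the file path is a placeholder, not an actual MAESTRO filename.

```python
import pretty_midi

# Hypothetical path to one downloaded MAESTRO performance.
midi_path = "maestro-v2.0.0/2017/example_performance.midi"
pm = pretty_midi.PrettyMIDI(midi_path)

# MAESTRO performances are solo piano, so a single instrument track is expected.
print("instrument tracks:", len(pm.instruments))
print("duration (s):", pm.get_end_time())

# Each note carries onset, offset, pitch, and velocity: the raw material
# for learning expressive timing and dynamics.
for note in pm.instruments[0].notes[:5]:
    print(note.start, note.end, note.pitch, note.velocity)
```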

Homogeneous

The MAESTRO dataset is homogeneous in a few different ways:
First, the MIDI files in the dataset are all classical music, which helps keep the generated outputs stylistically consistent.
Second, it is all solo instrumental music. The Performance RNN model we want to train generates solo piano performances, so it makes sense that all of the training data consists of solo piano performances. If the training data included music written for two or more instruments, the parts would depend on each other to sound harmonious, so even extracting only the piano part of such pieces would not fit this training purpose well (a small check of this property on the MIDI files is sketched after this list).
Third, the solo instrument is consistently the piano. Different instruments are played in different styles, so classical composers generally write in a way that is very specific to whichever instrument they are writing for; a classical score is closely tied to the timbre and characteristics of its intended instrument.
Fourth, the performances were all played by humans. All the MIDI files in the dataset were recorded from human performers during a piano competition. To mimic the expressive timing that human performers have, it is much better for the system to learn from performances by human pianists.
Fifth, all the performances were done by experts. If we wish the system to learn about human performance, that human performance must match the listener’s concept of what “human performance” sounds like, which is usually performances by experts. The casual evaluator might find themselves slightly underwhelmed were they to listen to a system that has learned to play like a beginning pianist, even if the system has done so with remarkable fidelity to the dynamic and velocity patterns that occur in that situation.
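These homogeneity assumptions are easy to spot-check on the files themselves. The hedged sketch below scans a directory of MIDI files and flags any file that is not a single, non-drum piano track, or whose velocities are constant (which would suggest score-like rather than human-recorded dynamics). The directory path is a placeholder and pretty_midi is assumed to be installed.

```python
import glob
import pretty_midi

PIANO_PROGRAMS = range(0, 8)  # General MIDI piano family

for path in glob.glob("maestro-v2.0.0/**/*.midi", recursive=True):
    pm = pretty_midi.PrettyMIDI(path)
    tracks = [inst for inst in pm.instruments if not inst.is_drum]
    solo_piano = len(tracks) == 1 and tracks[0].program in PIANO_PROGRAMS
    # Human performances should show varied velocities, not one constant value.
    velocities = {note.velocity for note in tracks[0].notes} if tracks else set()
    if not solo_piano or len(velocities) < 2:
        print("unexpected file:", path)
```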

Realizable

Using a dataset in which the solo instrument is the piano has other benefits. Synthesizing audio from MIDI can be a challenging problem for some instruments: for example, having the velocities, note durations, and timing of violin music would not immediately lead to good-sounding violin audio, and the problem is even more evident if one considers synthesizing vocals from MIDI. Here, the fact that the piano is a percussive instrument buys us an important benefit: synthesizing piano music from MIDI can sound quite realistic (compared to synthesizing instruments that allow continuous timbral control). Thus, when we generate data, we can properly realize it in audio space and therefore have a good point of comparison. Conversely, capturing the MIDI data of piano playing provides a sufficiently rich set of parameters that we can later learn enough to render audio. Note that with violin or voice, for example, we would need to capture many more parameters than those typically available in the MIDI protocol to get a sufficiently meaningful set of parameters for expressive performance.
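As a hedged sketch of this realizability property, the snippet below renders a MIDI performance to audio with a sample-based piano. It assumes pretty_midi, pyfluidsynth, and soundfile are installed and that a piano SoundFont is available; the file paths are placeholders.

```python
import pretty_midi
import soundfile as sf

pm = pretty_midi.PrettyMIDI("generated_performance.midi")  # hypothetical generated sample
audio = pm.fluidsynth(fs=44100, sf2_path="piano.sf2")      # render with a piano SoundFont
sf.write("generated_performance.wav", audio, 44100)
```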

Evaluation metrics

Evaluating musical expression is not quite like evaluating standard machine-learning outputs. In classification and regression problems the output is a single value, which might represent a house price, a temperature, or a category label, and it can be compared directly against the ground truth. A musical performance, however, cannot be reduced to a single value in this way. For this reason, the quantitative evaluation metric we use is the log-loss, i.e. the negative log-probability the model assigns to the events that actually occur.
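The toy numbers below illustrate the log-loss computation; in practice the predicted distributions would come from the model's output over the next performance event (note-on, note-off, time-shift, velocity change) at each step.

```python
import numpy as np

# Toy predicted distributions over the next event at three time steps.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])
targets = np.array([0, 1, 2])  # index of the event that actually occurred

log_loss = -np.mean(np.log(probs[np.arange(len(targets)), targets]))
print("average log-loss:", log_loss)  # lower is better
```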
Also, Sageev Oore and his team [1] used another way to evaluate the generated samples. They compared hand-selected statistical features of the generated performances with the same features computed from human performances. They noted that, for the relationship between MIDI pitch number and MIDI velocity, the best-fit polynomials of the first three orders are remarkably similar, which suggests that the generated performances are plausible. For our research, we simply shared the generated performances with our class and let the audience decide whether they sound like good performances.
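A hedged sketch of that feature comparison is shown below: it fits low-order polynomials relating MIDI pitch to velocity for a human performance and for a generated one, so the coefficients can be compared. The two file paths are placeholders, and pretty_midi and numpy are assumed to be installed.

```python
import numpy as np
import pretty_midi

def pitch_velocity_fit(midi_path, degree):
    """Fit a polynomial of the given degree mapping MIDI pitch to velocity."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    notes = [note for inst in pm.instruments for note in inst.notes]
    pitches = np.array([note.pitch for note in notes])
    velocities = np.array([note.velocity for note in notes])
    return np.polyfit(pitches, velocities, degree)

for degree in (1, 2, 3):  # the first three polynomial orders mentioned above
    human = pitch_velocity_fit("human_performance.midi", degree)
    generated = pitch_velocity_fit("generated_performance.midi", degree)
    print(degree, human, generated)  # similar coefficients suggest similar dynamics
```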