Related knowledge

Scores

A musical score is a representation of music: it precedes sound, and a transformation is needed to turn this representation into sound. Such a transformation may be the conversion from digital to analog waves, or a human performance of the written score. A musical score shows which notes to play and when to play them relative to each other, as shown in Fig. 1. Timing in a score is a strict concept: for example, quarter notes have the same duration as quarter-note rests and twice the duration of eighth notes, and so on. But the score does not specify how many seconds a note should last. Music is a creative art; a musical score is only the baseline for a performance, and the mapping from score to music is full of subtlety and complexity.
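The relative timing described above can be sketched in a few lines. This is a minimal illustration, not part of any score standard: the note-value table and the tempo of 120 BPM are chosen for the example; only the ratios between note values come from the text.

```python
# Relative note values measured in quarter notes, as described above:
# a quarter note is twice an eighth note, and so on.
NOTE_VALUES = {
    "whole": 4.0,
    "half": 2.0,
    "quarter": 1.0,
    "eighth": 0.5,
    "sixteenth": 0.25,
}

def duration_seconds(note_value: str, bpm: float) -> float:
    """Convert a relative note value to seconds for a given tempo.

    The score fixes only the ratios; the tempo in beats per minute,
    chosen by the performer, determines the absolute duration.
    """
    seconds_per_quarter = 60.0 / bpm
    return NOTE_VALUES[note_value] * seconds_per_quarter
```

At 120 BPM a quarter note lasts 0.5 s and an eighth note 0.25 s; at 60 BPM the same symbols last twice as long, which is exactly the score-to-sound gap the text describes.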
Fig. 1 A typical musical score

MIDI

MIDI (Musical Instrument Digital Interface) is a technical standard that describes a communication protocol for a wide variety of electronic musical instruments, computers, and related audio devices for playing, editing, and recording music.
A MIDI file contains two kinds of chunks: a header chunk and track chunks. The header chunk describes the file format and the number of track chunks. Each track chunk has its own header and can contain an arbitrary number of MIDI events. These events are identical to the actual data sent and received by the MIDI ports of a synthesizer, with one addition: each MIDI event is preceded by a delta-time, the number of ticks after which the event is to be executed. The number of ticks per quarter note is defined in the file's header chunk, and the delta-time is stored as a variable-length encoded value [2]. A MIDI event command consists of two parts: the first 4 bits contain the actual command, and the remaining 4 bits contain the MIDI channel on which the command is executed. There are 16 MIDI channels and 8 MIDI commands, such as Note Off (1000xxxx), Note On (1001xxxx), and Key After-touch (1010xxxx). In total, 128 notes can be represented via a table that maps numbers to notes.
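The two low-level details above, variable-length delta-times and the command/channel split of the status byte, can be decoded as follows. This is a sketch of the byte-level rules only, not a full MIDI parser:

```python
def read_varlen(data: bytes, pos: int) -> tuple[int, int]:
    """Decode a MIDI variable-length quantity starting at pos.

    Each byte contributes 7 bits of the value; the high bit of a
    byte signals that another byte follows. Returns the decoded
    value and the position just past it.
    """
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:  # high bit clear: this was the last byte
            return value, pos

def split_status(status: int) -> tuple[int, int]:
    """Split a status byte into (command, channel).

    The upper 4 bits hold the command (e.g. 0x9 = Note On) and the
    lower 4 bits hold the channel (0-15).
    """
    return status >> 4, status & 0x0F
```

For example, the two bytes `0x81 0x48` decode to a delta-time of 200 ticks, and the status byte `0x93` splits into command 9 (Note On) on channel 3.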
Fig. 2 A visualized MIDI file in FL Studio Piano roll. The horizontal axis represents time; the vertical axis represents pitch; each rectangle is a note; and the length of the rectangle corresponds to the duration of the note.
We can think of a score as a highly abstract representation of music, while MIDI can be visualized as a piano roll. Figure 2 shows an example of the piano roll display in FL Studio. Each row corresponds to one of the 128 possible MIDI pitches, and each column corresponds to a uniform time step.
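The piano-roll view of Fig. 2 maps directly onto a 128-row grid. A minimal sketch, assuming notes are given as hypothetical `(pitch, start_step, end_step)` tuples:

```python
# Build a 128 x num_steps grid of 0/1 values: rows are MIDI pitches,
# columns are uniform time steps, and a 1 marks a sounding note,
# mirroring the rectangles of the piano roll in Fig. 2.
def to_piano_roll(notes, num_steps):
    roll = [[0] * num_steps for _ in range(128)]
    for pitch, start, end in notes:
        for step in range(start, min(end, num_steps)):
            roll[pitch][step] = 1
    return roll
```

A note's duration becomes the length of its run of 1s in a row, just as a longer rectangle in the piano roll means a longer note.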

Recurrent Neural Network

A recurrent neural network (RNN) is a generalization of the feed-forward neural network that has an internal memory. It applies the same function to every element of the input sequence, while the output for the current input depends on the previous computation [3, 7]. After producing an output, the output is copied and fed back into the recurrent network. In the hidden stage, the new input is combined with the previous weighted output to generate the new output.
Fig. 3 RNN structure
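The recurrence above can be made concrete with a minimal sketch. To keep it readable, this assumes a single hidden unit and scalar inputs, and the weights (0.5, 0.8) are arbitrary placeholders rather than learned values:

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One RNN step: mix the current input with the previous hidden
    state through shared weights and a tanh nonlinearity."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_sequence(xs, h0=0.0):
    """Apply the same rnn_step to every input, carrying the state
    forward, so each output depends on all previous inputs."""
    h = h0
    states = []
    for x in xs:
        h = rnn_step(x, h)
        states.append(h)
    return states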
For sequence learning, RNNs work well, but they still have limitations. The RNN architecture captures dependencies in only one direction of its learning target, it is not good at capturing long-term dependencies, and the vanishing gradient problem may occur. Moreover, since every hidden stage depends on the previous output, training time grows with the length of the sequence.

Long short-term memory network

To solve the vanishing gradient problem of RNNs, the Long Short-Term Memory network (LSTM) was proposed. The difference between RNN and LSTM is not fundamental; the improvement lies in using different functions to compute the hidden stage. The memory units in an LSTM are called cells, and they decide what to keep in memory. Additional functions, called gates, allow the LSTM to add or remove information from the cell state. An LSTM has three gates to preserve and discard information in the cell state: the forget gate, the input gate, and the output gate.
The first layer is the forget gate, which decides what information is thrown away: the new input and the previous hidden state are combined with a new weight and passed through a sigmoid function with output range [0, 1], where 1 means "completely keep this" and 0 means "get rid of this". The next gate is the input gate, which decides what new information is stored in the cell state. It consists of two layers: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of new candidate values. The next step updates the previous cell state with the value obtained from the forget gate and then adds the new candidates. In the final step, the output gate decides what to output: a sigmoid layer first decides which parts of the cell state to output, then the cell state is passed through tanh and multiplied by the output of the sigmoid layer.
Fig. 4 LSTM structure
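The gate sequence just described can be sketched as a single scalar LSTM step. The shared weights `w`, `u`, `b` are hypothetical placeholders; a real LSTM learns a separate weight matrix per gate for both the input and the hidden state:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.5, b=0.0):
    """One scalar LSTM step, following the gate order in the text."""
    z = w * x + u * h_prev + b
    f = sigmoid(z)            # forget gate: what to throw away [0,1]
    i = sigmoid(z)            # input gate: what to store
    c_tilde = math.tanh(z)    # candidate values for the cell state
    c = f * c_prev + i * c_tilde  # update the cell state
    o = sigmoid(z)            # output gate: what parts to expose
    h = o * math.tanh(c)      # new hidden state
    return h, c
```

The key design choice is the additive cell-state update `c = f * c_prev + i * c_tilde`: gradients can flow through this sum largely unchanged, which is what mitigates the vanishing gradient problem mentioned below.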
LSTM solves the vanishing gradient problem of RNNs and can learn better relationships between inputs, so its performance is better on long input sequences.

Problems in this area

Metric Abstraction: Many compositional systems abstract rhythm in relation to an underlying grid, with metric-based units such as eighth notes and triplets. Often this is further restricted to step sizes that are powers of two. Such abstraction is oblivious to many essential musical devices, including, for example, rubato and swing as described in Sect. 1.1.1. We choose a temporal representation based on absolute time intervals between events, rounded to 8 ms.
No Dynamics: Nearly every compositional system represents notes as ON or OFF. This binary representation ignores dynamics, which are an essential aspect of how music is perceived. Even if the dynamic level were treated as a global parameter applied equally to simultaneous notes, this would still defeat the ability of dynamics to differentiate between voices, or to compensate for a dense accompaniment (best played quietly) underneath a sparse melody. We allow each note to have its own dynamic level.
Monophony: Some systems only generate monophonic sequences. Admittedly, this is a natural starting point, and the need to limit output to monophony is in this sense entirely understandable. It can work very well for instruments such as voice and violin, where the performer also has sophisticated control beyond quantized pitch and the velocity of the note attack. The perceived quality of monophonic sequences may be inextricably tied to these other dimensions, which are difficult to capture and usually absent from MIDI sequences.