Related knowledge
Scores
A musical score is a representation of music: it exists before sound, and some transformation is required to turn the representation into sound. That transformation may be the conversion from digital to analog waves, or a human performance of the written score. A score shows which notes to play and when to play them relative to each other, as in Fig. 1. Timing in a score is a strict but relative concept: a quarter note lasts as long as a quarter-note rest and twice as long as an eighth note, and so on. The score does not, however, specify how many seconds a note should last. Music is a creative art; the score is only a baseline for performance, and the mapping from score to music is full of subtlety and complexity.
Fig. 1 A typical musical score
MIDI
MIDI (Musical Instrument Digital Interface) is a technical standard that
describes a communication protocol for a wide variety of electronic
musical instruments, computers, and related audio devices for playing,
editing, and recording music.
A MIDI file contains two kinds of chunks: a header chunk and track chunks. The header chunk describes the file format and the number of track chunks. Each track chunk has its own header and can contain arbitrarily many MIDI events. These events are identical to the actual data sent and received by the MIDI ports of a synthesizer, with one addition: each event is preceded by a delta-time, the number of ticks after which the event is to be executed. The number of ticks per quarter note is defined in the file's header chunk, and the delta-time itself is a variable-length encoded value [2]. A MIDI event command consists of two parts: the first 4 bits contain the actual command, and the remaining 4 bits the MIDI channel on which the command is executed. There are 16 MIDI channels and 8 commands, such as Note Off (1000xxxx), Note On (1001xxxx), and Key Aftertouch (1010xxxx). In total, 128 notes can be represented, via a table that maps note numbers to pitches.
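As a sketch of the two encodings just described, the following Python helpers decode a variable-length delta-time and map a MIDI note number to a pitch name. The function names are ours, not part of any MIDI library.

```python
def read_vlq(data, offset=0):
    """Decode a MIDI variable-length quantity starting at `offset`.

    Each byte contributes 7 bits of the value; a set high bit means
    more bytes follow. Returns (value, next_offset).
    """
    value = 0
    while True:
        byte = data[offset]
        offset += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80 == 0:
            return value, offset

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_name(number):
    """Map a MIDI note number (0-127) to a pitch name with octave,
    using the common convention that note 60 is middle C (C4)."""
    return f"{NOTE_NAMES[number % 12]}{number // 12 - 1}"
```

For example, the two-byte sequence `0x81 0x40` decodes to a delta-time of 192 ticks, and note number 60 corresponds to middle C.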
Fig. 2 A visualized MIDI file in FL Studio Piano roll. The horizontal
axis represents time; the vertical axis represents pitch; each rectangle
is a note; and the length of the rectangle corresponds to the duration
of the note.
We can think of a score as a highly abstract representation of music; a MIDI file, by contrast, can be visualized concretely as a piano roll. Figure 2 shows a piano roll displayed in FL Studio. Each row corresponds to one of the 128 possible MIDI pitches, and each column to a uniform time step.
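This pitch-by-time grid can be sketched as a binary matrix; the note-tuple format below is our own simplification, and NumPy is assumed.

```python
import numpy as np

def to_piano_roll(notes, num_steps):
    """Build a 128 x num_steps binary piano roll.

    `notes` is a list of (pitch, start, end) tuples, where pitch is a
    MIDI note number (0-127) and start/end are time-step indices.
    """
    roll = np.zeros((128, num_steps), dtype=np.int8)
    for pitch, start, end in notes:
        roll[pitch, start:end] = 1  # mark the cells the note occupies
    return roll
```

Each rectangle in Fig. 2 corresponds to a horizontal run of 1s in one row of this matrix.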
Recurrent Neural Network
A recurrent neural network (RNN) is a generalization of the feed-forward neural network that has an internal memory. It applies the same function to every element of the input, while the output at the current step also depends on the previous computation [3, 7]. After an output is produced, it is copied and fed back into the network; in the hidden state, the new input is combined with the weighted previous output to generate the next output.
Fig. 3 RNN structure
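The recurrence just described can be sketched as follows; the parameter names and dimensions are illustrative, not taken from any particular implementation.

```python
import numpy as np

def rnn_step(x, h_prev, W_xh, W_hh, b):
    """One step of a vanilla RNN: the same tanh layer is applied at
    every position, mixing the current input with the previous hidden
    state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b)

def rnn_forward(xs, h0, W_xh, W_hh, b):
    """Run the shared step over a whole sequence, carrying the hidden
    state forward so each output depends on all earlier inputs."""
    h, states = h0, []
    for x in xs:
        h = rnn_step(x, h, W_xh, W_hh, b)
        states.append(h)
    return states
```

Note that each hidden state must be computed before the next one, which is why training time grows with sequence length.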
RNNs work well for sequence learning, but they still have limitations. The architecture captures dependencies in only one direction of its learning target, it is not good at capturing long-term dependencies, and the vanishing gradient problem may occur during training. Moreover, because every hidden state is computed from the previous output, training time grows with the length of the sequence.
Long short-term memory network
Long Short-Term Memory networks (LSTMs) were proposed to solve the vanishing gradient problem of RNNs. The difference between an RNN and an LSTM is not fundamental: the LSTM uses different functions to compute the hidden state. Memory in an LSTM is held in a cell, which decides what to keep. Additional functions, called gates, let the LSTM add information to or remove information from the cell state. An LSTM has three gates for preserving and discarding information: the forget gate, the input gate, and the output gate.
The first layer is the forget gate, which decides what information will be thrown away: the new input and the previous hidden-state output are combined with a weight and passed through a sigmoid, producing values in [0, 1], where 1 means "completely keep this" and 0 means "get rid of this". The next gate is the input gate, which decides what new information will be stored in the cell state. It consists of two layers: a sigmoid layer that decides which values to update, and a tanh layer that creates a vector of new candidate values. The cell state is then updated by scaling the previous state with the forget gate's output and adding the new candidates. In the final step, the output gate decides what to output: a sigmoid layer selects which parts of the cell state to emit, and the cell state is passed through tanh and multiplied by the sigmoid layer's output.
Fig. 4 LSTM structure
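A minimal sketch of one LSTM step, with the three gates in the order described above. The stacked parameter layout (`W`, `U`, `b`) is an illustrative convention, not a fixed standard.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the parameters of the forget (f),
    input (i), candidate (g), and output (o) layers, in that order."""
    z = W @ x + U @ h_prev + b      # all four layers share this input
    n = len(h_prev)
    f = sigmoid(z[0:n])             # forget gate: what to discard
    i = sigmoid(z[n:2 * n])         # input gate: what to store
    g = np.tanh(z[2 * n:3 * n])     # candidate values
    o = sigmoid(z[3 * n:4 * n])     # output gate: what to emit
    c = f * c_prev + i * g          # scale old state, add candidates
    h = o * np.tanh(c)              # new hidden state
    return h, c
```

Because the cell state `c` is updated additively rather than repeatedly squashed, gradients flow through long sequences more easily than in a vanilla RNN.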
The LSTM solves the vanishing gradient problem of the RNN and can learn a better relationship between inputs, so its performance on long input sequences is stronger.
Problems in this area
Metric Abstraction: Many compositional systems abstract rhythm in
relation to an underlying grid, with metric-based units such as eighth
notes and triplets. Often this is further restricted to step sizes at
powers of two. Such abstraction is oblivious to many essential musical
devices, including, for example, rubato and swing as described in Sect.
1.1.1. We choose a temporal representation based on absolute time
intervals between events, rounded to 8 ms.
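The rounding just mentioned can be sketched in one line; the helper name is ours.

```python
def quantize_ms(t_seconds, step_ms=8):
    """Round an absolute event time (in seconds) to the nearest
    multiple of step_ms milliseconds, as in the temporal
    representation described above."""
    return round(t_seconds * 1000 / step_ms) * step_ms
```

Unlike a metric grid, this keeps expressive timing such as rubato and swing, losing at most a few milliseconds per event.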
No Dynamics: Nearly every compositional system represents notes
as ON or OFF. This binary representation ignores dynamics, which
constitute an essential aspect of how music is perceived. Even if
dynamic level were treated as a global parameter applied equally to
simultaneous notes, this would still defeat the ability of dynamics to
differentiate between voices, or to compensate for a dense accompaniment
(that is best played quietly) underneath a sparse melody. We allow each
note to have its own dynamic level.
Monophony: Some systems only generate monophonic sequences.
Admittedly, this is a natural starting point: the need to limit to
monophonic output is in this sense entirely understandable. This can
work very well for instruments such as voice and violin, where the
performer also has sophisticated control beyond quantized pitch and the
velocity of the note attack. The perceived quality of monophonic
sequences may be inextricably tied to these other dimensions that are
difficult to capture and usually absent from MIDI sequences.