Dilated Convolutional Model for Melody Extraction

Abstract—Melody extraction is a challenging task in music information retrieval that enables many downstream applications. In this paper we propose a simple dilated convolutional model for melody extraction. It takes variable-Q transforms as inputs. It first uses consecutive layers of convolution to capture local time-frequency patterns. Afterward, it relies on only a single layer of dilated convolution to capture global frequency patterns formed by the pitches and harmonics of active notes. This model is effective in that it achieves state-of-the-art performance on most datasets, for both general and vocal melody extraction. In addition, it achieves the best performance with the least training data.


I. INTRODUCTION
Melody extraction is an important, challenging task in music information retrieval. It underpins many other applications, such as query-by-humming, cover song identification, genre classification and singer identification [1]. With the rapid development of deep learning, recent years have seen great progress in applying deep learning to melody extraction. In the literature, the majority of existing studies have focused on vocal/singing melody extraction, because the singing voice has many unique features that can be exploited for vocal melody extraction, such as formants, vibrato and tremolo [2].
On the other hand, there are few studies on general melody extraction. In [14] a traditional signal processing-based approach was proposed for general melody extraction. This approach used the weighted sum of harmonics as a pitch salience function, from which pitch contours were constructed. Afterward, melody contours were selected from the pitch contours based on some criteria. Ref. [15] proposed a long short-term memory (LSTM) network for extracting melody and simultaneously detecting regions of melody using a harmonic sum loss. In [16] a fully convolutional structure was proposed for multi-pitch detection and melody extraction. This structure uses harmonic constant-Q transforms (HCQTs) as the input representation. Compared with CQTs, HCQTs have an additional dimension, namely, the harmonic dimension, which for each fundamental frequency f0 includes the energies at r·f0 for r ∈ {0.5, 1, 2, 3, 4, 5}. This structure was improved in [17] by arranging the output of each hidden layer into something similar to the HCQT. In [18] a CRNN was proposed for general melody extraction. It takes non-negative matrix factorization (NMF)-based features as inputs, and uses a CNN as the acoustic model and an RNN as the language model.
In this paper we propose a dilated convolutional model for melody extraction. Its input representation is the variable-Q transform (VQT). The model consists of two parts. In the first part, four convolution layers are used for modeling temporal and local frequency patterns. To cover the large time span that is essential for melody detection, the convolutions in this part gradually increase the dilation rate over the time axis. In the second part, a single convolution layer is devoted to modeling global frequency patterns arising from the interaction among the pitches and harmonics of notes active simultaneously. To this end, this convolution uses dilation and a large kernel size over the frequency axis so as to cover a large frequency range while at the same time controlling overfitting. To further curb potential overfitting due to the large kernel size, L2 regularization is applied to the kernel of this convolution. In extensive experiments under various settings, this model outperforms most existing models on most datasets, for both general and vocal melody extraction. In addition, it has better generalization capability in that its performance deteriorates the least when tested on music not well represented in the training data. Furthermore, it performs the best with the least training data.

II. PROPOSED MODEL

A. Input Representation
Our model takes VQTs as inputs. We prefer the VQT over the CQT because the VQT allows more flexible control of the bandwidths of individual filters [19]. This enables us to improve the time resolution at lower frequencies. We open-source our GPU implementation of the VQT, which can significantly benefit future research. The VQT is configured as follows. The sampling frequency fs is 44,100 Hz. The frequency resolution B is 60 bins per octave. The minimum frequency fmin is m2f(C1) × 2^(−2/B), where m2f(·) is a function that converts music notes to frequencies. The maximum frequency fmax is m2f(B9) × 2^(2/B). The bandwidth of filter k is configured as Ω_k = f_k/Q + γ, where f_k is the center frequency of this filter, γ is an offset that widens the filters at low frequencies, and Q is the quality factor expressed as Q = 1/(2^(1/B) − 2^(−1/B)). The hop size h is 256 samples. Except for Ω_k, the configuration of the VQT is the same as the configuration used in [16] for the CQT. Consequently, the VQT has 540 frequency bins in total. We further cut the VQT into snippets of 1200 frames and feed them into our neural network as mini examples. Thus, each mini example has size 1200 × 540.
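As a rough illustration, the frequency-range configuration above can be sketched in a few lines. The m2f helper below is a hypothetical stand-in for the paper's note-to-frequency function, assuming standard MIDI tuning (A4 = 440 Hz); the last line shows that one 1200-frame mini example covers about 7 s of audio.

```python
def m2f(note):
    """Hypothetical stand-in for m2f(.): note name (e.g. 'C1') -> frequency in Hz."""
    names = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
    pitch, octave = note[:-1], int(note[-1])
    midi = 12 * (octave + 1) + names.index(pitch)   # MIDI convention, C-1 = 0
    return 440.0 * 2 ** ((midi - 69) / 12)          # A4 = MIDI 69 = 440 Hz

B = 60                                  # bins per octave
f_min = m2f('C1') * 2 ** (-2 / B)       # minimum frequency, two bins below C1
f_max = m2f('B9') * 2 ** (2 / B)        # maximum frequency, two bins above B9
fs, hop = 44100, 256                    # sampling frequency and hop size
snippet_seconds = 1200 * hop / fs       # duration of one 1200-frame mini example
```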

B. Model for General Melody Extraction
Fig. 1 gives the structure of the proposed dilated convolutional model for general melody extraction. The legend for this figure is as follows.
• rectangle: a computational block.
• C@T × F above a block: the output of the block has size T × F × C, in order frames × frequency bins × channels.
• name under a block: the block is given the name name.
• k_t × k_f, d_t × d_f in a block: kernel size k_t × k_f, in order frames × frequency bins, and dilation rate d_t × d_f. When d_t or d_f is missing, it is equal to one.
• bn: batch normalization.
• dropout: dropout with probability 0.2.

The inputs to this model are VQT snippets of size 1200 × 540. Along the frequency dimension, we assume that only the first 360 frequency bins can be candidates for melody pitches. This range covers notes C1 to B6. Blocks local-0 to local-3 are four convolution layers used to capture local time-frequency patterns. The inputs to these blocks are properly padded to keep the time and frequency dimensions unchanged. local-0 uses kernel size 1 × 5, and the other three use kernel size 3 × 5. For melody extraction, it is necessary to consider a large time span. To this end, and inspired by [17], we use non-uniform time-dimension dilation rates of 2, 4 and 8 for local-1, local-2, and local-3, respectively. Consequently, after local-3 we achieve an overall receptive field of 29 × 17. This time span is the same or approximately the same as those used in [16], [17].
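The 29 × 17 receptive field quoted above can be checked with the standard recurrence for stacked stride-1 convolutions; a minimal sketch:

```python
def receptive_field(layers):
    """Receptive field of stacked stride-1 convolutions, given (kernel, dilation) pairs."""
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation   # each layer widens the field by (k-1)*d
    return rf

# Time axis: local-0 has kernel 1; local-1..3 have kernel 3 with dilations 2, 4, 8.
t_rf = receptive_field([(1, 1), (3, 2), (3, 4), (3, 8)])
# Frequency axis: all four layers use kernel 5 with no dilation.
f_rf = receptive_field([(5, 1)] * 4)
```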
In polyphonic music, notes that are active simultaneously can overlap in their pitches and harmonics. This means that, to single out the strongest note as the potential melody note, it is essential to aggregate information over a large frequency context. In the literature, various approaches have been applied for this purpose, most of them drawing inspiration from image semantic segmentation.
Our model uses a single layer of dilated convolution to capture global frequency patterns. In particular, block global employs a convolution with kernel size 1 × 97 and dilation rate 1 × 5. The resulting receptive field extends roughly 4 octaves below and 4 octaves above each pitch. Compared with the above existing approaches, ours is the simplest. The use of dilated convolution is feasible here because the receptive field over the frequency dimension is already 17 after local-3, so that even with dilation the frequency range can still be covered seamlessly.
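A small sketch of why the dilation leaves no gaps: the dilated kernel of global samples every 5th bin, but each sampled position already summarizes a 17-bin context from local-3, so adjacent taps overlap. (The numbers below simply restate the configuration given above.)

```python
k, d = 97, 5               # kernel size and dilation of `global` along frequency
prior_rf = 17              # frequency receptive field after local-3
span = (k - 1) * d + 1     # bins spanned by the dilated kernel
gap = d - 1                # bins skipped between adjacent kernel taps
seamless = gap < prior_rf  # True: each tap's incoming context covers the skipped bins
octaves = span / 60        # span in octaves at 60 bins per octave
```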
Melody extraction is sparsely labeled in that, at a given time point, only the melody pitch is known even if multiple pitches are active simultaneously. On the other hand, the kernel of the dilated convolution in global is still quite large: it has about 0.2 million parameters, accounting for about 91% of the total parameters of the model. For better generalization performance, we apply L2 regularization to the kernel of this convolution. The coefficient of the regularization is 10^(−4).
Before feeding the output of local-3 into global, we need to pad it manually along the frequency dimension: 240 bins are padded before frequency bin 0 and 80 bins after frequency bin 539. This is done in padding. The output of global has 360 frequency bins, all of which are candidates for the melody pitch. After global, in order to increase the nonlinearity of the fusion, we use a dense layer to project the number of feature maps to a smaller number. Afterward, in output we use a dense layer with one unit and sigmoid activation to get, for each pitch, the probability of it being the melody pitch.
In [20] we applied a similar model to multi-pitch estimation (MPE). The model proposed here differs from the model in [20] in the following aspects.
• Despite being related in that they both need to detect pitches, MPE and melody extraction differ in that the former is densely labeled, whereas the latter is only sparsely labeled. In MPE, for each time point the pitches of all the active notes are known.
• Given the above difference, for each layer of the proposed model we used a smaller number of feature maps, because there are not enough positive labels to support the large number of feature maps used in [20].
• To determine whether a pitch is the melody pitch, we need to consider a larger time span. Therefore, we used dilated convolution along the time dimension in local-1 to local-3.
• To further alleviate the problem of sparse labeling, we proposed to use L2 regularization in global.

C. Model for Vocal Melody Extraction
The proposed model can be used for vocal melody extraction without structural changes. We only need to modify the configuration of the VQT to reflect the fact that vocal pitches usually have a narrower frequency range than instrumental pitches. Specifically, we lower fmax to m2f(G9) × 2^(2/B) and keep all the other parameters unchanged. Consequently, the VQT now has 520 frequency bins. We limit the pitches of the vocal melody to be within the first 320 frequency bins, or equivalently, in the range of C1 to D#6.

D. Loss Function
We formulate melody detection as a multi-label classification problem and use binary cross entropy as the loss function. Specifically, for a pitch p the loss is calculated as ℓ(p) = −Pr_g(p) log Pr_d(p) − (1 − Pr_g(p)) log(1 − Pr_d(p)), where Pr_g(p) and Pr_d(p) are the ground-truth and the detected probability of pitch p, respectively. For each frame, the pitch with the maximum probability is selected as a candidate melody pitch. To decide whether a frame is voiced, we select the threshold that maximizes the voicing accuracy on the validation split as the voicing threshold. Candidate pitches with probabilities larger than the voicing threshold are then selected as melody pitches.
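The loss and the decoding rule above can be sketched as follows; bce, decode_frame, and the toy probabilities are illustrative names for this sketch, not part of the released code.

```python
import math

def bce(p_true, p_pred, eps=1e-7):
    """Binary cross entropy for a single pitch bin."""
    p_pred = min(max(p_pred, eps), 1 - eps)   # clip to avoid log(0)
    return -(p_true * math.log(p_pred) + (1 - p_true) * math.log(1 - p_pred))

def decode_frame(probs, voicing_threshold):
    """Pick the most probable pitch for a frame; None means the frame is unvoiced."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] > voicing_threshold else None
```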

III. EXPERIMENTS
In this section we evaluate the performance of the proposed model for general and vocal melody extraction. In both cases we use only the MedleyDB dataset [21] for training and validation, as it is the only dataset that is large-scale, of professional quality, sufficiently varied in genre, and accurately annotated [21]. We use overall accuracy as the sole performance measure. It is calculated with mir_eval [22]. We use the Adam optimizer [23] and set the learning rate to 10^(−4). The batch size is one. The checkpoint that yields the best validation performance is selected for testing. Training stops if the validation performance does not improve for 10 epochs. The code is available at https://shorturl.at/svUX8.
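The checkpoint-selection rule described above amounts to early stopping with a patience of 10 epochs; a minimal sketch (select_checkpoint is an illustrative name):

```python
def select_checkpoint(val_scores, patience=10):
    """Index of the best validation score, stopping after `patience`
    consecutive epochs without improvement (early stopping)."""
    best_idx, best = 0, float('-inf')
    for epoch, score in enumerate(val_scores):
        if score > best:
            best_idx, best = epoch, score        # new best checkpoint
        elif epoch - best_idx >= patience:
            break                                # patience exhausted, stop training
    return best_idx
```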

A. Performance for General Melody Extraction
For general melody extraction, we compare the proposed model with [11], [14], [16]-[18]. Ref. [14] is an approach based on signal processing, so it does not need a dataset for training. All the other models are deep learning-based approaches and used MedleyDB for training and validation. MedleyDB has 108 recordings that have melody. Ref. [16] used a partition of (71, 9, 27), which is in the order of training split, validation split, and test split. All the partitions have the same test split.

To compare different models, we must make sure that the same partition is used and that there is no intersection between any pair of splits in this partition. Under this principle, we decided to train the proposed model under two partitions, namely, (67, 15, 26) and (66, 15, 27). They are obtained by starting from partition (67, 15, 27) and removing the intersecting recording, respectively, from the test split and the training split. The datasets listed below are used for testing.
• ORCHSET: the Orchset dataset, which has 64 excerpts of 10 to 32 s [24].
• MDB-S: the MDB-melody-synth dataset [25], which has a total of 65 recordings, of which we select for testing those that also appear in the test split of MedleyDB. There are 13 such recordings under partition (67, 15, 26) and 14 under partition (66, 15, 27).
• WJD: a subset of the WJazzD dataset [26]. We use the same subset as the one used in [17]. This subset has 74 recordings.

Tables I(a) and I(b) compare the proposed model with the existing models under each partition, respectively. The performance measure is the overall accuracy in percentage. The highest result on each dataset is highlighted in bold. The results in Table I(a) were obtained in the following ways.
• Salamon [14]: melodies were extracted with the Essentia library.
• Bittner [16]: we retrained the model.
• Basaran [18] and Balhar [17]: we used the melodies extracted in [17].

In Table I(b), the results for Hsieh [11] were obtained by running the trained model accompanying [11]. We observe the following in the two tables.
• Under partition (67, 15, 26), the proposed model was the best on all the datasets except MDB-S, on which it was the runner-up after Basaran [18].
• Under the other partition, the proposed model and Hsieh [11] achieved state-of-the-art performance on four and two datasets, respectively.
• Among all the datasets, ORCHSET was the most challenging for all the models, with their performance plummeting on this dataset. The proposed model was more stable than the others in that its performance deteriorated the least on this dataset. Two factors contribute to this. First, the content of ORCHSET is not well represented in MedleyDB. Second, nearly 94% of the frames in ORCHSET are voiced, far higher than in MedleyDB.
• Among all the datasets, ADC04 and WJD were the easiest for all the models. ADC04 is tiny and does not have enough genre variety, whereas WJD is large-scale but consists only of jazz music.

B. Performance for Vocal Melody Extraction
To evaluate the performance of the proposed model for vocal melody extraction, we compare it with [1], [6], [11], [12]. We use only MedleyDB for training and validation. The partition is (35, 13, 12), the same as the one used in [11]. We have to retrain these competing models, because they were originally trained on different datasets and some even adopted dataset augmentation. Moreover, the iKala dataset [27] has been unavailable since 2017. The following datasets are used for testing.
• MDB: the test split of MedleyDB.
• ADC04: 12 excerpts of ADC04 with vocal melody.
• MIR-1K: the MIR-1K dataset.

Table II(a) compares the proposed model with the retrained models under partition (35, 13, 12). Except on ADC04, where it was second after Hsieh [11], the proposed model was the best on all the other datasets, with a particularly large margin on MIR-1K over all the other models.
From another perspective, Table II(b) compares the performance for vocal melody extraction of some existing models with that of the proposed model. In their original papers, all these existing models were trained on different datasets, but all were tested at least on the test split of MedleyDB under partition (35, 13, 12). Thus, we can compare them with the proposed model in terms of the test performance on MedleyDB. In this case we do not need to retrain the existing models; instead, we cite the results published in the literature. Table II(b) presents the results under this setting. The performance measure is the overall accuracy (OA) in percentage. Regarding the origins of the results in Table II(b), if a reference is cited after a result, the result is from that reference; otherwise, the result is from the original paper in which the model was proposed. Notably, the proposed model performed the best with the least training data.

IV. CONCLUSIONS
In this paper we proposed a dilated convolutional model for melody extraction. At the heart of this model is the use of dilated convolution to capture large-scale global frequency patterns. This model achieved exceptional performance on most of the datasets for both general and vocal melody extraction. It also achieved the best performance with the least training data. One aspect may hinder the adoption of this model, namely, the large amount of computation brought about by the dilated convolution because of its large kernel size. We are working to tackle this problem.