Source Number Estimation via Deep Salience Analysis:

March 2020



Source number estimation for single-channel recordings is an important yet relatively unexplored aspect of MIR. Most proposed approaches are derived directly from source localization methodologies and thus rely on the spatial information the recording may convey. In the field of MIR, this task mainly relies on the fact that the spectra of common musical sources in a song are largely uncorrelated. Choral singing, however, involves groups of people singing in harmony, which leads to higher correlation among the constituent signals. This project presents an approach that extracts the multiple fundamental frequency (F0) tracks from a given monaural recording, which then allows the number of sources present in the mixture to be estimated. The efficacy of this approach is demonstrated by comparing the results with a previously computed baseline using MFCCs.


The experiment described below consists of two distinct parts:

  1. The creation of the baseline using a traditional machine learning approach, using MFCCs as features

  2. A multi-F0 contour detection approach, using a DNN model pre-trained as part of [1].

We feed the resulting vectors into a generic neural network built with the open-source library Keras. The network architecture consists of three hidden dense layers with 20, 40, and 60 units respectively. The model is trained for 100 epochs with a batch size of 25.
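As a rough illustration of the baseline network, here is a minimal Keras sketch. Only the hidden-layer sizes, the number of epochs, and the batch size come from the description above. The activation functions, optimizer, loss, the framing as a classification problem, and the names `build_model`, `n_features`, and `n_classes` are assumptions made for the example, not details of the actual experiment.

```python
# Hidden-layer sizes, epochs, and batch size are taken from the text;
# everything else below is an illustrative assumption.
HIDDEN_UNITS = (20, 40, 60)
EPOCHS = 100
BATCH_SIZE = 25


def build_model(n_features, n_classes):
    """Build a small dense classifier over per-clip feature vectors."""
    # Imported here so the constants above can be inspected without TensorFlow.
    from tensorflow import keras

    model = keras.Sequential()
    model.add(keras.Input(shape=(n_features,)))
    for units in HIDDEN_UNITS:
        # ReLU is an assumed activation; the text does not specify one.
        model.add(keras.layers.Dense(units, activation="relu"))
    # Assumed output layer: one class per candidate source count.
    model.add(keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(
        optimizer="adam",  # assumed optimizer
        loss="sparse_categorical_crossentropy",  # assumes integer class labels
        metrics=["accuracy"],
    )
    return model


# Hypothetical usage, with X_train of shape (n_clips, n_features):
# model = build_model(n_features=X_train.shape[1], n_classes=4)
# model.fit(X_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE,
#           validation_data=(X_val, y_val))
```

Passing a validation set to `fit` is what would produce the train and validation curves needed to check how well such a model generalizes.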

II. Multi-F0 contour detection:

In the second stage of our experiment, we attempt to improve upon our initial baseline by framing our approach as a multi-F0 detection problem. One way to find the total number of sources present in a given recording is to look at the signal's multi-F0 contours, although extracting them is an extremely challenging task in itself. We use the model pre-trained as part of [1], which was trained on the MedleyDB dataset. This is not ideal, but as a preliminary step it gives a good indication of how well the approach might work.

The input representation of the model is the harmonic constant-Q transform (HCQT). CQT representations are well suited to musical audio because their frequency bins are spaced geometrically, with the same number of bins in every octave. Unlike the regular CQT, which works in a two-dimensional space [t, f], the HCQT adds a third dimension h to its representation space, so that each entry measures the h-th harmonic of frequency f at time t.
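To make the extra harmonic dimension concrete, the sketch below computes the center frequencies of an HCQT-style grid: for each harmonic h, bin k sits at h * f_min * 2^(k / B), where B is the number of bins per octave. The function name `hcqt_frequencies` and the default parameter values are illustrative assumptions, not settings confirmed by this write-up.

```python
import math


def hcqt_frequencies(f_min=32.7, bins_per_octave=60, n_octaves=6,
                     harmonics=(0.5, 1, 2, 3, 4, 5)):
    """Return {h: [center frequency in Hz of each bin]} for every harmonic h.

    All default values are assumptions chosen for illustration.
    """
    n_bins = bins_per_octave * n_octaves
    return {
        h: [h * f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]
        for h in harmonics
    }


freqs = hcqt_frequencies()

# Bin k of the h = 2 layer sits exactly one octave above bin k of the h = 1 layer,
# so a note's fundamental and its second harmonic share the same bin index.
k = 100
assert math.isclose(freqs[2][k], 2 * freqs[1][k])
```

Because each layer is just a CQT whose minimum frequency is scaled by h, the energy of a note's harmonics lines up along the h axis at a single (t, f) position.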

As a last step, this representation is fed to the network, which then outputs the pitch salience track for the mix, that is, the track for each F0 contour detected over the analyzed audio segments. In an ideal scenario, we would simply count the detected F0s to find the number of singers in the recording. However, the task is more difficult than that, which is why we propose three different ways of predicting the number of sources from the predicted pitch salience tracks:

  mean(v)          (1)
  max(v)           (2)
  mostcommon(v)    (3)

where v is the vector holding the number of F0s detected in each frame. For example, v = [1, 2, 4, 1] means that frame 1 has one F0, frame 2 has two, and so on.

  • In (1) we take the mean of the number of simultaneous F0s found across all frames in the mix.

  • In (2) we simply take the maximum number of concurrent F0s.

  • In (3) we take the most common number of concurrent F0s across all frames.


Taking the example above, each of the methods we just described would return the following predictions: mean(v) = 2, max(v) = 4, mostcommon(v) = 1.
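The three estimators can be sketched in a few lines of Python. The helper `count_f0s_per_frame`, its `threshold` value, the strict local-maximum peak picking, and the toy salience matrix are assumptions made for illustration; they are not the post-processing actually used in the experiment. Rounding the mean to the nearest integer is also an assumption.

```python
from collections import Counter


def count_f0s_per_frame(salience, threshold=0.3):
    """Count salience peaks at or above `threshold` in each frame.

    `salience` is a list of frames, each a list of per-frequency-bin values.
    The threshold value is an assumed, illustrative setting.
    """
    counts = []
    for frame in salience:
        n_peaks = 0
        for k, value in enumerate(frame):
            left = frame[k - 1] if k > 0 else float("-inf")
            right = frame[k + 1] if k < len(frame) - 1 else float("-inf")
            # A peak is a strict local maximum that reaches the threshold.
            if value >= threshold and value > left and value > right:
                n_peaks += 1
        counts.append(n_peaks)
    return counts


def estimate_mean(v):
    """Method (1): mean number of concurrent F0s, rounded to an integer."""
    return round(sum(v) / len(v))


def estimate_max(v):
    """Method (2): maximum number of concurrent F0s."""
    return max(v)


def estimate_most_common(v):
    """Method (3): most frequent number of concurrent F0s."""
    return Counter(v).most_common(1)[0][0]


# Toy salience matrix (4 frames x 7 bins), built to reproduce v = [1, 2, 4, 1].
toy_salience = [
    [0.0, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.8, 0.0, 0.6, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.7, 0.0, 0.9, 0.0, 0.4],
    [0.0, 0.0, 0.0, 0.2, 0.0, 0.5, 0.0],
]
v = count_f0s_per_frame(toy_salience)
print(v, estimate_mean(v), estimate_max(v), estimate_most_common(v))
```

The text does not say whether frames with no detected F0 should be excluded before computing the statistics; this sketch keeps them, which would pull the mean and the most common value down on recordings with long silences.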


For the MFCCs, we first trained a neural network over 100 epochs. The two plots below show the evolution of the loss and accuracy over the training process. While the fit on the training set shows promising loss and accuracy curves, with accuracy reaching 85%, the validation set does not perform nearly as well and plateaus around 60%.

The bar plot below shows the accuracy of the four approaches taken. Deep salience is clearly a more effective way of predicting the number of sources present in a given choral recording, with an accuracy of 70%, against the MFCC approach, which barely reaches 40%.

In progress:

  • The DNN model used in this work was trained on the MedleyDB dataset, which mainly contains Pop/Rock songs. My next goal is to train a model specifically on choral music. I believe that doing so could drastically improve the accuracy results reported above.


[1]    Bittner, R. M., McFee, B., Salamon, J., Li, P., & Bello, J. (2017). Deep salience representations for F0 estimation in polyphonic music. In Z. Duan, D. Turnbull, X. Hu, & S. J. Cunningham (Eds.), Proceedings of the 18th International Society for Music Information Retrieval Conference, ISMIR 2017 (pp. 63-70).

© 2020 Darius Petermann