
Source Separation for Monaural Recordings:

January 2020

background:


Although the task of source separation remains broadly the same regardless of the musical setting, the nature of the sources and the relations between them (e.g. speech vs. music) pose different challenges and consequently call for different separation methodologies. My master's thesis addresses a specific use case of source separation in which all the mixture constituents are highly correlated: choral music.

Unlike the typical instrument settings of musical source separation tasks (guitar, bass, vocals, drums), where the constituents' spectra are largely uncorrelated and each source leaves a distinct, unique profile in the mixture spectrogram, SATB recordings are much more challenging: they involve near-identical sources singing in harmony, which leads to heavily overlapping spectral components. While existing deep-learning-based approaches perform very well on the common sources, they tend to deliver much weaker results on SATB recordings.

SATB mixture spectrogram, with overlapping harmonics

Speech mixture spectrogram, with uncorrelated spectra
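
To make the comparison concrete, here is a minimal sketch of how spectrograms like the two above can be computed with librosa; the audio file paths are hypothetical placeholders.

```python
# Minimal sketch: log-frequency spectrograms of an SATB mix vs. a speech mix.
# The file paths below are hypothetical placeholders.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_log_spectrogram(path, title, ax):
    y, sr = librosa.load(path, sr=22050, mono=True)
    S_db = librosa.amplitude_to_db(
        np.abs(librosa.stft(y, n_fft=2048, hop_length=512)), ref=np.max)
    librosa.display.specshow(S_db, sr=sr, hop_length=512,
                             x_axis="time", y_axis="log", ax=ax)
    ax.set_title(title)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_log_spectrogram("satb_mix.wav", "SATB mixture (overlapping harmonics)", axes[0])
plot_log_spectrogram("speech_mix.wav", "Speech mixture (uncorrelated spectra)", axes[1])
plt.tight_layout()
plt.show()
```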

state-of-the-art:


The idea behind this research is to address this issue by adapting current state-of-the-art DNN models specifically to this task.

The adapted architectures, along with their metrics on the MUSDB dataset, are briefly described below:

dataset:


The experiment is carried out using the Choral Singing Dataset along with two other proprietary datasets. Because this subfield of source separation is relatively unexplored, the available datasets are extremely limited; since waveform-based models must learn their own front-end representation from raw audio, they typically need more data, so the spectrogram-based models are expected to perform better than the waveform-based ones (i.e. Wave-U-Net).

experiment with sota models:

We first limit this experiment to a maximum of one singer per voice part (i.e. a mixture of four singers). This allows us to apply data augmentation by creating artificial mixes from the various singers available for each song; a minimal sketch of this idea is given below. As we are in the midst of the evaluation process, the final results will be added in the near future. In the meantime, some preliminary separation results obtained with both the U-Net and Wave-U-Net architectures are shown after the sketch.
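A minimal sketch of this augmentation, assuming each song's stems are stored as same-length mono arrays grouped by voice part (all names below are illustrative):

```python
import itertools
import random
import numpy as np

def make_artificial_mixes(stems, n_mixes=10, seed=0):
    """stems: dict mapping each part ("soprano", "alto", "tenor", "bass") to a
    list of mono numpy arrays: takes of the same song, all the same length."""
    parts = ["soprano", "alto", "tenor", "bass"]
    # Every combination of one take per part is a valid artificial SATB mix.
    combos = list(itertools.product(*(range(len(stems[p])) for p in parts)))
    random.Random(seed).shuffle(combos)
    mixes = []
    for combo in combos[:n_mixes]:
        sources = np.stack([stems[p][i] for p, i in zip(parts, combo)])
        mix = sources.sum(axis=0)
        gain = 1.0 / max(1e-8, float(np.abs(mix).max()))  # avoid clipping
        mixes.append((gain * mix, gain * sources))  # stems kept as targets
    return mixes
```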

U-Net (Spectrogram): the SATB mixture spectrogram is separated into Soprano, Alto, Tenor, and Bass estimates.

Wave-U-Net (Waveform): the SATB mixture waveform is separated into Soprano, Alto, Tenor, and Bass estimates.
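
For illustration, here is a minimal PyTorch sketch of the spectrogram U-Net idea; the layer sizes and kernel choices are illustrative, not the exact configuration used in the thesis. An encoder-decoder with skip connections predicts a soft mask that is applied to the mixture magnitude, one model per voice part.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    # Strided convolution: halves the frequency and time axes.
    return nn.Sequential(nn.Conv2d(cin, cout, 5, stride=2, padding=2),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

def up(cin, cout, final=False):
    # Transposed convolution: doubles both axes back.
    layers = [nn.ConvTranspose2d(cin, cout, 5, stride=2,
                                 padding=2, output_padding=1)]
    if not final:
        layers += [nn.BatchNorm2d(cout), nn.ReLU()]
    return nn.Sequential(*layers)

class SmallUNet(nn.Module):
    """Estimates one voice part's magnitude from the mixture magnitude."""
    def __init__(self):
        super().__init__()
        self.d1, self.d2, self.d3 = down(1, 16), down(16, 32), down(32, 64)
        self.u1, self.u2 = up(64, 32), up(64, 16)
        self.u3 = up(32, 1, final=True)

    def forward(self, mag):  # mag: (batch, 1, freq, time), dims divisible by 8
        s1 = self.d1(mag)
        s2 = self.d2(s1)
        x = self.u1(self.d3(s2))
        x = self.u2(torch.cat([x, s2], dim=1))  # skip connection
        x = self.u3(torch.cat([x, s1], dim=1))  # skip connection
        return torch.sigmoid(x) * mag           # soft-masked mixture magnitude
```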

results:

Included below are the resulting SDR metrics for each of the sources described above. We observe that these results are not too far from the ones originally computed on the MUSDB dataset (described earlier). 

Wave-U-Net SDR results (boxplot)

Spectrogram U-Net SDR results (boxplot)
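
For reference, a minimal sketch of how per-source SDR values like these can be computed with mir_eval's BSS Eval implementation; the random arrays below are placeholders standing in for real stems and model outputs.

```python
import numpy as np
import mir_eval

# Placeholder data: four reference stems (S, A, T, B) and noisy "estimates".
rng = np.random.default_rng(0)
reference_sources = rng.standard_normal((4, 44100))
estimated_sources = reference_sources + 0.1 * rng.standard_normal((4, 44100))

# bss_eval_sources returns per-source SDR/SIR/SAR and the best permutation.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources)
for part, value in zip(["soprano", "alto", "tenor", "bass"], sdr):
    print(f"{part}: {value:.2f} dB SDR")
```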

in progress:

  • Open-Unmix adaptation (a time-frequency approach with bidirectional LSTMs)

  • U-Net conditioned on the sources' F0 tracks (see part 2; a minimal conditioning sketch follows this list)

  • More comparative results
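
As a preview of the F0-conditioning item, here is a minimal sketch of one plausible conditioning scheme (the actual design is detailed in part 2 and may differ): render a voice's F0 track as a one-hot time-frequency map and feed it to the network as an extra input channel alongside the mixture magnitude.

```python
import numpy as np

def f0_to_channel(f0_hz, n_bins, sr=22050, n_fft=2048):
    """f0_hz: (n_frames,) F0 track in Hz, with 0 marking unvoiced frames.
    Returns an (n_bins, n_frames) map with a 1 at each frame's F0 bin."""
    chan = np.zeros((n_bins, len(f0_hz)), dtype=np.float32)
    bin_hz = sr / n_fft  # width of one STFT frequency bin
    for t, f0 in enumerate(f0_hz):
        if f0 > 0:
            b = int(round(f0 / bin_hz))
            if b < n_bins:
                chan[b, t] = 1.0
    return chan

# Usage idea: stack the mixture magnitude and the F0 map as two input channels,
# e.g. np.stack([mix_mag, f0_to_channel(f0_track, mix_mag.shape[0])])
```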
