Source Separation for Monaural Recordings:
Although the task of source separation remains broadly the same regardless of the musical setting, the nature of the sources and the relations between them (e.g. speech vs. music) entail different challenges and, consequently, call for different separation methodologies. My master's thesis addresses a specific use case of source separation in which all the mixture constituents are highly correlated: choral music.
Unlike the instrument settings typically involved in musical source separation tasks (guitar, bass, vocals, drums), where the constituents' spectra are largely uncorrelated and each source presents a distinct profile within the mixture spectrogram, SATB recordings are much more challenging: they feature near-identical sources singing in harmony, which leads to overlapping spectral components. While existing deep-learning-based approaches perform very well on common source types, they deliver far less promising results on SATB recordings.
Figure: SATB mixture spectrogram, with overlapping harmonics.
Figure: Speech mixture spectrogram, with uncorrelated spectra.
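To see why choral spectra overlap so heavily, consider a toy sketch (the frequencies and helper names below are chosen purely for illustration): two voices singing a perfect fifth apart share every harmonic whose frequencies fall in a simple 3:2 ratio.

```python
def harmonics(f0, n=10):
    """First n harmonic frequencies (Hz) of a tone with fundamental f0."""
    return [f0 * k for k in range(1, n + 1)]

def shared_partials(f_a, f_b, n=10, tol=1.0):
    """Harmonic frequencies the two tones share, within tol Hz."""
    return sorted(
        h_a
        for h_a in harmonics(f_a, n)
        for h_b in harmonics(f_b, n)
        if abs(h_a - h_b) < tol
    )

# A4 (440 Hz) against a voice a perfect fifth below (~293.3 Hz, D4):
# every 2nd harmonic of the upper note coincides with every 3rd of the lower.
print(shared_partials(440.0, 440.0 * 2.0 / 3.0))  # → [880.0, 1760.0, 2640.0]
```

Those coinciding partials occupy the same spectrogram bins, so no magnitude-based model can attribute their energy to one voice from a single frame alone.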
The idea behind this research is to address this issue by adapting the current state-of-the-art DNN models specifically to this task.
The adapted architectures, along with their metrics on MUSDB, are briefly described below:
The experiment uses the Choral Singing Dataset along with two other proprietary datasets. Because this subfield of source separation is relatively unexplored, the available datasets are extremely limited; hence the spectrogram-based models are expected to outperform the waveform-based ones (e.g. Wave-U-Net).
Experiment with state-of-the-art models:
We first limit this experiment to at most one singer per voice part (i.e. a mixture of four singers). This allows us to augment the datasets by creating artificial mixes from the various singers available for each song. As we are still in the midst of the evaluation process, the final results will be added here in the near future. In the meantime, here are some preliminary results obtained with both the U-Net and Wave-U-Net architectures:
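The augmentation step above can be sketched as follows (a minimal illustration, not the thesis's exact pipeline; the function and variable names are hypothetical):

```python
import random
import numpy as np

def make_mixes(stems, n_mixes=4, seed=0):
    """Create artificial SATB mixtures by picking one random singer
    (take) per voice part, then averaging the picked waveforms.
    stems: dict mapping part ('S', 'A', 'T', 'B') to a list of aligned
    per-singer waveforms (1-D numpy arrays of equal length, same song)."""
    rng = random.Random(seed)
    mixes = []
    for _ in range(n_mixes):
        # one singer per part → a four-voice mixture
        picked = {part: rng.choice(takes) for part, takes in stems.items()}
        mix = sum(picked.values()) / len(picked)  # simple average mix
        mixes.append((mix, picked))
    return mixes
```

With, say, four recorded singers per part, this yields up to 4^4 = 256 distinct four-voice mixtures per song from the same underlying material.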
U-Net (Spectrogram):
Wave-U-Net (Waveform):
Included below are the resulting SDR metrics for each of the sources described above. We observe that these results are not far from the ones originally reported on the MUSDB dataset (described earlier).
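For reference, a minimal sketch of how an SDR figure can be computed. Note this is the plain energy-ratio definition; the BSS Eval toolkit typically used for MUSDB scores additionally allows a short distortion filter on the reference before comparison, so published numbers may differ slightly.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-10):
    """Signal-to-Distortion Ratio in dB: energy of the true source over
    the energy of the estimation error (higher is better)."""
    err = reference - estimate
    return 10.0 * np.log10(
        (np.sum(reference ** 2) + eps) / (np.sum(err ** 2) + eps)
    )

# A clean 220 Hz sine vs. the same sine with mild additive noise:
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
ref = np.sin(2 * np.pi * 220.0 * t)
noisy = ref + 0.1 * np.random.default_rng(0).standard_normal(t.size)
print(round(sdr_db(ref, noisy), 1))  # roughly 17 dB for this noise level
```

A perfect estimate drives the error energy toward zero, so its SDR is effectively unbounded; separation systems on real music typically land in the single digits to low teens.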
Adaptation of Open-Unmix (time-frequency approach with bidirectional LSTMs)
U-Net conditioned on the sources' F0 tracks (see part 2)
More comparative results
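One plausible way to wire in the F0 conditioning mentioned above, sketched here as an assumption rather than the thesis's confirmed method: encode the target voice's F0 track as a binary time-frequency mask and stack it as an extra input channel for the U-Net (the function name and parameters are illustrative).

```python
import numpy as np

def add_f0_channel(mag_spec, f0_hz, sr=22050, n_fft=1024):
    """Stack a binary F0 mask on top of a magnitude spectrogram.
    mag_spec: (freq_bins, frames) magnitude spectrogram of the mixture.
    f0_hz:    (frames,) F0 track of the target voice, 0.0 = unvoiced.
    Returns a (2, freq_bins, frames) array: [spectrogram, F0 mask]."""
    n_bins, n_frames = mag_spec.shape
    f0_mask = np.zeros_like(mag_spec)
    # map each F0 value to its nearest STFT frequency bin
    bin_idx = np.rint(f0_hz * n_fft / sr).astype(int)
    for frame, b in enumerate(bin_idx):
        if 0 < b < n_bins:  # skip unvoiced (b == 0) frames
            f0_mask[b, frame] = 1.0
    return np.stack([mag_spec, f0_mask])
```

The appeal of this kind of conditioning is that it hands the network exactly the information that magnitude alone cannot disambiguate: which of the overlapping harmonic stacks belongs to the voice being extracted.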