Source Separation for Monoaural Recordings (Part 2): Conditioned U-Net and the FiLM layer.
In the first part of this post series, we briefly described the problems and challenges entailed by source separation tasks involving highly correlated sources (such as those found in SATB recordings). We assessed how well some state-of-the-art DNN models, including Wave-U-Net and U-Net, performed on choral recordings. Even though the predicted sources showed promising results, we observed that some harmonics were assigned to the wrong source.
In this post, we go over how we can address this limitation by integrating a control mechanism into the initial U-Net architecture, allowing the layer parameters to be conditioned (i.e., learned) from a given control input. As a starting point, we use the implementation found here, in which the authors condition a single U-Net architecture on a one-hot-encoded input vector representing the source(s) to isolate. The input vector is embedded to obtain the parameters that control Feature-wise Linear Modulation (FiLM) layers. Below, we briefly describe this architecture.
conditioned u-net and the FiLM layer:
The spectrogram-based U-Net adaptation described earlier trains a separate model for each of the mixture's sources (in our case, 4 in total). Depending on the nature of the separation task, this can easily lead to scaling issues. The conditioned U-Net (C-U-Net) architecture, described in , aims to address this limitation by introducing a control mechanism, governed by external data, over a single U-Net instance. This architecture does not diverge much from the initial U-Net; instead of multiple instances of the model, each specialized in isolating a specific source, C-U-Net inserts feature-wise linear modulation (FiLM) layers across the architecture. These apply linear transformations to intermediate features, conditioned on the source to separate. The FiLM layers preserve the shape of the intermediate feature input while modifying the underlying mapping of the filters themselves.
The FiLM layer can be described as an affine transformation performed on the feature map input X, as described above. A set of betas and gammas is learned from an input condition vector Z. These values are then passed to the FiLM layers, which modulate the feature maps at different levels of the architecture. In , the authors propose two variants of the concept: one implements a single transformation per layer in the contracting path, resulting in 6 FiLM layers for a total of 12 scalars (6 γi and 6 βi). The other learns a set of scalars for every single feature map, resulting in 2016 scalars.
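As a minimal NumPy sketch (with illustrative layer sizes, not the exact ones from the paper), a FiLM layer obtains one (γ, β) pair per feature map from a linear embedding of the condition vector and applies the affine transform channel-wise:

```python
import numpy as np

rng = np.random.default_rng(0)

def film(x, gamma, beta):
    # x: (channels, freq, time) feature map; one (gamma, beta) per channel.
    # The transform conserves the shape of x while re-mapping its features.
    return gamma[:, None, None] * x + beta[:, None, None]

# Condition: one-hot vector selecting the source to isolate
# (4 sources, e.g. SATB). The linear embeddings below are
# illustrative stand-ins for the full condition generator.
z = np.eye(4)[2]                        # select the third source
W_gamma = rng.standard_normal((16, 4))  # 16 feature maps in this layer
W_beta = rng.standard_normal((16, 4))
gamma, beta = W_gamma @ z, W_beta @ z

x = rng.standard_normal((16, 128, 64))  # intermediate U-Net features
y = film(x, gamma, beta)
print(y.shape)  # (16, 128, 64): same shape, modulated mapping
```

With γ = 1 and β = 0 the layer is the identity, which makes it a low-risk insertion point anywhere in the architecture.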
target source fundamental frequency conditioning:
Our proposed approach consists of four variants of the C-U-Net architecture, each differing slightly, first in the way it embeds the control input data, and second in the way the resulting parameters are injected into the main network. In this subsection, we go over the details of these variants.
Pre-Encoding and Post-Decoding Conditioning
The control model used in our proposed architecture embeds the one-hot-encoded CQT F0 representation for a given time-step into a set of transforms whose shape is either partly or fully identical to that of the input spectrogram. This is achieved by modeling the condition vectors as 1-D data with multiple feature channels. The condition vectors are then fed into a convolutional neural network (CNN) with a kernel of size 10, which captures contextual information from the adjacent time-steps. As a result, all input channels of the initial convolution contribute to all feature maps output by the first convolutional layer. Finally, a dense layer provides the specific conditioning to be applied to the input spectrogram, taking into account the contextual information previously captured by the CNN filters. The figure below shows the condition generator architecture in greater detail.
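A rough sketch of this condition generator, assuming illustrative sizes (60 CQT bins, 128 time-steps, 8 CNN filters) rather than the exact ones used in our model: a 1-D convolution with kernel size 10 runs over the time axis of the one-hot F0 representation, and a dense head maps the resulting features to a (γ, β) pair per time-step.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernels):
    # x: (in_ch, T); kernels: (out_ch, in_ch, K); "valid" 1-D convolution
    # over time, so every input channel feeds every output feature map.
    out_ch, in_ch, K = kernels.shape
    T = x.shape[1] - K + 1
    out = np.zeros((out_ch, T))
    for t in range(T):
        out[:, t] = np.tensordot(kernels, x[:, t:t + K], axes=([1, 2], [0, 1]))
    return out

# One-hot CQT F0 condition: bins x time-steps (sizes are illustrative)
f0 = np.zeros((60, 128))
f0[24, :] = 1.0                             # a constant F0 trajectory

kernels = rng.standard_normal((8, 60, 10))  # kernel size 10 over time
h = conv1d(f0, kernels)                     # context from adjacent steps
W = rng.standard_normal((2, 8))             # dense head -> (gamma, beta)
gb = W @ h                                  # one pair per time-step
print(h.shape, gb.shape)  # (8, 119) (2, 119)
```

A real implementation would pad the convolution to keep the 128 time-steps aligned with the spectrogram; the "valid" form above just keeps the sketch short.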
We propose three variants of the architecture described above. The first one applies a single transform to all frequency bins at a given time-step, resulting in a set of scalars of shape γ(128,1) and β(128,1) at the output of the control model. The second variant applies a unique affine transform to each individual frequency bin at every input time-step. Consequently, the number of affine transforms is multiplied by the number of input frequency bins, resulting in the following control model output shapes: γ(128,512) and β(128,512). Finally, our third approach follows the same pattern as the second variant, but applies its transforms at both the input and the output levels. This doubles the number of scalars codified by the control input model, resulting in two sets of γ(128,512) and β(128,512).
We refer to these three approaches as "C-U-Net Local", "C-U-Net Global" and "C-U-Net Global x2", respectively. The figure above depicts these three approaches in more detail.
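The three output shapes can be sketched with plain NumPy broadcasting, taking 128 time-steps and 512 frequency bins as in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
spec = rng.random((128, 512))            # input spectrogram: time x freq

# "Local": one transform per time-step, shared across all frequency bins
g_loc, b_loc = rng.random((128, 1)), rng.random((128, 1))
out_local = g_loc * spec + b_loc         # broadcasts over the 512 bins

# "Global": a unique transform per (time-step, frequency-bin) pair
g_glo, b_glo = rng.random((128, 512)), rng.random((128, 512))
out_global = g_glo * spec + b_glo

# "Global x2" would apply a second (gamma, beta) set of the same
# (128, 512) shape to the network's output as well.
print(out_local.shape, out_global.shape)  # (128, 512) (128, 512)
```

In all cases the modulated spectrogram keeps the shape expected by the U-Net encoder; only the granularity of the conditioning changes.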
As the temporal relations between the external control input data and the input spectrogram are closely intertwined, it is crucial to apply these affine transformations while the receptive field of the input is still intact. Hence, the FiLM layer is applied prior to the encoding path. The figure below shows the overall structure of the proposed conditioning architecture.
Preliminary Results on Objective Metrics (SDR, SIR, SAR):