It is often claimed that audio-visual methods perform better than audio-only methods in blind sound source separation. We are going to test how far audio-only methods can get with a simple experiment: coding a U-Net that performs source separation conditioned on identity embeddings.
To compare the performance of audio-only and audio-visual methods we use the Acappella dataset, a dataset for audio-visual singing voice separation. Pretrained models (Y-Net and its variants) and evaluation metrics are available. Y-Net is an audio-visual sound separation network that uses a U-Net as its backbone. This U-Net is conditioned on either face landmarks processed by a graph CNN or raw video processed by a spatio-temporal CNN. The core idea is that the network can use lip motion to guide the separation, which leads to strong results.
On the other side, we train exactly the same backbone U-Net, but condition it on identity embeddings. These embeddings summarize the voice identity, so the U-Net should learn to pick out the voice whose characteristics the embedding describes and carry out the separation.
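One common way to condition a U-Net on an embedding vector is a FiLM-style per-channel affine transform applied to the feature maps, with the scale and shift predicted from the embedding. The exact conditioning mechanism used in the experiment may differ; the snippet below is just a minimal NumPy sketch of the idea, and all names (`film_condition`, `W_gamma`, `W_beta`) are illustrative.

```python
import numpy as np

def film_condition(features, embedding, W_gamma, W_beta):
    """FiLM-style conditioning: scale and shift each feature channel
    with parameters predicted linearly from the identity embedding."""
    gamma = W_gamma @ embedding  # (C,) per-channel scale
    beta = W_beta @ embedding    # (C,) per-channel shift
    # features has shape (C, H, W); broadcast the affine transform per channel
    return gamma[:, None, None] * features + beta[:, None, None]

rng = np.random.default_rng(0)
emb = rng.standard_normal(256)             # identity embedding (e.g. from Resemblyzer)
feats = rng.standard_normal((64, 16, 16))  # U-Net bottleneck feature maps
W_g = rng.standard_normal((64, 256)) * 0.01
W_b = rng.standard_normal((64, 256)) * 0.01

out = film_condition(feats, emb, W_g, W_b)
print(out.shape)  # (64, 16, 16): same shape as the input feature maps
```

In a real network `W_gamma` and `W_beta` would be learned layers, and the conditioning could be applied at several depths of the U-Net rather than only at the bottleneck.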
These embeddings are extracted with Resemblyzer, Resemble.ai's implementation of the Generalized End-to-End Loss for Speaker Verification.
The code is available on GitHub.