[R] VisualVoice Uses Facial Appearance to Boost SOTA in Speech Separation


A team from the University of Texas at Austin and Facebook AI Research has introduced an approach that takes as input a video of a target speaker recorded amid overlapping voices or other sounds and generates an isolated soundtrack of that speaker.

VisualVoice is a novel multi-task learning framework that jointly learns audio-visual speech separation together with cross-modal speaker embeddings, effectively using a person’s facial appearance to predict their vocal sounds.
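To make the idea of joint learning concrete, here is a minimal sketch of how a multi-task objective of this kind could be assembled. It is not the authors' implementation: the function names, the L1 separation term, the triplet-style cross-modal term, the margin, and the weighting factor are all assumptions chosen for illustration. The point is only that one loss rewards accurate separation while a second, weighted loss pulls a speaker's voice embedding toward their own face embedding and away from someone else's.

```python
import math


def l1_loss(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.5):
    """Hypothetical cross-modal term: the anchor (voice) should be closer
    to the positive (matching face) than to the negative (other face)."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)


def joint_loss(pred_audio, target_audio, voice_emb, face_emb,
               other_face_emb, weight=0.1):
    """Assumed multi-task objective: separation loss plus a weighted
    cross-modal embedding loss. `weight` is an illustrative value."""
    separation = l1_loss(pred_audio, target_audio)
    cross_modal = triplet_loss(voice_emb, face_emb, other_face_emb)
    return separation + weight * cross_modal


if __name__ == "__main__":
    # Toy values: perfect separation, voice embedding matches the right face.
    print(joint_loss([0.0, 1.0], [0.0, 1.0],
                     [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

In a real system both terms would be computed on network outputs (for example, spectrograms and learned embeddings) and minimized together, so that each task can inform the other.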

