[R] VisualVoice Uses Facial Appearance to Boost SOTA in Speech Separation

A team from the University of Texas at Austin and Facebook AI Research recently introduced an approach that takes as input a video of a target speaker in an environment with overlapping voices or sounds and generates an isolated soundtrack of that speaker's speech.

VisualVoice is a novel multi-task learning framework that jointly learns audio-visual speech separation and cross-modal speaker embeddings, using a person's facial appearance to predict the qualities of their voice.
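To make the multi-task idea concrete, here is a minimal PyTorch sketch of what such a joint objective might look like: a mask-based separation loss combined with a cross-modal triplet loss that pulls a speaker's face embedding toward their own voice embedding and away from a mismatched one. The class name, tensor shapes, loss terms, and weights below are illustrative assumptions, not the paper's actual implementation.

```python
import torch.nn.functional as F
from torch import nn


class MultiTaskSeparationLoss(nn.Module):
    """Hypothetical joint objective: separation quality + face-voice matching.

    An illustrative sketch, not VisualVoice's released code; the mask loss,
    triplet formulation, and weighting are assumptions.
    """

    def __init__(self, lambda_xmodal: float = 0.1, margin: float = 0.5):
        super().__init__()
        self.lambda_xmodal = lambda_xmodal  # weight on the cross-modal term
        self.margin = margin                # triplet-loss margin

    def forward(self, pred_mask, target_mask, face_emb, voice_emb, neg_voice_emb):
        # Separation term: distance between the predicted time-frequency
        # mask and the ideal mask for the target speaker's spectrogram.
        sep_loss = F.l1_loss(pred_mask, target_mask)

        # Cross-modal term: keep a face embedding closer to the matching
        # voice embedding than to a voice embedding of a different speaker.
        anchor = F.normalize(face_emb, dim=-1)
        positive = F.normalize(voice_emb, dim=-1)
        negative = F.normalize(neg_voice_emb, dim=-1)
        xmodal_loss = F.triplet_margin_loss(anchor, positive, negative,
                                            margin=self.margin)

        return sep_loss + self.lambda_xmodal * xmodal_loss
```

The intuition behind optimizing both terms together is that the face embedding is trained to be predictive of the voice, so at test time the visible face can help the model decide which voice in the mixture to keep, even for speakers unseen during training.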

A summary is available here, and the full paper is on arXiv.
