Recently, a team from the University of Texas at Austin and Facebook AI Research introduced an approach that takes as input a video of a target speaker in an environment with overlapping voices or sounds and generates an isolated soundtrack of that speaker.
VisualVoice is a novel multi-task learning framework that jointly learns audio-visual speech separation together with cross-modal speaker embeddings, effectively using a person’s facial appearance to predict their vocal sounds.
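The core idea is a multi-task objective: one term for separating the target voice from the audio mixture, and a cross-modal term that ties a face embedding to the embedding of the voice it produced. Below is a minimal sketch of that idea, not the authors' released code; all module shapes, names, and the loss weighting are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAudioVisualSeparator(nn.Module):
    """Illustrative stand-in for a face-conditioned speech separator (not VisualVoice itself)."""
    def __init__(self, emb_dim=128, spec_bins=257, frames=100):
        super().__init__()
        # Encodes a cropped face image into a speaker embedding (toy linear stand-in for a CNN).
        self.face_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, emb_dim))
        # Encodes the separated voice spectrogram into the same embedding space.
        self.voice_encoder = nn.Sequential(nn.Flatten(), nn.Linear(spec_bins * frames, emb_dim))
        # Predicts a soft mask over the mixture spectrogram, conditioned on the face embedding.
        self.mask_head = nn.Sequential(
            nn.Linear(spec_bins + emb_dim, 256), nn.ReLU(),
            nn.Linear(256, spec_bins), nn.Sigmoid()
        )

    def forward(self, mixture_spec, face_img):
        # mixture_spec: (batch, frames, spec_bins); face_img: (batch, 3, 64, 64)
        face_emb = F.normalize(self.face_encoder(face_img), dim=-1)
        cond = face_emb.unsqueeze(1).expand(-1, mixture_spec.size(1), -1)
        mask = self.mask_head(torch.cat([mixture_spec, cond], dim=-1))
        separated = mask * mixture_spec
        voice_emb = F.normalize(self.voice_encoder(separated), dim=-1)
        return separated, face_emb, voice_emb

def multitask_loss(separated, target_spec, face_emb, voice_emb, other_voice_emb, margin=0.5):
    # Separation term: reconstruct the clean target spectrogram.
    sep_loss = F.l1_loss(separated, target_spec)
    # Cross-modal term: a face should lie closer to its own voice embedding
    # than to another speaker's voice embedding.
    xmodal_loss = F.triplet_margin_loss(face_emb, voice_emb, other_voice_emb, margin=margin)
    return sep_loss + 0.1 * xmodal_loss  # 0.1 weighting is an arbitrary choice here
```

Training both terms jointly is what lets the face embedding act as a prior on how the speaker should sound, which is the intuition the VisualVoice work builds on.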