Audio Speech Enhancement by Computer Vision Analytics
Summary of the technology
This was filed due to a meeting that Shmuel had
Project ID: 10-2018-4602
Description of the technology
Deep Learning, Neural Networks, Speech processing
Current development stage
TRL7: System prototype demonstration
A novel audio-visual approach that enhances a speaker's voice by exploiting its correlation with the speaker's mouth and face movements.
Figure 1: Illustration of our encoder-decoder model architecture. A sequence of 5 video frames centered on the mouth region is fed into a convolutional neural network, creating a video encoding. The corresponding spectrogram of the noisy speech is encoded in a similar fashion into an audio encoding. A single shared embedding is obtained by concatenating the video and audio encodings and is fed into 3 consecutive fully-connected layers. Finally, a spectrogram of the enhanced speech is decoded using an audio decoder.
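The data flow in Figure 1 can be sketched in a few lines of NumPy. This is only a rough illustration under stated assumptions, not the actual model: the frame resolution, spectrogram size, and layer widths are hypothetical, the weights are random and untrained, and each convolutional encoder and the audio decoder is replaced by a single linear layer to keep the example short. Only the 5-frame input, the concatenation into one shared embedding, and the 3 fully-connected layers come from the caption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes: only N_FRAMES comes from the caption; the rest are hypothetical.
N_FRAMES = 5                 # video frames per input window (from Figure 1)
FRAME_H, FRAME_W = 32, 32    # assumed mouth-crop resolution
N_FREQ, N_TIME = 80, 20      # assumed noisy-spectrogram size
ENC_DIM = 64                 # assumed size of each encoding
HIDDEN = 128                 # assumed width of the fully-connected layers


def dense(in_dim, out_dim):
    """Return random (weights, bias) for one linear layer (untrained)."""
    return rng.normal(0.0, 0.01, (in_dim, out_dim)), np.zeros(out_dim)


def relu(x):
    return np.maximum(x, 0.0)


# Linear stand-ins for the convolutional encoders and the audio decoder.
video_enc = dense(N_FRAMES * FRAME_H * FRAME_W, ENC_DIM)
audio_enc = dense(N_FREQ * N_TIME, ENC_DIM)
fc_layers = [dense(2 * ENC_DIM, HIDDEN), dense(HIDDEN, HIDDEN), dense(HIDDEN, HIDDEN)]
audio_dec = dense(HIDDEN, N_FREQ * N_TIME)


def enhance(frames, noisy_spec):
    """Map (video frames, noisy spectrogram) to an enhanced spectrogram."""
    video_code = relu(frames.reshape(-1) @ video_enc[0] + video_enc[1])
    audio_code = relu(noisy_spec.reshape(-1) @ audio_enc[0] + audio_enc[1])
    # Shared embedding: concatenation of the video and audio encodings.
    shared = np.concatenate([video_code, audio_code])
    # Three consecutive fully-connected layers.
    for weights, bias in fc_layers:
        shared = relu(shared @ weights + bias)
    # Decode back to a spectrogram of the same size as the input.
    decoded = shared @ audio_dec[0] + audio_dec[1]
    return decoded.reshape(N_FREQ, N_TIME)


frames = rng.random((N_FRAMES, FRAME_H, FRAME_W))   # placeholder mouth crops
noisy_spec = rng.random((N_FREQ, N_TIME))           # placeholder noisy spectrogram
enhanced = enhance(frames, noisy_spec)
print(enhanced.shape)  # (80, 20)
```

The output spectrogram has the same shape as the noisy input, so the model acts as a spectrogram-to-spectrogram mapping conditioned on the video frames. A real implementation would use convolutional layers for both encoders and the decoder, and would be trained on pairs of noisy and clean speech.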