YouTube has been captioning videos for eight years now. Captions are available on almost one billion videos. The feature was introduced in 2009 to give hearing-impaired viewers the best possible viewing experience.
The captioning was awful at the beginning. Google didn’t ditch it, though, and kept working to improve it. The effort has paid off, and YouTube captions now contain fewer errors.
Google now wants to take captions to the next level, so that people with hearing problems can capture every bit of a video through text. That’s why it has introduced sound effect recognition.
YouTube’s Automatic Captions Can Recognize Sound Effects
Google has collaborated with accessibility teams to build an AI algorithm that can understand the sound effects played in a video. You don’t have to do anything manually.
The company used machine learning techniques to make the feature a reality. Although the feature is live now, it can’t yet recognize every non-speech sound.
For now, YouTube can recognize and label only three classes of sound: laughter, applause, and music. When such sounds are played, you see [laughter], [applause], or [music] in the caption.
Google chose those three sound categories because they are easy for the machine learning algorithm to capture and understand.
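To make the labeling step concrete, here is a minimal sketch in Python of how detected sound events could be turned into caption entries. It is an illustration only, not YouTube's implementation: the `SoundEvent` type, the `CAPTION_LABELS` table, and the `to_captions` function are all hypothetical names, and the only thing taken from the article is the set of three labels.

```python
from dataclasses import dataclass

# Hypothetical lookup table: the three classes named in the article,
# mapped to the caption text shown to the viewer.
CAPTION_LABELS = {
    "laughter": "[laughter]",
    "applause": "[applause]",
    "music": "[music]",
}


@dataclass
class SoundEvent:
    """A detected sound with its time span in seconds (assumed structure)."""
    kind: str
    start: float
    end: float


def to_captions(events):
    """Return (start, end, label) tuples for supported sounds, in time order.

    Sounds outside the three supported classes are skipped, which mirrors
    the article's point that not every non-speech sound gets a caption.
    """
    captions = []
    for event in sorted(events, key=lambda e: e.start):
        label = CAPTION_LABELS.get(event.kind.lower())
        if label is None:
            continue  # unsupported sound class, no caption emitted
        captions.append((event.start, event.end, label))
    return captions
```

Calling `to_captions` with a laugh, a doorbell, and a music segment would return entries for the laugh and the music only, because the doorbell is not in the table.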
You may not use captions yourself. But according to YouTube’s internal statistics, videos with captions are played at least 15 million times a day. And about 360 million people worldwide have hearing problems. They all stand to benefit from the new feature YouTube has rolled out.
According to Google engineers, developing sound effect recognition is tougher than developing speech recognition. Advances in machine learning over the last couple of years have helped sound effect recognition along.
During development, Google engineers trained the algorithm on thousands of hours of video to get the best results. The company also developed a deep neural network for the task. They faced challenges with events that happen at the same time (such as laughter and applause).
As stated earlier, the feature is still in its infancy. Once it matures, we may see captions for more sound effects. It may even recognize the source of a sound (for example, a ringing alarm, a ringing phone, or a doorbell).
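One common way to handle sounds that overlap, such as laughter and applause together, is to score each class independently instead of picking a single winner. The Python sketch below shows that idea under stated assumptions: the article does not say how Google solved the problem, and the `CLASSES` tuple, the `active_sounds` function, and the 0.5 threshold are hypothetical choices made for illustration.

```python
import math

# Assumed class order for the model's raw output scores (logits).
CLASSES = ("laughter", "applause", "music")


def sigmoid(x):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def active_sounds(logits, threshold=0.5):
    """Return every class whose independent probability meets the threshold.

    Each class gets its own sigmoid, so two or more sounds can be
    reported for the same audio frame.
    """
    return [
        name
        for name, score in zip(CLASSES, logits)
        if sigmoid(score) >= threshold
    ]
```

With raw scores of `[2.0, 1.5, -3.0]`, both laughter and applause clear the threshold while music does not, so a frame can carry two labels at once.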
Do you want to check out the feature now? Play the video below and hit the CC button.