Many of us around the world now have a smart speaker in our homes, but there are also many people who do not fully understand how they work. Here, we take a closer look at the technology behind these new virtual assistants.
There are now many different types of voice recognition software embedded in the smart speakers of various manufacturers – Amazon has its Echo range of smart speakers (powered by Alexa), Apple provides Siri through iPhones and the HomePod, Google offers its Home series, and Microsoft products come with its virtual assistant, Cortana.
So how do these small and unobtrusive speakers allow us to order pizza, tell us the time or weather, change the lighting in our homes and order products from Amazon, all without lifting a finger?
The key is in their voice recognition software.
Let’s look at the Amazon Echo as an example. Once turned on, the Echo listens to all speech, waiting for what is known as a ‘wake word’ before it springs into action. Once it ‘hears’ this word, it begins to record your speech before sending it over the Internet. The speech file is sent to the Alexa Voice Service (AVS) in Amazon’s cloud.
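The wake-word flow can be illustrated with a heavily simplified sketch. Real devices do this on streaming audio with an on-device acoustic model; here, purely for illustration, we assume the speech has already been transcribed to text, and the function name and wake word are our own choices:

```python
WAKE_WORD = "alexa"

def extract_command(transcript, wake_word=WAKE_WORD):
    """Return the speech that follows the wake word, or None if it is absent.

    A toy stand-in for wake-word detection: real smart speakers scan raw
    audio continuously and only start streaming to the cloud once the
    wake word is detected.
    """
    words = transcript.lower().split()
    if wake_word in words:
        idx = words.index(wake_word)
        command = " ".join(words[idx + 1:])
        return command or None
    return None

print(extract_command("Hey Alexa what time is it"))  # → "what time is it"
print(extract_command("no wake word here"))          # → None
```

Everything before the wake word is discarded, which mirrors why the Echo only sends audio to the cloud after it ‘hears’ its name.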
AVS is a voice recognition service that deciphers what you are asking for and sends a response back to your Echo smart speaker.
The voice recognition services of all these smart speakers use algorithms to become more familiar with your choice of words and individual speech patterns. Users are also able to provide feedback on the accuracy of the responses that Alexa provides through AVS.
When you buy an Alexa-enabled product, you are usually asked to perform ‘voice training’ while setting up your smart speaker: by reading 25 key commands to your device, you help AVS learn your speech patterns.
Effective speech recognition by a computer is actually a very complex process, especially given the huge variety of different speech patterns. The basis, though, lies in recognising small units of sound known as ‘phones’, which build into ‘phonemes’ that we quickly identify as individual words.
Alexa uses automatic speech recognition (ASR) and natural language understanding (NLU), which allow your Echo smart speaker to recognise these phonemes as words and phrases and respond quickly to your request.
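To make the phoneme-to-word step concrete, here is a toy sketch using a tiny hand-made pronunciation dictionary with ARPAbet-style phoneme labels. The lexicon and the greedy longest-match decoder are our own simplifications – real ASR systems use statistical acoustic and language models rather than exact lookup:

```python
# Toy pronunciation lexicon: phoneme sequences mapped to words.
# (Hypothetical entries for illustration only.)
LEXICON = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def decode(phonemes):
    """Greedily match the longest known run of phonemes into words."""
    words, i = [], 0
    while i < len(phonemes):
        for j in range(len(phonemes), i, -1):
            word = LEXICON.get(tuple(phonemes[i:j]))
            if word:
                words.append(word)
                i = j
                break
        else:
            i += 1  # skip a phoneme no dictionary entry starts with
    return words

print(decode(["HH", "EH", "L", "OW", "W", "ER", "L", "D"]))
# → ['hello', 'world']
```

The hard part that this sketch hides is exactly what ASR solves: deciding which phonemes were actually spoken, amid noise and across the huge variety of accents and speech patterns.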
Some people are of course concerned about the security of using smart speakers, especially given that users could potentially be victims of voice hacking. This involves the recording or mimicking of a user’s voice and then hijacking their accounts.
Researchers in this field have noted that many automatic speaker verification (ASV) systems are not always able to detect whether speech has been previously recorded – a sign that a command may not be genuine. In recent research, a team based in the US and China developed a system that can differentiate between genuine speech and a voice recording, which should help flag potential voice hacking and stop it from occurring.
It certainly looks as though smart speakers and the technology behind them are here to stay, and the chances are the voice recognition software behind them will continue to improve, potentially making these unobtrusive devices part of our everyday life for a long time to come.
Top image: Apple and Amazon smart speaker. (CC BY-SA 4.0) (CC BY-SA 3.0)